tyronechen / genomenlp

https://genomenlp.readthedocs.io/en/latest/
MIT License

Panic in Rust Code When Running tokenise_bio.py on GRCh38.p13 FASTA, Successful on Smaller FASTA File #5

Open schultzjack opened 1 year ago

schultzjack commented 1 year ago

Issue Description: While attempting to reproduce the findings in your preprint, I'm encountering an error when running the tokenise_bio.py script on large (3GB or larger) FASTA files, including GCF_000001405.39_GRCh38.p13: it fails with a panic in the Rust code. The script runs successfully on (much) smaller FASTA files (200KB). I have tried both the original .fna file extension and .fasta, but neither gets past this error.

The error log indicates a panic in the Rust code of the Hugging Face Tokenizers library (tokenizers-lib): thread panicked at 'called `Result::unwrap()` on an `Err` value: Internal', tokenizers-lib/src/models/unigram/trainer.rs:212:53.

To Reproduce: OS: EC2 instance; Conda venv with Python 3.9

Command Used for Error:

RUST_BACKTRACE=full tokenise_bio -i /data/ncbi_dataset/GCF_000001405.39_GRCh38.p13_genomic.fasta -t '/data/generated/ncbi_tokenisers/tokeniser_39_GRCh38.json'

Output & error message: https://gist.github.com/stepwise-ai-dev/f23a79faaedd006bf51d486259440dd5

Steps to Reproduce: Run the tokenise_bio.py script with GCF_000001405.39_GRCh38.p13_genomic.fna as input

Expected output: I expected output similar to a previous successful execution of tokenise_bio.py on a smaller (200KB) FASTA file. Successful output: https://gist.github.com/stepwise-ai-dev/bbd782f3d09afaca6219b2c3a176bb42

Questions:

  1. Is this a known issue with the code, or one specific to certain system specifications?
  2. Are there any additional specific requirements or configurations needed for processing larger genome file sizes?
  3. Are there any additional data pre-processing steps required between downloading the NCBI data and using it as input for tokenise_bio.py?

Any guidance on resolving this issue would be greatly appreciated.

Thank you so much!

tyronechen commented 1 year ago

Hmm, I can confirm that I replicated your issue:

Sat Aug 12 21:33:26 AEST 2023
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Internal', tokenizers-lib/src/models/unigram/trainer.rs:212:53
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
COMMAND LINE ARGUMENTS FOR REPRODUCIBILITY:

     /██████/mambaforge-pypy3/envs/foo/bin/tokenise_bio -i GCF_000001405.39_GRCh38.p13_genomic.fna -t foo.json

Traceback (most recent call last):
  File "/██████/mambaforge-pypy3/envs/foo/bin/tokenise_bio", line 11, in <module>
    sys.exit(main())
  File "/██████/mambaforge-pypy3/envs/foo/lib/python3.9/site-packages/genomenlp/tokenise_bio.py", line 68, in main
    tokeniser.train_from_iterator(
  File "/██████/mambaforge-pypy3/envs/foo/lib/python3.9/site-packages/tokenizers/implementations/sentencepiece_unigram.py", line 142, in train_from_iterator
    self._tokenizer.train_from_iterator(
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Internal
Sat Aug 12 21:36:13 AEST 2023

To test whether this is an issue with the unigram trainer itself, I attempted to replicate it with the original SentencePiece implementation of the underlying library, and the log output implies that there is a sentence length limit (where a sentence corresponds to a single FASTA read).

conda create -n something sentencepiece
conda activate something

# run this in the directory containing the fasta file; it generates one file per record, so use a fresh directory containing only the fasta
awk '/^>/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' GCF_000001405.39_GRCh38.p13_genomic.fna
# sanitise the generated file names (replace spaces, drop commas)
for f in *\ *; do mv "$f" "${f// /_}"; done
for f in *fa; do mv "$f" "${f//,/}"; done
# strip headers and blank lines, then flatten each record to a single upper-case line
for i in *fa; do sed -i -e 's|>.*||' -e '/^$/d' "$i"; done
for i in *fa; do tr -d '\n' < "$i" | tr '[:lower:]' '[:upper:]' > "${i/fa/fna}"; done
# train a unigram model on all per-record files
spm_train --input=$(echo *.fna | sed 's| |,|g') --model_prefix=m --vocab_size=32000
...
trainer_interface.cc(408) LOG(INFO) Loaded all 75 sentences
trainer_interface.cc(415) LOG(INFO) Skipped 564 too long sentences.
...
# where each sentence is one fasta read
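
As an aside, the skipped-sentence count above points at spm_train's maximum sentence length (4192 bytes by default, if I recall the flag correctly). Below is a hedged sketch of probing that limit in isolation via the SentencePiece Python API; the max_sentence_length value is an arbitrary illustration, not a tested recommendation, and raising it also raises memory use.

# Hypothetical probe (not part of genomenlp): retrain with a larger
# max_sentence_length and check whether the "Skipped ... too long sentences"
# count drops. 16 MB is an arbitrary illustrative value.
import glob
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input=",".join(sorted(glob.glob("*.fna"))),  # the per-record files made above
    model_prefix="m_long",
    vocab_size=32000,
    max_sentence_length=16 * 1024 * 1024,
)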

To test whether this is due to sentence length, I'm segmenting chromosome 1 by itself with 128GB of memory. It is unknown whether this will error; it has not completed yet. I'll continue to investigate this sentence/read length limitation. It looks like others have run into similar memory issues with total corpus size, but since ours isn't that large in comparison, I'm not sure this is the same case.

infile_path="NC_000001.11_HOMO_SAPIENS_CHROMOSOME_1_GRCH38.P13_PRIMARY_ASSEMBLY.fna"
tokenise_bio -i ${infile_path} -t foo.json

The only quick workaround I can think of for now is dividing your reads into blocks of e.g. 2048 bp, which other users have also suggested, though I'm not sure how much impact this will have on preserving long-range contextual information. In any case, I've pushed this as v2.7.1, available on conda now; alternatively you can pull the latest tokenise_bio.py and inspect the new -b and -c options.
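
For reference, here is a minimal sketch of what block-based training looks like at the library level, written directly against the Hugging Face tokenizers and screed APIs rather than the actual tokenise_bio implementation; the 2048 bp block size simply mirrors the suggestion above and the output file name is a placeholder.

# Sketch only (not the tokenise_bio code): train a SentencePiece-style
# unigram tokeniser on fixed-length blocks of each FASTA record.
import screed
from tokenizers import SentencePieceUnigramTokenizer

def fasta_blocks(path, block_size=2048):
    """Yield upper-cased, fixed-length slices of every record in a FASTA file."""
    with screed.open(path) as handle:
        for record in handle:
            seq = record.sequence.upper()
            for i in range(0, len(seq), block_size):
                yield seq[i:i + block_size]

tokeniser = SentencePieceUnigramTokenizer()
tokeniser.train_from_iterator(
    fasta_blocks("GCF_000001405.39_GRCh38.p13_genomic.fna"), vocab_size=32000
)
tokeniser.save("tokeniser_blocks.json")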

schultzjack commented 1 year ago

Thank you so much for linking to that helpful information and for uploading a workaround so promptly.

I just tried running v2.7.1 and it gave me a new error in the _gzip_iterator function. In v2.7.1 that function attempts to manipulate the read object itself, which leads to an error. I've tested an approach that extracts the sequence from the read object and then applies the desired transformations (such as case changes or breaking into chunks) directly to the sequence string. This aligns better with standard string handling in Python, leverages screed's native support, resolves the error, and preserves the function's intended behavior.

import screed  # FASTA/FASTQ parsing; already used by tokenise_bio

def _gzip_iterator(infile_paths, break_size: int=None, case: str=None):
    """Yield each read's sequence, optionally case-normalised and split into fixed-size blocks."""
    for path in infile_paths:
        with screed.open(path) as infile:
            for read in infile:
                sequence = read.sequence
                if case == "upper":
                    sequence = sequence.upper()
                elif case == "lower":
                    sequence = sequence.lower()

                if break_size is None:
                    yield sequence
                else:
                    for i in range(0, len(sequence), break_size):
                        yield sequence[i:i+break_size]

If this looks like a promising fix for the read issue in the _gzip_iterator function, I'd be happy to create a pull request!

I've tested running tokenise_bio.py using this fix and it runs without the read errors, but it still gives me the same Rust-related error as before, regardless of the given break size (2048, 1024, 512, 256) or reduced vocab size (32k, 16k, 8k, 4k, 2k, 1k). It will still run on much smaller FASTAs (200KB) without producing the Rust-related error.

One last thing to note: using "lower" for the case-changing transformation seems to lead to a distinct error that does not appear with the other two options (upper or none), though I haven't yet figured out why.

Many thanks!

tyronechen commented 1 year ago

> I just tried running v2.7.1 and it gave me a new error in the _gzip_iterator function. In v2.7.1 that function attempts to manipulate the read object itself, which leads to an error. I've tested an approach that extracts the sequence from the read object and then applies the desired transformations (such as case changes or breaking into chunks) directly to the sequence string. This aligns better with standard string handling in Python, leverages screed's native support, resolves the error, and preserves the function's intended behavior.

Ah yes, I thought I had used the correct functions for the read object and str, but it looks like I didn't test it thoroughly enough.

> If this looks like a promising fix for the read issue in the _gzip_iterator function, I'd be happy to create a pull request!

It does look promising; yes, please feel free to create a pull request (I don't have a dev branch, so you can create it directly on master, and I'll retest and merge it in).

> I've tested running tokenise_bio.py using this fix and it runs without the read errors, but it still gives me the same Rust-related error as before, regardless of the given break size (2048, 1024, 512, 256) or reduced vocab size (32k, 16k, 8k, 4k, 2k, 1k). It will still run on much smaller FASTAs (200KB) without producing the Rust-related error.

Hmm yes, the root cause of the error is probably the data size (from what we both found so far, and from what others have implied). The short-term workaround would be subdividing the data, but there's probably a better way around it which I'll need to brainstorm.

> One last thing to note: using "lower" for the case-changing transformation seems to lead to a distinct error that does not appear with the other two options (upper or none), though I haven't yet figured out why.

In that case, for the short term we'll note that lower case doesn't work until we find and fix this later. (I can add this to the docs and args help text after I merge your pull request.)

From my earlier comment: To test whether this is due to sentence length, I'm segmenting chromosome 1 by itself with 128GB of memory.

Btw, this exceeded 128GB of memory on chr1 alone and crashed, unfortunately. I'll need to systematically test a range of sequence lengths to get a better feel for the algorithm's limitations, since interestingly they don't mention any limit in their paper or on GitHub (or if they did, I may have missed it).

tyronechen commented 1 year ago

The _gzip_iterator pull request #7 has been tested with all option combinations and works (or errors correctly). Thanks!

Continuing to investigate solutions for the long-read issue. (Another piece of information supports the "cannot handle long reads" hypothesis.)

As the next step, I may split the sequences into multiple chunks, pool the token weights at the end, and see whether that gives a realistic result. It might be an effective workaround.
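
A rough sketch of how that pooling could work, assuming the Hugging Face unigram JSON layout where the vocabulary is stored as [token, score] pairs under model.vocab; the chunk_*.json file names are placeholders.

# Sketch only: pool unigram token weights from several chunk-trained
# tokeniser JSON files by averaging each token's score across the files
# it appears in. Assumes the HF unigram layout: model["vocab"] is a list
# of [token, score] pairs.
import json
from collections import defaultdict

def load_vocab(path):
    with open(path) as handle:
        return dict(json.load(handle)["model"]["vocab"])

def pool_vocabs(paths):
    scores = defaultdict(list)
    for path in paths:
        for token, score in load_vocab(path).items():
            scores[token].append(score)
    # average score per token; tokens seen in more chunks are averaged over more values
    return {token: sum(vals) / len(vals) for token, vals in scores.items()}

pooled = pool_vocabs(["chunk_0.json", "chunk_1.json", "chunk_2.json"])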

tyronechen commented 1 year ago

seq length

Now testing a pooling operation where I break long seqs into shorter ones and compare token intersection and weight variance across the different blocks.

seq quantity

Then I will repeat this for a case where I split multiple seqs into subsets.
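
A sketch of the comparison along the same lines, scoring two chunk-trained tokenisers by token-set overlap (Jaccard) and by the spread of weights for shared tokens; the same JSON layout assumption applies, and the file names are placeholders.

# Sketch only: compare two tokeniser JSON files by token intersection
# and per-token weight variance.
import json
from statistics import pvariance

def load_vocab(path):
    with open(path) as handle:
        return dict(json.load(handle)["model"]["vocab"])

a = load_vocab("blocks_512.json")
b = load_vocab("blocks_2048.json")
shared = a.keys() & b.keys()
jaccard = len(shared) / len(a.keys() | b.keys())
variances = [pvariance([a[token], b[token]]) for token in shared]
print(f"Jaccard overlap: {jaccard:.3f}")
print(f"mean weight variance over shared tokens: {sum(variances) / len(variances):.4f}")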

tyronechen commented 1 year ago

Added compare_empirical_tokens.py in v2.8.0, which compares multiple tokenised dataset JSON files.

You can use this to examine your own datasets if you are interested, e.g. to compare the resulting token sets and weights across varying slice lengths or sequence counts.

Results for a sweep of segmented sequence lengths from 2**9 to 2**13, compared to the full chromosome, will be out shortly for E. coli K-12.

tyronechen commented 1 year ago

Hi @schultzjack

Tested the read splitting on Haemophilus influenzae: I split the 1.8 Mbp whole genome into contigs ranging from 2**9 to 2**20 bp and ran empirical tokenisation.

I'm compiling detailed information on the above now and will add a section to the documentation under tokenisation when complete.

tyronechen commented 1 year ago

The documentation now demonstrates a case of handling long sequences.