some sequences are missing in pyfastx.Fasta object

lmdu / pyfastx

a python package for fast random access to sequences from plain and gzipped FASTA/Q files

https://pyfastx.readthedocs.io

MIT License


some sequences are missing in pyfastx.Fasta object #41

Open dawnmy opened 2 years ago

dawnmy commented 2 years ago

I loaded a FASTA file containing 4542 sequences with an average length of 2.5 kb, but only 4539 sequences were in the pyfastx.Fasta object.

import pyfastx
fa = pyfastx.Fasta('assembly.fasta')
fa['contig_4540']  # raises KeyError

In addition, I could access a sequence, e.g. fa['contig_999'], the first time, but when I tried to access it again I got a KeyError.

I am using pyfastx 0.8.4 with Python 3.7.
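
A count mismatch like the one described above can be narrowed down by comparing the header lines in the file with what the index reports. The following is a minimal sketch, not a confirmed diagnosis of this issue: the helper names (fasta_header_names, missing_names, duplicated_names, check_index) are hypothetical, and it assumes an uncompressed file and that pyfastx uses the first whitespace-delimited token of each header as the sequence name.

```python
from collections import Counter
from typing import Container, Iterable, List


def fasta_header_names(path: str) -> List[str]:
    """Return the first whitespace-delimited token of every '>' header line."""
    names = []
    with open(path, "r") as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                names.append(fields[0] if fields else "")
    return names


def missing_names(expected: Iterable[str], index: Container[str]) -> List[str]:
    """List the names in `expected` that `index` does not report as present."""
    return [name for name in expected if name not in index]


def duplicated_names(names: Iterable[str]) -> List[str]:
    """List names that occur more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name, count in counts.items() if count > 1]


def check_index(path: str) -> None:
    """Print a comparison between the raw headers and the pyfastx index."""
    import pyfastx  # imported lazily so the helpers above work without it

    fa = pyfastx.Fasta(path)
    names = fasta_header_names(path)
    print("headers in file:   ", len(names))
    print("records in index:  ", len(fa))
    print("duplicated headers:", duplicated_names(names))
    print("missing from index:", missing_names(names, fa))
```

Calling check_index('assembly.fasta') would show whether the three missing records correspond to specific header names. Duplicated names are worth ruling out first, since a name-keyed lookup can return only one record per name.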

lmdu commented 2 years ago

Thank you for reporting this issue. I will check that. A new version will be released soon.

floccinauc commented 1 year ago

Any updates on this? I'm getting the same error. I'm loading a large FASTA file (~59M entries), and for some entries, whether accessed by string key or by integer index, I get a "key does not exist" error. Reloading the file fixes the problem for those keys but shifts it to others. I'm using pyfastx 1.1.0.

lmdu commented 1 year ago

Thanks. Could you provide your code and an HTTPS link to your data?

floccinauc commented 1 year ago

I'm using the unzipped version of this file: https://stringdb-downloads.org/download/protein.sequences.v12.0.fa.gz. As for my code, the simple snippet below does not seem to reproduce the error:

import pyfastx
from tqdm import tqdm

FILEPATH = "/dccstor/bmfmbio/datasets/STRING/all/protein.sequences.v12.0.fa"
loaded_fasta = pyfastx.Fasta(FILEPATH)
for idx in tqdm(range(int(5e7))):
    a = loaded_fasta[idx]

Maybe it has to do with multiple workers accessing the same fasta file? I'm afraid I cannot post the actual code I'm using at this point.
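
Whether shared access is the cause here is unconfirmed. One way to test the hypothesis is to make sure every worker process opens its own pyfastx.Fasta object instead of reusing one created in the parent. The sketch below is one possible approach under that assumption: get_handle and fetch are hypothetical helper names, the cache is keyed by process ID so a forked child reopens the file on first use, and it assumes the index was already built by opening the file once before the workers start. It does not address threads sharing a handle inside a single process.

```python
import os
from typing import Any, Callable, Dict, Tuple

# Cache of open handles, keyed by (process ID, file path). A forked worker
# inherits this dict, but its own PID is different, so it opens a new handle.
_HANDLES: Dict[Tuple[int, str], Any] = {}


def get_handle(path: str, opener: Callable[[str], Any]) -> Any:
    """Return a handle for `path` that was opened by the current process."""
    key = (os.getpid(), path)
    if key not in _HANDLES:
        _HANDLES[key] = opener(path)
    return _HANDLES[key]


def fetch(path: str, idx: int) -> str:
    """Fetch one sequence by integer index using a per-process handle."""
    import pyfastx  # imported lazily so get_handle works without it

    fa = get_handle(path, pyfastx.Fasta)
    return fa[idx].seq
```

If the missing-key errors disappear when every worker calls fetch(FILEPATH, idx) rather than indexing a shared loaded_fasta object, that would point toward the multi-worker explanation.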