Closed by Benjamin-Lee 5 years ago
This is expected behavior, chosen to mimic the samtools faidx behavior of
splitting deflines on whitespace. You can specify that the full defline,
rather than the name stored in the .fai index, should be used as the key
by passing the read_long_names argument:
>>> genes = Fasta('10-1000bp-random-seqs.fa', read_long_names=True)
>>> genes
Fasta("10-1000bp-random-seqs.fa")
>>> genes.keys()
odict_keys([' seq 0', ' seq 1', ' seq 2', ' seq 3', ' seq 4', ' seq 5', ' seq 6', ' seq 7', ' seq 8', ' seq 9'])
Because pyfaidx detects duplicate keys when it reads the index, the .fai itself
will appear to contain duplicate sequences (which may be a problem for other tools). It will not contain whitespace, though, so I think it is "more correct" with respect to samtools behavior. See #111 for more information about how I arrived at this behavior.
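As a rough illustration of the two key styles (a plain-Python sketch, not pyfaidx's actual code), the snippet below derives keys from a few deflines both ways. The deflines and the helper name defline_keys are made up for the example:

```python
from collections import Counter

# Hypothetical deflines, similar in shape to the ones discussed in this issue.
deflines = [">seq 0", ">seq 1", ">seq 2"]


def defline_keys(lines, long_names=False):
    """Return the key each defline would map to.

    long_names=False keeps only the text before the first whitespace
    (the samtools faidx convention); long_names=True keeps the whole line.
    This helper is illustrative only and is not part of pyfaidx.
    """
    keys = []
    for line in lines:
        # Drop the leading '>' and any trailing newline.
        text = line[1:] if line.startswith(">") else line
        text = text.rstrip("\n")
        if long_names:
            keys.append(text)
        else:
            parts = text.split()
            keys.append(parts[0] if parts else "")
    return keys


short_keys = defline_keys(deflines)
long_keys = defline_keys(deflines, long_names=True)

# Keys that occur more than once when only the first token is kept.
duplicates = [key for key, count in Counter(short_keys).items() if count > 1]

print(short_keys)   # ['seq', 'seq', 'seq']
print(long_keys)    # ['seq 0', 'seq 1', 'seq 2']
print(duplicates)   # ['seq']
```

With only the first whitespace-delimited token kept, every record collapses to the same key, which is the duplicate-key situation described above. Keeping the full defline makes the keys unique again.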
The documentation should definitely be clearer about this, as well as other Fasta and Faidx arguments, so feel free to submit a PR if you want to add something to the README.
I was using some relatively simple code I wrote to generate FASTA files containing random sequences:
The file ends up looking like this:
However, I am getting the following error:
It seems that it's only picking up on the seq in the description line, not on the identifier.