mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
459 stars 75 forks source link

False duplicate key error #150

Closed Benjamin-Lee closed 5 years ago

Benjamin-Lee commented 5 years ago

I was using some relatively simple code I wrote to generate FASTA files containing random sequences:

import random
import sys

number = int(sys.argv[1])
length = int(sys.argv[2])

for i in range(number):
    seq = ""
    for j in range(length):
        seq += random.choice("ATGC")
    with open(f"{number}-{length}bp-random-seqs.fa", "a+") as f:
        print(f"> seq {i}", file=f)
        print(seq, file=f)

The file ends up looking like this:

> seq 0
...
.
.
.
> seq n
...

However, I am getting the following error:

  File "/Users/BenjaminLee/.virtualenvs/squiggle/lib/python3.6/site-packages/pyfaidx/__init__.py", line 481, in read_fai
    raise ValueError('Duplicate key "%s"' % key)
ValueError: Duplicate key "seq"

It seems that it's only picking up on the seq in the description line, not on the identifier.

mdshw5 commented 5 years ago

This is expected behavior, which was chosen to mimic the samtools faidx behavior of splitting deflines on whitespace. You can specify that the .fai index should not be used as a key name by passing the read_long_names argument:

>>> genes = Fasta('10-1000bp-random-seqs.fa', read_long_names=True)
>>> genes
Fasta("10-1000bp-random-seqs.fa")
>>> genes.keys()
odict_keys([' seq 0', ' seq 1', ' seq 2', ' seq 3', ' seq 4', ' seq 5', ' seq 6', ' seq 7', ' seq 8', ' seq 9'])

Since duplicate keys are detected during index reading in pyfaidx, the .fai will appear to contain duplicate sequences (which may be a problem for other tools) but will not contain whitespace, and so I think is "more correct" with respect to samtools behavior. See #111 for more information about how I arrived at this behavior.

The documentation should definitely be clearer about this, as well as other Fasta and Faidx arguments, so feel free to submit a PR if you want to add something to the README.