False duplicate key error

mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences

Other

459 stars 75 forks source link

import random import sys number = int(sys.argv[1]) length = int(sys.argv[2]) for i in range(number): seq = "" for j in range(length): seq += random.choice("ATGC") with open(f"{number}-{length}bp-random-seqs.fa", "a+") as f: print(f"> seq {i}", file=f) print(seq, file=f)

File "/Users/BenjaminLee/.virtualenvs/squiggle/lib/python3.6/site-packages/pyfaidx/__init__.py", line 481, in read_fai raise ValueError('Duplicate key "%s"' % key) ValueError: Duplicate key "seq"

This is expected behavior, which was chosen to mimic the samtools faidx behavior of splitting deflines on whitespace. You can specify that the .fai index should not be used as a key name by passing the read_long_names argument:

>>> genes = Fasta('10-1000bp-random-seqs.fa', read_long_names=True)
>>> genes
Fasta("10-1000bp-random-seqs.fa")
>>> genes.keys()
odict_keys([' seq 0', ' seq 1', ' seq 2', ' seq 3', ' seq 4', ' seq 5', ' seq 6', ' seq 7', ' seq 8', ' seq 9'])

Since duplicate keys are detected during index reading in pyfaidx, the .fai will appear to contain duplicate sequences (which may be a problem for other tools) but will not contain whitespace, and so I think is "more correct" with respect to samtools behavior. See #111 for more information about how I arrived at this behavior.

The documentation should definitely be clearer about this, as well as other Fasta and Faidx arguments, so feel free to submit a PR if you want to add something to the README.

mdshw5 / pyfaidx

False duplicate key error #150