Closed: Benjamin-Lee closed this issue 6 years ago
Looks like something I should be able to support! It's not such an edge case for people that might edit a FASTA file using Windows...
Based on the file you provided, the issue is that I didn't account for python's universal newline support (https://www.python.org/dev/peps/pep-0278/). I'll see what I can do about working around that. Thanks for opening this issue!
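For context, here's a minimal sketch of what universal newline translation does (the file name and contents are made up purely for illustration): in text mode Python quietly turns CRLF into LF, so character counts no longer match the byte offsets that the .fai index records for the file on disk.

with open("crlf_example.fa", "wb") as fh:    # hypothetical throwaway file
    fh.write(b">test\r\nACGT\r\nACGT\r\n")   # 19 bytes on disk

with open("crlf_example.fa") as fh:          # text mode: universal newlines
    print(len(fh.read()))                    # 16 characters; the '\r' bytes are hidden

with open("crlf_example.fa", "rb") as fh:    # binary mode preserves the raw bytes
    print(len(fh.read()))                    # 19 bytes, the count faidx offsets care about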
Looks like I have full support for carriage returns now. Thanks again for reporting this!
Hi @mdshw5, I ran into what I think is a related issue, but more subtle than the above example.
If I write this sequence (similar sequences show the same behavior) to a file with Windows newlines:
>test
TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCATGGCCAACACTTGTCACTACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTACAAAGATGACGGGAACTACAAGACACGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAAATGGAATACAACTATAACTCACATAATGTATACATCATGGCAGACAAACCAAAGAATGGAATCAAAGTTAACTTCAAAATTAGACACAACATTAAAGATGGAAGCGTTCAATTAGCAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTTCTTGAGTTTGTAACAGCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAAATGTCCAGACTTCCAATTGACACTAAAGTGTCCGAACAATTACTAAATTCTCAGGGTTCCTGGTTAAATTCAGGCTGAGACTTTATTTATATATTTATAGATTCATTAAAATTTTATGAATAATTTATTGATGTTATTAATAGGGGCTATTTTCTTATTAAATAGGCTACTGGAGTG
for example:
dna = "TACACA[...]" # seq above
with open('test.fa', 'w') as output:
    output.write('>test\r\n%s\r\n' % dna)
And then I extract a sequence:
import pyfaidx
input_fasta = pyfaidx.Fasta("test.fa")
print(len(str(input_fasta["test"][0:500])))
I get a sequence that is 1 bp too long. This seems to be true for any endpoint (at least when starting at the beginning of the record) that is past half the length of the sequence, and that behavior is consistent across the different sequence lengths I tested, from 1 kb to 20 kb (I didn't test anything longer than 20 kb). I am not sure how, or if, it depends on the start location.
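For illustration, a quick sweep along these lines shows the pattern (just a sketch, assuming the test.fa written above and a pyfaidx version affected by this bug):

import pyfaidx

fasta = pyfaidx.Fasta("test.fa")
record_length = len(fasta["test"])
for end in range(100, record_length + 1, 100):
    returned = len(str(fasta["test"][0:end]))
    if returned != end:
        # On affected versions, endpoints past roughly half the line
        # length come back one base too long.
        print("requested %d bp, got %d bp" % (end, returned))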
Anyway, this is obviously minor/irrelevant behavior for most use cases, but it happened to be relevant in my case, so I thought I would let you know. I assume it must be related to the index in some way. Thanks for developing pyfaidx, super useful!
@andrewhill157 This is a really interesting issue, and I really appreciate you reporting it. Thanks for the detailed description - I was able to reproduce the problem and will think about what might be causing it. On an initial look through the code I'm a bit stumped. However, it did key me in to at least one place where I'm not respecting the original line endings (#170), though unfortunately that's not related to this issue.
It looks like I was screwing things up when calculating the offset for a sequence when there were Windows line endings.
I tracked down the offending calculation, and it's obvious I shouldn't have been multiplying by the line terminator length before truncating to an integer: the number of complete lines has to be counted first, and only then converted to terminator bytes:
## WRONG
>>> 459 / 919 * 2
0.998911860718172
>>> int(459 / 919 * 2)
0
>>> int(460 / 919 * 2)
1
## CORRECT
>>> 459 / 919
0.499455930359086
>>> int(460 / 919 * 1)
0
>>> int(459 / 919 * 1)
0
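To make that concrete, here's a standalone sketch (not the pyfaidx code itself) of the old and new calculations for the example above, i.e. a 919-character line width with CRLF endings and a request for the first 500 bases. The variable names mirror the patch below; newlines_before is 0 here because the slice starts at position 0.

lenc, lenb = 919, 921             # characters per line vs. bytes per line (CRLF)
start0, end = 0, 500              # the [0:500] slice from the example
seq_len = end - start0
newline_blen = lenb - lenc        # 2 terminator bytes per line

# Old calculation: the terminator length is multiplied in before the int()
# truncation, so a partial line can spuriously count as a whole newline.
old_newlines_to_end = int(end / lenc * newline_blen)             # == 1
old_seq_blen = old_newlines_to_end + seq_len                     # == 501 bytes read

# Fixed calculation: count whole lines first, then convert to bytes.
new_newlines_to_end = int(end / lenc)                            # == 0
new_seq_blen = new_newlines_to_end * newline_blen + seq_len      # == 500 bytes read

print(old_seq_blen, new_seq_blen)  # 501 500, i.e. the one extra base from the report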
I think I've fixed this properly:
diff --git a/pyfaidx/__init__.py b/pyfaidx/__init__.py
index 2f30b30..a211098 100644
--- a/pyfaidx/__init__.py
+++ b/pyfaidx/__init__.py
@@ -659,11 +659,12 @@ def from_file(self, rname, start, end, internals=False):
# Calculate offset (https://github.com/samtools/htslib/blob/20238f354894775ed22156cdd077bc0d544fa933/faidx.c#L398)
newlines_before = int(
- (start0 - 1) / i.lenc * (i.lenb - i.lenc)) if start0 > 0 and i.lenc else 0
- newlines_to_end = int(end / i.lenc * (i.lenb - i.lenc)) if i.lenc else 0
+ (start0 - 1) / i.lenc) if start0 > 0 and i.lenc else 0
+ newlines_to_end = int(end / i.lenc) if i.lenc else 0
newlines_inside = newlines_to_end - newlines_before
- seq_blen = newlines_inside + seq_len
- bstart = i.offset + newlines_before + start0
+ newline_blen = i.lenb - i.lenc
+ seq_blen = newlines_inside * newline_blen + seq_len
+ bstart = i.offset + newlines_before * newline_blen + start0
if seq_blen < 0 and self.strict_bounds:
raise FetchError("Requested coordinates start={0:n} end={1:n} are "
"invalid.\n".format(start, end))
@@ -674,7 +675,7 @@ def from_file(self, rname, start, end, internals=False):
with self.lock:
if self._bgzf: # We can't add to virtual offsets, so we need to read from the beginning of the record and trim the beginning if needed
self.file.seek(i.offset)
- chunk = start0 + newlines_before + newlines_inside + seq_len
+ chunk = start0 + (newlines_before * newline_blen) + (newlines_inside * newline_blen) + seq_len
chunk_seq = self.file.read(chunk).decode()
seq = chunk_seq[start0 + newlines_before:]
else:
The tests haven't run yet, but I'll bet this fixes your issue (unless I've made a typo and broken everything :)).
Awesome, thanks very much for looking into and fixing this so quickly (I assume it worked on your end?), really appreciate the help! Also, thanks for the explanation and for pointing to the changes; that was very helpful for my own understanding :)
Thanks for the help. I tagged a new release, v0.5.9.2, which contains the fix. Tests are passing, but I should probably add your example as a test case too.
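Something along these lines should do it (just a sketch using pytest's tmp_path fixture, not the exact test that will be committed):

import pyfaidx

def test_crlf_slice_is_exact_length(tmp_path):
    # One long CRLF-terminated line, as in the report; the slice end must be
    # past half the line length for the old off-by-one to show up.
    dna = "ACGT" * 230                                    # 920 bp on a single line
    fasta_path = tmp_path / "crlf_test.fa"
    fasta_path.write_bytes((">test\r\n%s\r\n" % dna).encode("ascii"))

    fasta = pyfaidx.Fasta(str(fasta_path))
    # Before the fix this returned 501 characters; it should be exactly 500.
    assert len(str(fasta["test"][0:500])) == 500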
This is a weird edge case, but it appears that FASTA files with CRs break the parser.
The file that I'm trying to parse is here. It looks like it's only getting the last line: