mdshw5 / pyfaidx

Efficient pythonic random access to fasta subsequences
https://pypi.python.org/pypi/pyfaidx
Other
459 stars 75 forks source link

Carriage return breaks parsing #141

Closed Benjamin-Lee closed 6 years ago

Benjamin-Lee commented 6 years ago

This is a weird edge case, but it appears that FASTA files with CRs break the parser.

The file that I'm trying to parse is here. It looks like it's only getting the last line:

In [5]: from pyfaidx import Fasta

In [6]: for seq in Fasta("1000seq.fasta"):
   ...:     print(seq)
   ...:     break
   ...:
ATGGAGGCAGCCAGCCACGGTAGGAGTTGGGTCTACCCGAAGATGGTGCGCTAACCAGCA

In [7]: from Bio import SeqIO

In [8]: for seq in SeqIO.parse("1000seq.fasta", "fasta"):
   ...:     print(str(seq.seq))
   ...:     break
   ...:
CACGTCGGACGAGATCTTCGGATCTAGTGGCGGACGGGTGAGTAACGCGTGGGAACGTGCCCTTTGCTACGGAATAGTCCCGGGAAACTGGGTTTAATACCGTATACGCCCTTCGGGGGAAAGATTTATCGGCAAAGGATCGGCCCGCGTTGGATTAGGTAGTTGGTGGGGTAATGGCCTACCAAGCCGACGATCCATAGCTGGTTTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATCTTAGACAATGGGGGAAACCCTGATCTAGCCATGCCGCGTGAGCGATGAAGGCCTTAGGGTTGTAAAGCTCTTTCGCTGGGGAAGATAATGACGGTACCCAGTAAAGAAGCCCCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGGGCTAGCGTTGTTCGGAATTACTGGGCGTATAGCGCGCGTTAGGCGGATTGGAAAGTTGGGGGTGAAATCCCGGGGCTCAACCCCGGAACTGCCTCCAAAACTATCAGTCTGGAGTTCGAGAGAGGTGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATGTTCGGAGGAACACCAGTGGCGAAGGCGGCTCACTGGCTCGATACTGACGCTGAGGTGCGAAAGTGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACACCGTAAACGATGAATGCCAGACGTCGGGCAGTATACTGTTCGGTGTCACACCTAACGGATTAAGCATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCAACCCTTGACATGGGTATCGCGGGCCCAGAGATGGTCCTTTCAGTTCGGCTGGATACCACACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTCGGTTAAGTCCGGCAACGAGCGCAACCCACGTCCCTAGTTGCCAGCAGGTCAAGCTGGGCACTCTATGGAAACTGCCGATGATAAGTCGGAGGAAGGTGTGGATGACGTCAAGTCCTCATGGCCCTTACGGGTTGGGCTACACACGTGCTACAATGGTAGTGACAGTGGGATAATCCCAAAAAGCTATCTCAGTTCGGATTGGGGTCTGCAACTCGACCCCATGAAGTCGGAATCGCTAGTAATCGTGGAACAGCATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTTGGGTCTACCCGAAGATGGTGCGCTAACCAGCAATGGAGGCAGCCAGCCACGGTAGGCTCAGCGACTGGGGTGAAGTCG
mdshw5 commented 6 years ago

Looks like something I should be able to support! It's not such an edge case for people that might edit a FASTA file using Windows...

mdshw5 commented 6 years ago

Based on the file you provided, the issue is that I didn't account for python's universal newline support (https://www.python.org/dev/peps/pep-0278/). I'll see what I can do about working around that. Thanks for opening this issue!

mdshw5 commented 6 years ago

Looks like I have full support for carriage returns now. Thanks again for reporting this!

andrewhill157 commented 3 years ago

Hi @mdshw5, I ran into what I think is a related issue, but more subtle than the above example.

If I enter this sequence (similar sequences have same behavior) to a file with windows newlines:

>test
TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACATACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCATGGCCAACACTTGTCACTACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGACTTTTTCAAGAGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTACAAAGATGACGGGAACTACAAGACACGTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGAAGATGGAAACATTCTTGGACACAAAATGGAATACAACTATAACTCACATAATGTATACATCATGGCAGACAAACCAAAGAATGGAATCAAAGTTAACTTCAAAATTAGACACAACATTAAAGATGGAAGCGTTCAATTAGCAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTCCACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTTCTTGAGTTTGTAACAGCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAAATGTCCAGACTTCCAATTGACACTAAAGTGTCCGAACAATTACTAAATTCTCAGGGTTCCTGGTTAAATTCAGGCTGAGACTTTATTTATATATTTATAGATTCATTAAAATTTTATGAATAATTTATTGATGTTATTAATAGGGGCTATTTTCTTATTAAATAGGCTACTGGAGTG

for example:

dna = "TACACA[...]" # seq above

with open('test.fa', 'w') as output:
    output.write('>test\r\n%s\r\n' % dna)

And then I extract a sequence:

import pyfaidx

input_fasta = pyfaidx.Fasta("test.fa")
print(len(str(input_fasta["test"][0:500])))

I get a sequence that is 1bp too long. This seems to be true for any endpoint (at least when starting at the beginning) that is past half the length of the sequence, and that behavior is consistent across diff sequence lengths from 1kb to 20kb that I tested (I didn't test any higher than 20kb). I am not sure how/if it depends on the start location.

Anyway, is obviously a minor/irrelevant behavior for most use cases, but it happened to be relevant in my case so thought would let you know. Assume must be related to index in some way. Thanks for developing pyfaidx, super useful!

mdshw5 commented 3 years ago

@andrewhill157 This is a really interesting issue, and I really appreciate you reporting it. Thanks for the detailed description - I was able to reproduce the problem and will think about what might be causing it. Initially looking through the code I'm a bit stumped. However it did key me in to at least one place where I'm not respecting the original line endings (#170) though unfortunately not related to this issue.

mdshw5 commented 3 years ago

It looks like I was screwing things up when calculating the offset for a sequence when there were Windows line endings.

I tracked down the offending calculation and it's obvious I shouldn't have been multiplying the total number of newlines by the line terminator length:

## WRONG
>>> 459 / 919 * 2
0.998911860718172
>>> int(459 / 919 * 2)
0
>>> int(460 / 919 * 2)
1
## CORRECT
>>> 459 / 919
0.499455930359086
>>> int(460 / 919 * 1)
0
>>> int(459 / 919 * 1)
0

I fixed this properly I think:

diff --git a/pyfaidx/__init__.py b/pyfaidx/__init__.py
index 2f30b30..a211098 100644
--- a/pyfaidx/__init__.py
+++ b/pyfaidx/__init__.py
@@ -659,11 +659,12 @@ def from_file(self, rname, start, end, internals=False):

         # Calculate offset (https://github.com/samtools/htslib/blob/20238f354894775ed22156cdd077bc0d544fa933/faidx.c#L398)
         newlines_before = int(
-            (start0 - 1) / i.lenc * (i.lenb - i.lenc)) if start0 > 0 and i.lenc else 0
-        newlines_to_end = int(end / i.lenc * (i.lenb - i.lenc)) if i.lenc else 0
+            (start0 - 1) / i.lenc) if start0 > 0 and i.lenc else 0
+        newlines_to_end = int(end / i.lenc) if i.lenc else 0
         newlines_inside = newlines_to_end - newlines_before
-        seq_blen = newlines_inside + seq_len
-        bstart = i.offset + newlines_before + start0
+        newline_blen = i.lenb - i.lenc
+        seq_blen = newlines_inside * newline_blen + seq_len
+        bstart = i.offset + newlines_before * newline_blen + start0
         if seq_blen < 0 and self.strict_bounds:
             raise FetchError("Requested coordinates start={0:n} end={1:n} are "
                              "invalid.\n".format(start, end))
@@ -674,7 +675,7 @@ def from_file(self, rname, start, end, internals=False):
         with self.lock:
             if self._bgzf:  # We can't add to virtual offsets, so we need to read from the beginning of the record and trim the beginning if needed
                 self.file.seek(i.offset)
-                chunk = start0 + newlines_before + newlines_inside + seq_len
+                chunk = start0 + (newlines_before * newline_blen) + (newlines_inside * newline_blen) + seq_len
                 chunk_seq = self.file.read(chunk).decode()
                 seq = chunk_seq[start0 + newlines_before:]
             else:

The tests haven't run yet, but I'll bet this fixes your issue (unless I've made a typo and broken everything :)).

andrewhill157 commented 3 years ago

Awesome, thanks very much for looking into and fixing so quickly (I assume it worked on your end?), really appreciate the help! Also thanks for the explanation and pointing to the changes, very helpful for my own understanding : )

mdshw5 commented 3 years ago

Thanks for the help. I tagged a new release v0.5.9.2 and this contains the fix. Tests are passing but I should probably add your example as a test case too.