thegenemyers / DAZZ_DB

The Dazzler Data Base

fasta2DB: Skip subreads with ranges greater than 2^16 #4

Closed pbjd closed 10 years ago

pbjd commented 10 years ago

Ran into this while integrating with HGAP. Casting ints to ushorts corrupts the beg/end values in the record buffer when a subread's range values are larger than 2^16. A side effect is that the pbid in the DB becomes corrupted, creating a bookkeeping issue in HGAP. Here are some examples found in my dataset:

m131021_044647_42175_c100583802550000001823087704281485_s1_p0/95966/47433_65693
m131023_015910_42175_c100583712550000001823087704281400_s1_p0/76055/59571_70914
m131023_015910_42175_c100583712550000001823087704281400_s1_p0/117753/66441_69466

I put in a simple fix to just skip subreads that fall in this category and emit a warning.
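To make the failure mode concrete, here is a minimal sketch of the truncation and of a skip-with-warning check. It is not the actual fasta2DB patch: the helper names (store_as_ushort, range_fits, keep_subread) are hypothetical, and it assumes the record fields are exactly 16 bits wide, modelled here with uint16_t.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: what a 16-bit field retains when an int is stored in it.
   Conversion to an unsigned type reduces the value modulo 2^16. */
static uint16_t store_as_ushort(int v)
{
    return (uint16_t) v;
}

/* Hypothetical check: 1 if both ends of the range fit in 16 bits, else 0. */
static int range_fits(int beg, int end)
{
    return beg >= 0 && end >= beg && end <= UINT16_MAX;
}

/* Hypothetical skip logic: returns 1 to keep the subread, or 0 after
   printing a warning to stderr when the range cannot be stored. */
static int keep_subread(const char *name, int beg, int end)
{
    assert(name != NULL);
    if (range_fits(beg, end))
        return 1;
    fprintf(stderr, "warning: skipping %s: range %d_%d does not fit in 16 bits\n",
            name, beg, end);
    return 0;
}
```

With the first example above, beg = 47433 is stored unchanged, but end = 65693 is stored as 65693 - 65536 = 157, so the record ends up with end < beg.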

thegenemyers commented 10 years ago

The dazzler DB currently does not support cycle numbers greater than 2^16 as you have observed. I am surprised that such cycle numbers occur in your .fasta file as even with a 3 hour run I don't think the machine captures that many cycles. Can I ask where the .fasta's came from? How was the machine run?

A better fix would be to simply change the cycle numbers but preserve the length of the interval in fasta2DB. That way the data is not thrown away, just the actual cycle numbers are lost when one tries to recreate the input .fasta's with DB2fasta.

Ultimately I expect this machine to start producing reads of length over 2^16 at which point a number of shorts will have to become ints (and a few things will run a bit slower as a result).

thegenemyers commented 10 years ago

Mods addressed in my branch, thanks!