thegenemyers / DAZZ_DB

The Dazzler Data Base

`fasta2DB -i` does something strange #21

Closed pb-cdunn closed 8 years ago

pb-cdunn commented 8 years ago

I run this:

+ gunzip -c chunk_000.fasta.gz | fasta2DB -v orig.db -ichunk_000
Adding 'chunk_000.fasta' ...

+ gunzip -c chunk_001.fasta.gz | fasta2DB -v orig.db -ichunk_001
Adding 'chunk_001.fasta' ...

...

I find that orig.db keeps growing dramatically, with many lines for each fasta like this:

files =   3342474
          1 chunk_000 m151129_221014_sherri_c100947162550000001823216206251657_s1_p0
          2 chunk_000 m151124_152824_42161_c100947332550000001823216206251626_s1_p0
          3 chunk_000 m151129_221014_sherri_c100947162550000001823216206251657_s1_p0
          4 chunk_000 m151124_152824_42161_c100947332550000001823216206251626_s1_p0
...
    4306674 chunk_003 m151121_214615_42161_c100947322550000001823216206251633_s1_p0
    4324991 chunk_003 m151128_033329_42142_c100947132550000001823216206251683_s1_p0
...

Is this expected?

pb-cdunn commented 8 years ago

> A file may contain the data from multiple SMRT cells provided the reads for each SMRT cell are consecutive in the file.

Ugh! Is that the problem? I was happy to learn that fasta2DB would finally handle a mixture of movies within a FASTA input, so that I could drop our pre-processing. But it doesn't actually solve the problem. We still need to pre-process.
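To see whether a given FASTA is affected, one can check whether each movie's reads form a single consecutive run. Below is a minimal sketch, assuming PacBio-style headers of the form `>movie/well/start_end`, where everything before the first `/` identifies the SMRT cell. The helper names `movie_of` and `scrambled_movies` are hypothetical and not part of DAZZ_DB.

```python
def movie_of(header):
    """Return the movie (SMRT cell) prefix of a FASTA header line.

    Assumption: headers look like '>movie/well/start_end'.
    """
    return header.lstrip(">").split("/", 1)[0]


def scrambled_movies(headers):
    """Return the movies whose reads are NOT consecutive, in order of detection."""
    seen = set()      # movies whose run has already started
    bad = []          # movies that reappear after another movie intervened
    previous = None   # movie of the preceding header
    for header in headers:
        movie = movie_of(header)
        if movie != previous:
            if movie in seen and movie not in bad:
                bad.append(movie)
            seen.add(movie)
            previous = movie
    return bad
```

An empty result means every movie occupies one contiguous block, which is the layout the quoted sentence requires.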

What if dexta re-ordered the reads for us, so that SMRT cells are consecutive?
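The re-ordering itself can be sketched as a stable grouping by movie prefix. This is only an illustration of the pre-processing step, not something dexta or fasta2DB does; the function names are hypothetical, and it assumes the whole input fits in memory and that the movie name is the header text before the first `/`.

```python
def read_fasta(lines):
    """Yield (header, sequence_lines) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line, []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def regroup_by_movie(records):
    """Yield records so that each movie's reads are consecutive.

    Movies come out in order of first appearance; reads within a movie
    keep their original relative order (dicts preserve insertion order
    in Python 3.7+).
    """
    groups = {}
    for header, seq in records:
        movie = header[1:].split("/", 1)[0]  # assumed '>movie/well/start_end'
        groups.setdefault(movie, []).append((header, seq))
    for group in groups.values():
        yield from group
```

Writing each regrouped record back out as its header line followed by its sequence lines gives a file in which every SMRT cell's reads are adjacent.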

thegenemyers commented 8 years ago

Yes, that is the problem. I don't understand how in the world your headers got all mixed up. Presumably you initially extracted the information from bax.h5 files, and at that time the headers were consecutive. I can see someone then concatenating several bax-extracted FASTAs into a single file, which is what I was anticipating with the new version. But I did not expect someone to effectively shuffle them, and I don't see why or how that is necessary or useful. Please explain to me why your reads are a complete scramble from many different SMRT cells.

If you insist on having them totally scrambled, then use fasta2DAM. The only thing you will lose is the knowledge of whether or not two reads come from the same well -- but given that your reads are a complete scramble, I presume you don't care about that anyway ;-)

-- Gene

pb-cdunn commented 8 years ago

LOL. Now that you mention it, that makes sense to me. I'll talk to the folks who gave me these data. Closing this Issue.