open2c / distiller-nf

A modular Hi-C mapping pipeline
MIT License

Option to use fastq-dump --split-files, instead of pyfilesplit #155

Closed mimakaev closed 3 years ago

mimakaev commented 3 years ago

I noticed that some .sra files are missing one side for a large fraction of the reads. This is usually due to problems with processing, but it would be nice to be able to map them as is. Hence it would be nice to have an option to download/split files using fastq-dump --split-files.

https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR8453396 is an example of an SRR that is missing 50% of the reads on one of the two sides. pyfilesplit does not handle this perfectly.

golobor commented 3 years ago

I just ran into the same issue! Can you expand a little on this: why do you think fastq-dump would handle this better?

golobor commented 3 years ago

OK, I'll implement an option to use fastq-dump --split-3 then?

mimakaev commented 3 years ago

I think it should be "--split-files --gzip"

Historically, we chose not to do this and wrote pyfilesplit instead, for the following reason. fastq-dump is intrinsically fast, but if you ask it to gzip the output, it compresses in the same thread on a single core and ends up painfully slow. Also, there was no nice way to make fastq-dump write to two streams, so we wrote pyfilesplit to do the splitting for us.
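For readers unfamiliar with the splitting step, here is a rough sketch of the two-stream idea in Python. It is not the actual pyfilesplit code: the function name `split_interleaved_fastq`, the strictly interleaved four-line-record input, and the output paths are all assumptions made for illustration. Note that Python's `gzip` module also compresses in the calling process, so this shows only the splitting, not a fix for the compression speed.

```python
import gzip
from itertools import islice


def split_interleaved_fastq(stream, out_path_1, out_path_2):
    """Split an interleaved FASTQ text stream into two gzipped files.

    Hypothetical helper, not part of pyfilesplit or distiller-nf.
    Assumes records strictly alternate side 1, side 2, and that every
    record is exactly four lines. Returns the number of pairs written.
    """
    lines = iter(stream)  # works for file objects and plain iterables alike
    n_pairs = 0
    with gzip.open(out_path_1, "wt") as out1, gzip.open(out_path_2, "wt") as out2:
        while True:
            # 4 lines per record, 2 records per pair
            block = list(islice(lines, 8))
            if not block:
                break
            if len(block) != 8:
                raise ValueError("truncated input: incomplete read pair")
            out1.writelines(block[:4])  # side 1 record
            out2.writelines(block[4:])  # side 2 record
            n_pairs += 1
    return n_pairs
```

It could be fed from any text stream, for example `split_interleaved_fastq(sys.stdin, "run_1.fastq.gz", "run_2.fastq.gz")`. Because it raises on an incomplete pair, a run with missing mates breaks the strict-interleaving assumption, which is the kind of input this issue is about.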

This all makes sense only if downstream processing (pairtools) is fine with a read being present on one side but not on the other. @golobor, is this true?

golobor commented 3 years ago

Agreed with both. Except, I suspect that --split-files would still keep unpaired reads in the two output files. --split-3, on the other hand, creates a third file that contains the unpaired reads. I'll try to post a commit today.
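To make the difference concrete, here is a toy Python model of the two behaviours as described in this comment. It does not call fastq-dump and is only an assumption-laden illustration: the spot representation `(name, read1, read2)` with `None` for a missing side, and the function names `split_files_model` and `split_3_model`, are made up for this sketch.

```python
def split_files_model(spots):
    """Toy model of the suspected --split-files behaviour:
    every present read goes to the file for its side, paired or not."""
    side1, side2 = [], []
    for name, r1, r2 in spots:
        if r1 is not None:
            side1.append((name, r1))
        if r2 is not None:
            side2.append((name, r2))
    return side1, side2


def split_3_model(spots):
    """Toy model of --split-3 as described above:
    complete pairs go to files 1 and 2, unpaired reads to a third file."""
    side1, side2, unpaired = [], [], []
    for name, r1, r2 in spots:
        if r1 is not None and r2 is not None:
            side1.append((name, r1))
            side2.append((name, r2))
        elif r1 is not None or r2 is not None:
            unpaired.append((name, r1 if r1 is not None else r2))
    return side1, side2, unpaired


# Hypothetical spots: spot2 is missing its second side.
spots = [
    ("spot1", "ACGT", "TTGA"),
    ("spot2", "GGCA", None),
    ("spot3", "CATG", "AACC"),
]

files_1, files_2 = split_files_model(spots)
three_1, three_2, three_unpaired = split_3_model(spots)

# In the first model the two side files can fall out of step;
# in the second they stay record-for-record in sync.
print(len(files_1), len(files_2))
print(len(three_1), len(three_2), len(three_unpaired))
```

In this model, the side files from `split_files_model` have different lengths whenever a mate is missing, whereas `split_3_model` keeps the two paired files aligned and sets the orphan reads aside. Which of these a downstream mapper tolerates is exactly the question raised above.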

golobor commented 3 years ago

Solved by https://github.com/open2c/distiller-nf/commit/5573b98532c859bbda26525eb286de155b4deb23. Feel free to reopen if it doesn't work!