lmrodriguezr / nonpareil

Estimate metagenomic coverage and sequence diversity
http://enve-omics.ce.gatech.edu/nonpareil/

Can't process gzipped fastq #35

Closed ohthetrees closed 3 weeks ago

ohthetrees commented 6 years ago

Hi, I'm just getting started with Nonpareil, thanks for your work.

I'm unable to process my gzipped fastq. If I first uncompress the file, it processes as expected. The error:

$ nonpareil -s ETNP_120m_R2.name.fastq.gz -t 4 -T kmer -f fastq -b ETNP_120m_R2.nonpareil.k
Nonpareil v3.301
Fatal error:
The file provided does not have the proper fastq format
 [      0.0] Fatal error: The file provided does not have the proper fastq format

lmrodriguezr commented 4 years ago

Sorry for the loooong delay, I'm back now at tending to the issues.

I believe this is an issue with the kmer kernel, which doesn't allow gzipped input because of the random access function it uses (@gunturus please comment if I'm wrong).

Unfortunately, I don't think this can be easily resolved. I'll leave this issue open until I add a corresponding comment to the documentation, but you'll have to unzip the fastq file prior to using nonpareil.

jfy133 commented 3 years ago

I'm starting to investigate nonpareil, and also had the same issue.

Gzipped input support would be very useful, because I have >100 sequencing files, all in the >1GB file-size range, so having to decompress each one would be a bit nasty when trying to parallelise processing all the files at once.

So I would like to voice my support for this, if a solution is feasible (even if there is an internal temporary decompression)!

lmrodriguezr commented 3 years ago

@gunturus Do you have an update on this issue? I know you were looking into it. Thanks!

jfy133 commented 3 years ago

@gunturus do you have any more news? I'm interested in potentially adding nonpareil to the nf-core/eager pipeline, but the lack of gzip support is unfortunately a deal breaker...

gunturus commented 3 years ago

@jfy133 unfortunately gzip is not supported. @lmrodriguezr do you have any suggestions to provide gzip support? I have no idea.

jfy133 commented 3 years ago

Do you think this is in any way on a roadmap @lmrodriguezr? Just to know if I should look for different solutions instead.

VGalata commented 2 years ago

I would also like to add that having support for compressed FASTQ files would be good.

lmrodriguezr commented 2 years ago

Hello. We're finally back at this issue, and it's top of the roadmap. An initial not-so-clean solution would be to unzip the files into a temporary directory, launch nonpareil, and then remove the directory. Would this work as a temporary solution? If yes, I can implement it into a bash wrapper so you could use it out of the box.
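
The temporary-directory workaround described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the wrapper that ships with Nonpareil: the function name nonpareil_gz and the NONPAREIL_BIN variable are hypothetical, and it assumes that gzip and mktemp are available on the machine.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of Nonpareil): decompress a gzipped FASTQ
# into a temporary directory, run nonpareil on the plain copy, then remove
# the directory again. NONPAREIL_BIN and nonpareil_gz are illustrative names.

NONPAREIL_BIN="${NONPAREIL_BIN:-nonpareil}"

nonpareil_gz() {
    gz_file="$1"
    shift  # the remaining arguments are passed through to nonpareil unchanged

    tmp_dir="$(mktemp -d)" || return 1
    plain_file="$tmp_dir/$(basename "$gz_file" .gz)"
    status=0

    # Decompress to a regular file so the kmer kernel can seek in it.
    if gzip -dc "$gz_file" > "$plain_file"; then
        "$NONPAREIL_BIN" -s "$plain_file" "$@" || status=$?
    else
        status=1
    fi

    # Always clean up the temporary copy, whether or not the run succeeded.
    rm -rf "$tmp_dir"
    return "$status"
}
```

With this in place, the command from the original report would become nonpareil_gz ETNP_120m_R2.name.fastq.gz -t 4 -T kmer -f fastq -b ETNP_120m_R2.nonpareil.k. The sketch does not clean up if the process is interrupted, and it still needs enough scratch space for the uncompressed file.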

A more robust solution is to read directly from the zipped file, but this will take some heavy lifting because we will need to replace random file access with another method. It's also doable, but it'll take us a bit longer, so hopefully the first option works in the meantime?

VGalata commented 2 years ago

Dear @lmrodriguezr,

Thank you very much for looking into this!

For our purposes, having the second option implemented would be better. We use nonpareil in a snakemake workflow where we want to move away from using unzipped FASTQ files, and we would like to avoid unnecessary unzipping if possible. And, as you say yourself, that would also be a more robust solution, and I think it would be worth waiting for.

jfy133 commented 2 years ago

@lmrodriguezr we are in the same situation as @VGalata as we would like to add it to a nextflow pipeline ;).

However, I think unzipping to a /tmp location with automatic cleanup afterwards might be an OK temporary workaround, as then at least we don't have to deal with the unzipping ourselves. On the other hand, this depends on the implementation, and whether you rely on an internal unzipping library within the bash script, or rely on a tool already installed on a user's machine (which is much more flaky, as this is often frustratingly not very portable).

But depending on the time it takes for the more robust solution, I guess I would prefer to wait a bit longer so that the time investment goes into an 'inbuilt' solution.

jfy133 commented 2 years ago

@lmrodriguezr just another thought... would it be easier to refactor input to allow stdin?

Then one could simply do zcat <fastq>.gz | nonpareil <additional params>

Just sayin', as that would also be fine with me as a way of accepting gzipped input, in terms of usability.

davidecarlson commented 5 months ago

Just wanted to chime in with more support for enabling compressed fastq files!

jfy133 commented 3 weeks ago

🎉🎉🎉🎉🎉 woohoo! Great to see this @lmrodriguezr ! Thanks for getting to it!