shahab-sarmashghi / Skmer

Fast and accurate tool for estimating genomic distances between genome-skims
https://shahab-sarmashghi.github.io/Skmer/
Other
39 stars 8 forks

Allow .gz files as input #2

Open shahab-sarmashghi opened 6 years ago

BenKuhnhaeuser commented 2 years ago

Hi Shahab, I wonder whether this has yet been implemented? I'm working on a huge dataset and not having to unzip all fastq files would be super useful. Many thanks, Ben

shahab-sarmashghi commented 2 years ago

Hi Ben, ultimately inputs need to be decompressed, since skmer runs jellyfish internally and jellyfish doesn't support compressed inputs. I can add .gz input support to skmer, but be aware that it would effectively decompress the input, write it to temporary disk space, and remove it after processing, which can also be done with a wrapper script (e.g. in bash). I'll try to implement and test this when I find some time to work on it, and I'll post here once done.
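
For reference, the decompress, run, clean up pattern described above could look roughly like the sketch below. It is written in Python rather than bash and is not part of skmer. The helper names `decompress_inputs` and `run_on_decompressed` are hypothetical, and the example command `["skmer", "reference"]` assumes that skmer takes the input directory as a positional argument, which is appended at the end.

```python
import gzip
import shutil
import subprocess
import tempfile
from pathlib import Path


def decompress_inputs(src_dir, dst_dir):
    """Copy every file from src_dir into dst_dir, gunzipping *.gz files.

    Hypothetical helper, not part of skmer.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    written = []
    for src in sorted(src_dir.iterdir()):
        if not src.is_file():
            continue
        if src.suffix == ".gz":
            # sample.fastq.gz -> sample.fastq
            dst = dst_dir / src.stem
            with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
                shutil.copyfileobj(fin, fout)
        else:
            # Already uncompressed: copy as-is so the tool sees all inputs.
            dst = dst_dir / src.name
            shutil.copyfile(src, dst)
        written.append(dst)
    return written


def run_on_decompressed(src_dir, command):
    """Decompress into a temporary directory, run `command` on it, clean up.

    `command` is an argument list such as ["skmer", "reference"] (assumed
    invocation); the temporary directory path is appended as the last
    argument. The directory is removed even if the command fails.
    """
    with tempfile.TemporaryDirectory() as tmp:
        decompress_inputs(src_dir, tmp)
        return subprocess.run(command + [tmp], check=True)
```

Calling `run_on_decompressed("my_skims_gz", ["skmer", "reference"])` would leave the compressed originals untouched, so nothing needs to be re-compressed afterwards; the cost is the temporary disk space for the decompressed copies while the command runs.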

BenKuhnhaeuser commented 2 years ago

Hi Shahab, thank you for providing these insights. It might be quite a computational burden if decompressing (and, after running Skmer, re-compressing) needs to be done in a single job, so maybe your suggestion of a wrapper script makes more sense. I've done that for now on my own task. But if you can find a straightforward way of dealing with compressed files using parallelisation (e.g. using pigz), then it might still be worthwhile, as it would make Skmer a bit more user-friendly.
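
One way to parallelise the decompression step is across files rather than within a single file. The sketch below does not use pigz: it swaps in Python's standard `gzip` module and a thread pool, and the function names `gunzip_one` and `gunzip_many` are hypothetical. How much speed-up threads give depends on the machine and on disk throughput, so treat it as a starting point rather than a tuned solution.

```python
import gzip
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def gunzip_one(src, out_dir):
    """Decompress one .gz file into out_dir and return the output path."""
    src = Path(src)
    dst = Path(out_dir) / src.stem  # sample.fastq.gz -> sample.fastq
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def gunzip_many(sources, out_dir, workers=4):
    """Decompress several .gz files concurrently, one file per worker.

    Results are returned in the same order as `sources`.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: gunzip_one(s, out_dir), sources))
```

With many fastq files per dataset, per-file parallelism like this keeps each worker independent and leaves the compressed originals in place, so no re-compression is needed after Skmer has finished.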