Open: rpetit3 opened this issue 3 years ago
And how they should be compressed.
Some files apparently benefit from being compressed with bgzip (from the htslib toolkit), but that requires including it in the containers, which is extra effort to maintain (see https://github.com/nf-core/modules/pull/1360#discussion_r816570773).
If we have such a list, we could add a function to pytest-workflow which warns/fails if such a file was generated and not compressed.
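To make the idea concrete, here is a minimal sketch of such a check. It is a standalone helper, not part of pytest-workflow's actual API. The names `SHOULD_BE_COMPRESSED`, `find_uncompressed`, and `check_outputs` are hypothetical, and the extension set is only a placeholder, since the real list is exactly what this issue is trying to settle.

```python
import warnings
from pathlib import Path

# Placeholder extensions for illustration only; the agreed list does not exist yet.
SHOULD_BE_COMPRESSED = {".fastq", ".fq", ".fasta", ".fa", ".gff"}


def find_uncompressed(output_dir, extensions=SHOULD_BE_COMPRESSED):
    """Return sorted paths under output_dir whose final suffix is in extensions.

    A file such as reads.fastq.gz has the final suffix .gz, so it is not flagged.
    """
    root = Path(output_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


def check_outputs(output_dir, fail=False):
    """Warn (or raise AssertionError if fail=True) when uncompressed outputs exist."""
    offenders = find_uncompressed(output_dir)
    if offenders:
        message = "Uncompressed outputs: " + ", ".join(str(p) for p in offenders)
        if fail:
            raise AssertionError(message)
        warnings.warn(message)
    return offenders
```

Calling `check_outputs("results/")` would only warn, while `check_outputs("results/", fail=True)` would make a test fail. Matching on the final suffix keeps the check simple, but it cannot tell gzip from bgzip; that would need inspecting file contents.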
Hi there!
We’ve noticed there hasn’t been much activity here. Are you still planning on working on this? If not, you can ignore this message and we’ll close your issue in about 2 weeks. If you think this is still relevant, you can also add it to the hackathon2023 project board.
Cheers, the nf-core maintainers
@rpetit3 Hello! Where did you envisage this list being placed? On the website, or just in this ticket for now, until it's made part of pytest-workflow?
Is your feature request related to a problem? Please describe
Somewhat related to a problem, but at the same time not really. I totally agree that compressed files should be used.
I'm just not sure where to draw the line. The big ones (FASTQ and BAM) are straightforward, but I think it becomes more difficult for things like BLAST results, genome annotations, log files, etc.
I started a conversation in Slack, but think putting it here might be better.
Describe the solution you'd like
I'm wondering if there could be a standing list of extensions that should be compressed. Here's an example using Prokka outputs:
I think most of these should be compressed, especially the FASTA ones. This is where I think a standing list of extensions, or maybe file types (e.g. sequences in FASTA format), would be useful as a guide and make the choice easier for submitters.
Here's a working list (big overlap with Prokka):