nf-core / modules

Repository to host tool-specific module files for the Nextflow DSL2 community!
https://nf-co.re/modules
MIT License

[FEATURE] A standing list of file extensions/file types that should be compressed #671

Open rpetit3 opened 3 years ago

rpetit3 commented 3 years ago

Is your feature request related to a problem? Please describe

Somewhat related to a problem, but at the same time not really. I totally agree that compressed files should be used, as the guidelines state:

Where applicable, the usage and generation of compressed files SHOULD be enforced as input and output, respectively:

*.fastq.gz and NOT *.fastq
*.bam and NOT *.sam

I'm just not sure where to draw the line. The big ones (FASTQ and BAM) are straightforward, but I think it becomes more difficult for things like BLAST results, genome annotations, log files, etc.

I started a conversation in Slack, but I think putting it here might be better.

Describe the solution you'd like

I'm wondering if there could be a standing list of extensions that should be compressed. Here's an example using Prokka outputs:

.err   failed annotations
.faa   proteins fasta
.ffn   genes fasta
.fna   contigs
.fsa   contigs
.gbk   genbank file
.gff   gff3 annotations
.log   prokka outputs
.sqn   sequin file
.tbl   tbl2asn file
.tsv   tsv of annotations
.txt   annotation stats 

I think most of these should be compressed, especially the FASTA ones. This is where I think a standing list of extensions, or maybe file types (e.g. sequences in FASTA format), would be useful as a guide and make the choice easier on submitters.

Here's a working list (big overlap with Prokka):

.aln   alignment
.fa    fasta
.faa   proteins fasta
.fasta fasta
.fastq fastq
.fq    fastq
.ffn   genes fasta
.fna   contigs
.fsa   contigs
.gbk   genbank file
.gfa   assembly graph
.gff   gff3 annotations
.sqn   sequin file
.tbl   tbl2asn file
.vcf   variants
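
To make the idea concrete, here is a minimal sketch of how the working list above could be kept as data and queried. The names `COMPRESSIBLE_EXTENSIONS` and `should_be_compressed` are hypothetical, and the set is only the draft list from this issue, not an agreed nf-core standard.

```python
from pathlib import PurePath

# Draft list copied from this issue; an assumption, not an agreed standard.
COMPRESSIBLE_EXTENSIONS = {
    ".aln", ".fa", ".faa", ".fasta", ".fastq", ".fq", ".ffn", ".fna",
    ".fsa", ".gbk", ".gfa", ".gff", ".sqn", ".tbl", ".vcf",
}


def should_be_compressed(filename: str) -> bool:
    """Return True if the final extension is on the draft list.

    A file such as 'reads.fastq.gz' ends in '.gz', which is not on the
    list, so it is treated as already compressed.
    """
    return PurePath(filename).suffix.lower() in COMPRESSIBLE_EXTENSIONS
```

With this, `should_be_compressed("reads.fastq")` is true while `should_be_compressed("reads.fastq.gz")` and `should_be_compressed("prokka.log")` are false. Matching on file type rather than extension would need something more than a suffix lookup.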

mahesh-panchal commented 2 years ago

There is also the question of how they should be compressed. Some files apparently benefit from being compressed with bgzip (from the htslib toolkit), but that requires including it in the containers, which is extra effort to maintain (see https://github.com/nf-core/modules/pull/1360#discussion_r816570773).
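
If the list ends up distinguishing gzip from bgzip, a check would need to tell the two apart. A rough sketch, assuming only the first block is inspected and that the BGZF `BC` extra subfield comes first in the gzip header (which is where bgzip writes it); `classify_compression` is a hypothetical helper name:

```python
GZIP_MAGIC = b"\x1f\x8b"


def classify_compression(path):
    """Classify a file as 'bgzf', 'gzip', or 'not-gzip' from its first bytes."""
    with open(path, "rb") as handle:
        header = handle.read(16)
    if header[:2] != GZIP_MAGIC:
        return "not-gzip"
    # Byte 3 is the gzip FLG field; bit 2 (0x04) means an extra field follows.
    has_extra = len(header) >= 14 and bool(header[3] & 0x04)
    # BGZF stores a 'BC' subfield at offset 12 (assumed to be the first subfield).
    if has_extra and header[12:14] == b"BC":
        return "bgzf"
    return "gzip"
```

This only reads the header, so it needs neither htslib nor a full decompression pass, but it says nothing about whether the rest of the file is valid.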

grst commented 2 years ago

If we have such a list, we could add a function to pytest-workflow which warns/fails if such a file was generated and not compressed.
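
As a rough sketch of what such a check might do (plain Python only; this does not use or assume any pytest-workflow API, and `find_uncompressed` and `FLAGGED_EXTENSIONS` are hypothetical names with a small subset of the list for illustration):

```python
from pathlib import Path

# Illustrative subset of the proposed list; an assumption for this sketch.
FLAGGED_EXTENSIONS = frozenset({".fasta", ".fastq", ".gff", ".vcf"})


def find_uncompressed(output_dir, extensions=FLAGGED_EXTENSIONS):
    """Return sorted relative paths of files whose final extension is flagged."""
    root = Path(output_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
```

A test could then fail with `assert not find_uncompressed(workdir)`, or only warn by printing the returned paths.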

jasmezz commented 1 year ago

Hi there!

We’ve noticed there hasn’t been much activity here. Are you still planning on working on this? If not, you can ignore this message and we’ll close your issue in about 2 weeks. If you think this is still relevant, you can also add it to the hackathon2023 project board.

Cheers, the nf-core maintainers

lukbut commented 1 year ago

@rpetit3 Hello! Where did you envisage this list being placed? On the website or just in this ticket for now, until it's made part of pytest-workflow?