Open sarahet opened 4 years ago
Hey @sarahet thanks for the report.
This is indeed currently not supported and would be a new feature. It needs some thought to do it correctly, though. We will discuss this and see how we can put it on our roadmap.
Great, thanks. A related issue is that input validators generally do not allow file endings that include multiple dots, like `.fastq.gz`. Could that be included as well, or is that rather not planned? Related: https://github.com/seqan/lambda/issues/161
Yes, I guess that is somewhat related. Is the multiple-dot version restricted to two dots, assuming one valid extension and one compression extension? Or are more complex cases possible?
So far I can only think of one extension plus a compression extension, but of course there could be other use cases that I currently don't have in mind.
Is it a use case to compress your file multiple times, e.g. `fasta.gz.bz2`? (I don't even know if this makes sense, just wondering)
I think `.fasta.tar.bz` etc. could be. Not that we would support that at the moment, but it may be interesting for the future.
Would it be a solution to have a constant list of "compression file extensions" similar to the sequence file extensions? For a given filename, the validator algorithm would be as follows:
`.tar.bz` being one element and list all common ones)

I suggest not adding the compression extensions to the help page, as the product of [file ext.] x [compression ext.] is too large and not helpful to repeat for each input file parameter. We can document somewhere that files with certain extensions get implicitly extracted.
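The individual steps of the proposed algorithm did not survive in this thread, so the following is only a minimal sketch of one plausible reading: strip at most one known compression extension from the file name, then check what remains against the format extensions. The names `compression_extensions`, `ends_with_icase` and `has_valid_extension` are hypothetical and not part of seqan3 or sharg, and the list contents are just the examples mentioned above.

```cpp
#include <algorithm>
#include <array>
#include <cctype>
#include <filesystem>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical constant list (not a seqan3/sharg API). Multi-part entries such as
// ".tar.bz" count as one element and are listed first so they are matched first.
inline constexpr std::array<std::string_view, 3> compression_extensions{".tar.bz", ".gz", ".bz2"};

// Returns true if `text` ends with `suffix`, ignoring ASCII case.
bool ends_with_icase(std::string_view text, std::string_view suffix)
{
    if (suffix.size() > text.size())
        return false;

    return std::equal(suffix.begin(), suffix.end(), text.end() - suffix.size(),
                      [](unsigned char a, unsigned char b) { return std::tolower(a) == std::tolower(b); });
}

// Assumed reading of the proposal: first strip one compression extension (if any),
// then require the remaining name to end in '.' followed by a format extension.
bool has_valid_extension(std::filesystem::path const & file,
                         std::vector<std::string> const & format_extensions)
{
    std::string name = file.filename().string();

    for (std::string_view compression : compression_extensions)
    {
        if (ends_with_icase(name, compression))
        {
            name.erase(name.size() - compression.size());
            break; // only a single layer of compression is handled in this sketch
        }
    }

    return std::any_of(format_extensions.begin(), format_extensions.end(),
                       [&name](std::string const & ext) { return ends_with_icase(name, "." + ext); });
}
```

With this approach the help page would only need to list the format extensions, since the compression extensions are handled separately from them.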
We will defer this feature until the seqan3 I/O design is fixed.
In the meantime, @eseiler will post a workaround here:
This works for both sharg and seqan3; I added two comments where the code for the two differs. Also, with sharg you would need to use the SHARG macros instead of the SEQAN3 ones.
The output for a failed validation is quite noisy for seqan3 (it will print all combinations of `extensions` x `compression_extensions`). With sharg, you could set `extensions_str` to the same string that is shown in the help page message and get a nicer error message.
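The workaround code itself is not preserved in this thread. As a rough illustration of where the many combinations in the error message come from, here is a sketch that builds the product of format extensions and compression extensions using only the standard library. The function name `combine_extensions` is hypothetical, and how the resulting list is passed to a seqan3 or sharg validator is not shown here.

```cpp
#include <string>
#include <vector>

// Builds every "<ext>" and "<ext>.<compression>" combination.
// Hypothetical helper for illustration; not part of seqan3 or sharg.
std::vector<std::string> combine_extensions(std::vector<std::string> const & extensions,
                                            std::vector<std::string> const & compression_extensions)
{
    // Start with the plain (uncompressed) extensions.
    std::vector<std::string> combined = extensions;
    combined.reserve(extensions.size() * (compression_extensions.size() + 1));

    // Append each extension followed by each compression extension.
    for (std::string const & ext : extensions)
        for (std::string const & compression : compression_extensions)
            combined.push_back(ext + "." + compression);

    return combined;
}
```

For two format extensions and two compression extensions this already yields six entries, which is why printing the full list on a failed validation gets long quickly.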
When defining an input file with a validator for sequence input files the following way:
it does not accept `.gz` files but only `[embl,fasta,fa,fna,ffn,faa,frn,fastq,fq,genbank,gb,gbk,sam]`.
Shouldn't this be possible as most sequence input files are actually compressed?