seqan / sharg-parser

The modern argument parser for C++ tools
https://docs.seqan.de/sharg

Sequence file input validator does not accept gzipped files #77

Open sarahet opened 4 years ago

sarahet commented 4 years ago

When defining an input file with a validator for sequence input files the following way:

parser.add_option(options.queryFile, 'q', "query", "Query sequences.",
seqan3::option_spec::REQUIRED,
seqan3::input_file_validator<seqan3::sequence_file_input<>>{});

it does not accept .gz files but only [embl,fasta,fa,fna,ffn,faa,frn,fastq,fq,genbank,gb,gbk,sam]

Shouldn't this be possible as most sequence input files are actually compressed?

rrahn commented 4 years ago

Hey @sarahet thanks for the report.

This is indeed not supported at the moment and would be a new feature. It needs some thought to do correctly, though. We will discuss this and see how we can fit it into our roadmap.

sarahet commented 4 years ago

Great, thanks. A related issue is that input validators generally cannot accept file endings that include multiple dots, like .fastq.gz. Could that be included as well, or is that not planned? Related: https://github.com/seqan/lambda/issues/161
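To illustrate the multi-dot problem: `std::filesystem::path::extension()` only returns the last component (`.gz` for `reads.fastq.gz`), so a check built on it alone cannot see `fastq.gz`. Below is a minimal sketch of a suffix-based check that does handle such endings. The helper `has_allowed_extension` is hypothetical and not part of seqan3 or sharg; whether the library validators work this way internally is not claimed here.

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical helper (not a seqan3/sharg API): returns true if the filename
// ends with '.' followed by one of the allowed extensions. An allowed
// extension may itself contain dots, e.g. "fastq.gz".
inline bool has_allowed_extension(std::filesystem::path const & path,
                                  std::vector<std::string> const & allowed)
{
    std::string const filename = path.filename().string();

    for (std::string const & extension : allowed)
    {
        std::string const suffix = "." + extension;

        // Require at least one character before the suffix, so a file that is
        // named just ".fastq.gz" is not accepted.
        if (filename.size() > suffix.size()
            && filename.compare(filename.size() - suffix.size(), suffix.size(), suffix) == 0)
        {
            return true;
        }
    }

    return false;
}
```

With this helper, `has_allowed_extension("reads.fastq.gz", {"fastq", "fastq.gz"})` is true, while the same call with only `{"fastq"}` in the list is false.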

rrahn commented 4 years ago

Yes, I guess that is somewhat related. Is the multiple-dot version restricted to two dots, assuming one valid format extension and one compression extension? Or are more complex cases possible?

sarahet commented 4 years ago

So far I can only think of one format extension plus one compression extension, but of course there could be other use cases that I don't have in mind right now.

smehringer commented 4 years ago

Is it a use case to compress your file multiple times, e.g. fasta.gz.bz2? (I don't even know if this makes sense, just wondering.)

rrahn commented 4 years ago

I think .fasta.tar.bz etc. could be a use case. Not that we would support that at the moment, but it may be interesting for the future.

joergi-w commented 3 years ago

Would it be a solution to have a constant list of "compression file extensions" similar to the sequence file extensions? For a given filename, the validator algorithm would be as follows:

  1. search for a compression extension
  2. if found: possibly flag the file as compressed or assign the compression stream already, then remove the extension
  3. repeat 1 and 2 until "not found" (alternatively allow e.g. .tar.bz being one element and list all common ones)
  4. check the validity of the remaining filename

I suggest not adding the compression extensions to the help page, as the product of [file ext.] x [compression ext.] is too large and not helpful to repeat for each input file parameter. We can document somewhere that files with certain extensions get implicitly extracted.
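The four steps above can be sketched as follows. This is only an illustration under stated assumptions: `extension_check` and `check_extensions` are hypothetical names, the extension lists are passed in by the caller, and no decompression stream is assigned (the sketch only records which compression extensions were stripped).

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical result type, not part of seqan3 or sharg.
struct extension_check
{
    bool valid{};                            // remaining extension is an accepted format extension
    std::vector<std::string> compressions{}; // stripped compression extensions, outermost first
};

// Hypothetical helper: strips compression extensions, then checks what is left.
inline extension_check check_extensions(std::filesystem::path path,
                                        std::vector<std::string> const & format_extensions,
                                        std::vector<std::string> const & compression_extensions)
{
    // Last extension of a path without the leading dot ("" if there is none).
    auto extension_of = [](std::filesystem::path const & p)
    {
        std::string ext = p.extension().string();
        if (!ext.empty())
            ext.erase(0, 1);
        return ext;
    };

    auto contains = [](std::vector<std::string> const & list, std::string const & value)
    {
        return std::find(list.begin(), list.end(), value) != list.end();
    };

    extension_check result{};

    // Steps 1-3: while the last extension is a compression extension,
    // record it and remove it from the path.
    for (std::string ext = extension_of(path); contains(compression_extensions, ext);
         ext = extension_of(path))
    {
        result.compressions.push_back(ext);
        path.replace_extension(); // drops the last extension
    }

    // Step 4: validate the remaining filename against the format extensions.
    result.valid = contains(format_extensions, extension_of(path));
    return result;
}
```

For `reads.fasta.gz.bz2` with format extensions `{"fasta", "fastq"}` and compression extensions `{"gz", "bz2"}`, the sketch reports a valid file with `bz2` and `gz` recorded, in that order. Treating `.tar.bz` as a single list element, the alternative mentioned in step 3, would need suffix matching instead of `path.extension()`.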

smehringer commented 2 years ago

Core Meeting 2022-03-22

We will defer this feature until the seqan3 I/O design is fixed.

In the meantime, @eseiler will post a workaround here:

eseiler commented 2 years ago

My workaround

```cpp
// Header names assumed (seqan3 3.2 layout).
#include <seqan3/argument_parser/all.hpp>
#include <seqan3/io/sequence_file/input.hpp>

class my_validator : public seqan3::input_file_validator<> // No template param in sharg
{
public:
    my_validator() : my_validator{combined_extensions} {}
    my_validator(my_validator const &) = default;
    my_validator & operator=(my_validator const &) = default;
    my_validator(my_validator &&) = default;
    my_validator & operator=(my_validator &&) = default;
    ~my_validator() = default;

    explicit my_validator(std::vector<std::string> const & extensions)
    {
        // my_validator::extensions_str = sharg::detail::to_string(extensions); // Sharg only
        my_validator::extensions = std::move(extensions);
    }

    // Optional for readable help page:
    std::string get_help_page_message() const
    {
        return seqan3::detail::to_string("The input file must exist and read permissions must be granted. "
                                         "Valid file extensions are: ",
                                         sequence_extensions,
#if defined(SEQAN3_HAS_BZIP2) || defined(SEQAN3_HAS_ZLIB)
                                         ", possibly followed by ",
                                         compression_extensions,
#endif
                                         '.');
    }

private:
    // Extensions of all formats that seqan3::sequence_file_input<> accepts.
    std::vector<std::string> sequence_extensions{
        seqan3::detail::valid_file_extensions<seqan3::sequence_file_input<>::valid_formats>()};

    // Compression extensions, depending on which libraries are available.
    std::vector<std::string> compression_extensions{[&] ()
    {
        std::vector<std::string> result;
#ifdef SEQAN3_HAS_BZIP2
        result.push_back("bz2");
#endif
#ifdef SEQAN3_HAS_ZLIB
        result.push_back("gz");
        result.push_back("bgzf");
#endif
        return result;
    }()};

    // Every sequence extension, plus every "sequence.compression" combination.
    std::vector<std::string> combined_extensions{[&] ()
    {
        if (compression_extensions.empty())
            return sequence_extensions;

        std::vector<std::string> result;
        for (auto && sequence_extension : sequence_extensions)
        {
            result.push_back(sequence_extension);
            for (auto && compression_extension : compression_extensions)
                result.push_back(sequence_extension + std::string{'.'} + compression_extension);
        }
        return result;
    }()};
};

int main()
{
    std::string some_path{};
    const char * argv[] = {"./test", "-h"};
    seqan3::argument_parser parser{"test_parser", 2, argv, seqan3::update_notifications::off};
    parser.add_option(some_path, 'i', "input", "Fancy descprition,", seqan3::option_spec::required, my_validator{});
    parser.parse();
}
```

Possible output

```
test_parser
===========

OPTIONS

  Basic options:
    -h, --help
          Prints the help page.
    -hh, --advanced-help
          Prints the help page including advanced options.
    --version
          Prints the version information.
    --copyright
          Prints the copyright/license information.
    --export-help (std::string)
          Export the help page information. Value must be one of [html, man].
    -i, --input (std::string)
          Fancy descprition, The input file must exist and read permissions
          must be granted. Valid file extensions are:
          [embl,fasta,fa,fna,ffn,faa,frn,fas,fastq,fq,genbank,gb,gbk,sam],
          possibly followed by [bz2,gz,bgzf].

VERSION
    Last update:
    test_parser version:
    SeqAn version: 3.2.0-rc.1
```

This works for both sharg and seqan3; I added two comments where the code for the two differs. Also, with sharg you would need to use the SHARG macros instead of the SEQAN3 ones.

The output for a failed validation is quite noisy for seqan3 (it will print all combinations of extensions X compression_extensions). With sharg, you could set extensions_str to the same string that the help page message returns and get a nicer error message.