kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

Error when a config file is missing #108

Closed kfuku52 closed 1 year ago

kfuku52 commented 1 year ago

Currently, amalgkit metadata requires a complete set of config files. However, some config files are rarely used, so amalgkit probably shouldn't raise an error but just print that it didn't detect a file that may be used as input. @Hego-CCTB Could you take care of it?

amalgkit metadata: start
Traceback (most recent call last):
  File "/opt/conda/envs/biotools/bin/amalgkit", line 378, in <module>
    args.handler(args)
  File "/opt/conda/envs/biotools/bin/amalgkit", line 14, in command_metadata
    metadata_main(args)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/amalgkit/metadata.py", line 615, in metadata_main
    check_config_dir(args)
  File "/opt/conda/envs/biotools/lib/python3.9/site-packages/amalgkit/metadata.py", line 31, in check_config_dir
    assert (af in files), 'config file not found: '+af
AssertionError: config file not found: exclude_keyword.config

Hego-CCTB commented 1 year ago

So I did some testing and technically amalgkit metadata works without any config files at all. But I think search_term_species.config and search_term_keyword.config should be there, just so amalgkit metadata doesn't try to get the whole SRA. I'd raise warnings for all missing files (saying that some metadata functionalities may not work properly) and raise an actual error for missing search_term_species.config and search_term_keyword.config. What do you think? Is this too lenient?

kfuku52 commented 1 year ago

There will be cases where the two files are not necessary. For example, if you specify a BioProject, the species name is redundant and in many cases does not need to be specified. Your concern seems to be less about the specific config files than about having too many SRA entries to process. It might make more sense to raise a warning when, for example, a search returns more than 10k entries.

Hego-CCTB commented 1 year ago

On the other hand, sometimes I want large amounts of entries too. When I was looking for potential species to add to my analysis, I made a fairly open query to gather as much information as possible and ended up with 100k+ entries I could then narrow down manually/by parsing.

In that case we can leave it to the user's discretion and just raise the warnings? Would you prefer a specific warning for each file (i.e. what impact not having that file may have), or just have the warning refer to the wiki where we explain the config files?

kfuku52 commented 1 year ago

> In that case we can leave it to the user's discretion and just raise the warnings?

Yes, let's do so.

> or just have the warning refer to the wiki where we explain the config files?

I like this idea!
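
For readers following along, the behavior agreed on here (warn and point to the wiki instead of failing on an assertion) could be sketched roughly as follows. This is an illustration only, not the code from the commit linked below: the function name `warn_missing_config_files`, its signature, and the `EXPECTED_CONFIG_FILES` list are hypothetical, and the list contains only the three file names mentioned in this thread.

```python
import os
import warnings

# Hypothetical list for illustration: only the config files named in this
# thread. The real set of config files used by amalgkit may be longer.
EXPECTED_CONFIG_FILES = [
    'exclude_keyword.config',
    'search_term_species.config',
    'search_term_keyword.config',
]


def warn_missing_config_files(config_dir, expected_files=EXPECTED_CONFIG_FILES):
    """Warn about each expected config file that is absent from config_dir.

    Unlike an assert, this never stops the run. It returns the list of
    missing file names so the caller can decide what to do with it.
    """
    present = set(os.listdir(config_dir))
    missing = [name for name in expected_files if name not in present]
    for name in missing:
        warnings.warn(
            'config file not found: ' + name + '. '
            'Some amalgkit metadata functionality may not work as expected; '
            'see the amalgkit wiki for a description of the config files.'
        )
    return missing
```

With a directory that contains only `search_term_species.config`, the call emits two warnings and returns the other two file names, while the metadata step itself would be free to continue.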

Hego-CCTB commented 1 year ago

https://github.com/kfuku52/amalgkit/commit/c2b8bebc7c2b0ec539cbb481231b041255552387