raw-lab / MetaCerberus

Python code for versatile Functional Ontology Assignments for Metagenomes searching via Hidden Markov Model (HMM) with environmental focus of shotgun metaomics data
BSD 3-Clause "New" or "Revised" License

Suggestions #2

Closed iferres closed 7 months ago

iferres commented 1 year ago

Hi, thank you for this awesome tool!

I have a couple of suggestions which may improve the (UX of the) software.

1) DB installation path: The --setup command doesn't allow setting a custom path to store the DB. Of course, in that case one should also be able to tell metacerberus where to find it (a new CLI option). It is sometimes useful to dockerize applications, and keeping the database outside the container results in a much smaller image. Multiple instances of the same image could then use a single DB installed on a shared disk (think of an HPC environment).

2) Documentation: It's not well documented that users can search more than one DB at once by passing them as a comma-separated argument (e.g. metacerberus.py ... --hmm VOG,PHROG); I had to guess it. It would also be nice to have documentation about the output files.

3) A --list_db command to check available DBs.

4) Prioritize DB (more difficult to implement, consider it just a comment): I have noticed that the best hit column may not select the best annotated DB (which is a subjective characteristic). For instance, in the case above, I would prioritize PHROG over VOG, since its annotation is better curated. But in cases where both DBs got hits, only the best hit is kept:

locus_tag       FOAM    KEGG    COG     CAZy    PHROG   VOG     Best hit        length_bp       e-value score   EC_number       product
FP13_CDS_0270                                   phrog_18957     VOG04864        VOG:VOG04864    211     1.8e-74 244.4           REFSEQ hypothetical protein

In the above case (I'm looking at the step_10-visualizeData/Protein_phanotate/annotation_summary.tsv file), I would prefer the PHROG annotation over VOG's, since I'm sure I would get more information than "hypothetical protein". It would be nice to be able to set a DB priority ranking.
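Independently of any built-in option, a ranking like this could be applied after the run by post-processing the summary table. A minimal sketch, assuming the file is tab-separated with the column names shown in the excerpt above; `pick_by_priority` and `reprioritize` are hypothetical helper names, not part of MetaCerberus, and the sample row is rebuilt from the excerpt with the empty cells inferred from the header:

```python
import csv
import io


def pick_by_priority(row, priority):
    """Return (db, hit_id) for the first database in `priority` that has a hit in `row`."""
    for db in priority:
        hit = (row.get(db) or "").strip()
        if hit:
            return db, hit
    return None, None  # no listed database has a hit for this locus


def reprioritize(tsv_text, priority):
    """Map each locus_tag to its preferred (db, hit_id) under the given ranking."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row["locus_tag"]: pick_by_priority(row, priority) for row in reader}


# Sample rebuilt from the excerpt above (assumption: columns are tab-separated).
HEADER = ["locus_tag", "FOAM", "KEGG", "COG", "CAZy", "PHROG", "VOG",
          "Best hit", "length_bp", "e-value", "score", "EC_number", "product"]
ROW = ["FP13_CDS_0270", "", "", "", "", "phrog_18957", "VOG04864",
       "VOG:VOG04864", "211", "1.8e-74", "244.4", "", "REFSEQ hypothetical protein"]
SAMPLE = "\t".join(HEADER) + "\n" + "\t".join(ROW) + "\n"

print(reprioritize(SAMPLE, ["PHROG", "VOG"]))
```

With the ranking PHROG, VOG the example locus resolves to the PHROG hit; reversing the ranking gives back the VOG hit. The product name would still have to be looked up from the chosen database's own annotation table, which this sketch does not attempt.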

Sorry if this sounds like a pedantic review, I just think metacerberus has great potential and I would like it to be more user-friendly :)

Regards

raw937 commented 1 year ago

Thank you for using MetaCerberus and for your suggestions.

We find your suggestions very helpful. Can I ask for a few small clarifications?

  1. You would like an option to keep the databases in a different location than the current default?
  2. I think we fixed this with the quick-start examples for database selection in the README. Please let us know if they are helpful. Are you asking for a map/README of the outputs? That we can do.
  3. Would this command check whether the databases are present?
  4. By default, the name listed in the 'product' column comes from the best HMM hit. In your example, VOG was the best hit, so we used its name as the product. We do store all the hits from each database in the output files. We will look into the VOG map file to see if we can get more informative product names. But do you think the PHROG product names should be the default?

No apologies needed. We really appreciate your suggestions and thoughts. We also think it has great potential, and user-friendliness is what we really strive for. Let us know your thoughts. We welcome suggestions and ways to make our tools better.

many thanks, RAW

iferres commented 1 year ago
  1. Some research infrastructures recommend against installing big databases in the same partition as the software and provide a dedicated partition for them. I think the ability to download them to a directory other than the default would help comply with such infrastructure requirements. Also, if I want to containerize the application (Docker, Docker-like tools, or Singularity/Apptainer), having the databases installed in the software directories makes the container too heavy.
  2. Yes, I think a quick note in the README and in the --help output would help. Regarding the output, a short description of the output subdirectories and of the files you consider most important would be enough.
  3. I... am actually not sure now how to implement it, given the first point (the ability to download DBs to a non-default directory) 😅 Maybe (just an idea) it could check the provided directory, as in metacerberus.py ... --db_dir /path/to/dbs --list_dbs, and check the default location if the user doesn't provide the --db_dir argument. What do you think? (Please name the arguments as you wish, I just invented some to illustrate the point.)
  4. I think your default approach is great. You can't possibly know each user's priorities in advance. I was thinking of a way to let each user decide which DB to prioritize instead of the default criterion (best hit). For instance, if I pass metacerberus.py ... --hmm PHROG,VOG,KOFam_prokaryote --prioritize_hmm_order, then the tool would return the PHROG annotation for proteins that have a hit in that DB, then the VOG annotation for those that have no PHROG hit but do have a VOG hit, and so on. My use case is that I'm interested in phages, and for my very particular case PHROG is the best DB.
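Points 1 and 3 above could be sketched together roughly as follows. This is only an illustration of the idea, not how MetaCerberus implements it: `--db_dir` and `--list_dbs` are the placeholder argument names from this comment, and the default directory and the "one `<NAME>.hmm` file per database" layout are assumptions made for the example.

```python
import argparse
from pathlib import Path

# Hypothetical default; the real default database location is not stated in this thread.
DEFAULT_DB_DIR = Path.home() / ".metacerberus_db_example"


def build_parser():
    parser = argparse.ArgumentParser(prog="db_dir_sketch")
    parser.add_argument("--db_dir", type=Path, default=None,
                        help="directory holding the databases (falls back to the default)")
    parser.add_argument("--list_dbs", action="store_true",
                        help="list the databases found in the resolved directory")
    return parser


def resolve_db_dir(args):
    """Use the user-supplied directory when given, otherwise the default location."""
    return args.db_dir if args.db_dir is not None else DEFAULT_DB_DIR


def list_dbs(db_dir):
    """Database names in db_dir, assuming one '<NAME>.hmm' file per database."""
    if not db_dir.is_dir():
        return []
    return sorted(p.stem for p in db_dir.glob("*.hmm"))


def main(argv=None):
    args = build_parser().parse_args(argv)
    db_dir = resolve_db_dir(args)
    if args.list_dbs:
        names = list_dbs(db_dir)
        print("\n".join(names) if names else f"no databases found in {db_dir}")
        return names
    return []
```

Calling `main(["--db_dir", "/path/to/dbs", "--list_dbs"])` would list whatever is found in the shared directory, while `main(["--list_dbs"])` falls back to the default location. The same resolved directory could then be mounted into a container from outside, which keeps the image small.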

Thanks for considering my suggestions and for the tool. You can close this issue, no need to report back :) Best!

raw-lab commented 7 months ago

We are working on a new version 1.2 based on your suggestions.

Thank you for using MetaCerberus! RAW lab