merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

[BUG] Ecophylo workflow: constraints on hmm_list.txt (hmm_source) #2050

Open hdore opened 1 year ago

hdore commented 1 year ago

Hi Anvi'o developers!

Short description of the problem

I've been trying to use the Ecophylo workflow (actually for now just to save the workflow graph) and I am getting errors concerning the hmm_source file even though I am using only "Internal" hmm profiles.

anvi'o version

Anvi'o .......................................: hope (v7.1-dev)

Profile database .............................: 38
Contigs database .............................: 20
Pan database .................................: 16
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Using an HPC under Linux.

Detailed description of the issue

I got 2 separate errors. At first, I had included all of Bacteria_71 AND Archaea_76 hmm profiles, and I got complaints about the 'index' not being unique. I assume that this is because some of the hmm names are the same for Bacteria_71 and Archaea_76, so the names were not unique. Now I can understand that we might want to make two separate analyses for Bacteria and Archaea.

Then I made 2 different files, and tried launching the workflow with only Bacteria_71 hmm profiles. I got the following error:

Traceback (most recent call last):
  File "/home/datawork-lmee-intranet-nos/conda-env/anvio-dev/github/anvio/bin/anvi-run-workflow", line 77, in <module>
    main(args)
  File "/home/datawork-lmee-intranet-nos/conda-env/anvio-dev/github/anvio/bin/anvi-run-workflow", line 50, in main
    M.init()
  File "/home/datawork-lmee-intranet-nos/conda-env/anvio-dev/github/anvio/anvio/workflows/ecophylo/__init__.py", line 170, in init
    self.init_hmm_list_txt()
  File "/home/datawork-lmee-intranet-nos/conda-env/anvio-dev/github/anvio/anvio/workflows/ecophylo/__init__.py", line 430, in init_hmm_list_txt
    raise ConfigError(f"Please do not use "-" in your external hmm names in: "
TypeError: unsupported operand type(s) for -: 'str' and 'str'
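
A note on why the traceback ends in a `TypeError` rather than the intended `ConfigError`: the message on line 430 puts unescaped double quotes around `-` inside a double-quoted f-string, so Python reads the expression as one string minus another. The sketch below reproduces that pattern in isolation. The function names and the `path` parameter are hypothetical, and the message is shortened because the rest of the original line is not shown in the traceback.

```python
def broken_message():
    # Same quoting pattern as the line in the traceback: the inner double
    # quotes end the f-string early, so this parses as <str> - <str>.
    return f"Please do not use "-" in your external hmm names in: "


def fixed_message(path):
    # Single quotes around the hyphen keep the whole message one string.
    return f"Please do not use '-' in your external hmm names in: {path}"


try:
    broken_message()
except TypeError as error:
    # Subtracting two strings is not defined, hence the TypeError.
    print(f"TypeError: {error}")

print(fixed_message("hmm_list.txt"))
```

If this reading is right, the hyphen check itself fired as designed and only the construction of its error message failed.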

This is more annoying: I was not using any external hmm, only internal ones. But some of them do have a - in their name (e.g. ATP-synt or tRNA-synt_1d). Since these are internal hmm profiles, I don't think I can change their names. Does that mean that we cannot use all of the internal hmm profiles from Bacteria_71?

I will move on by removing the few profiles that have a - in their name for now, but thought I should report this.
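
The two problems described above (repeated names and names containing `-`) can be checked before launching the workflow. This is a minimal sketch, not anvi'o code: it assumes that `hmm_list.txt` is a tab-delimited file with a header row and a `name` column, and the function `check_hmm_list`, the `Example_gene` row, and the `INTERNAL` path value are illustrative assumptions.

```python
import csv
import io
from collections import Counter


def check_hmm_list(handle):
    """Return (duplicate_names, hyphenated_names) from a tab-delimited HMM list.

    Assumes a header row with at least a `name` column (an assumption about
    the file layout, not something taken from the anvi'o source).
    """
    rows = list(csv.DictReader(handle, delimiter="\t"))
    names = [row["name"] for row in rows]
    counts = Counter(names)
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    hyphenated = sorted({name for name in names if "-" in name})
    return duplicates, hyphenated


# Made-up file content: only ATP-synt and tRNA-synt_1d come from the report;
# Example_gene is a hypothetical name used to show a duplicate.
example = io.StringIO(
    "name\tsource\tpath\n"
    "Example_gene\tBacteria_71\tINTERNAL\n"
    "ATP-synt\tBacteria_71\tINTERNAL\n"
    "tRNA-synt_1d\tBacteria_71\tINTERNAL\n"
    "Example_gene\tArchaea_76\tINTERNAL\n"
)
duplicates, hyphenated = check_hmm_list(example)
print("duplicate names:", duplicates)
print("names containing '-':", hyphenated)
```

To check a real file, pass an open file handle instead of the `io.StringIO` example, for instance `check_hmm_list(open("hmm_list.txt", newline=""))`.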

Thank you for all your hard work,

hdore

ivagljiva commented 11 months ago

Hi @hdore , I am sorry that this bug report has gone unanswered for so long. I know that Matt Schechter has been working hard with many updates and changes to the EcoPhylo workflow over the past year, so this may have already been solved in the meantime.

Are you still experiencing these errors? If yes, please report back (and if not, please close this issue). I will ping @mschecht here so that he gets notified of any updates to this issue thread :)

Thank you!

hdore commented 11 months ago

Hi @ivagljiva, Thanks for going through old issues! I got useful insights from @mschecht on Discord in the meantime.

I haven't tried to use EcoPhylo for a while, but since we are here, let me be a bit more precise. At the time I wrote this issue, I had not realized that EcoPhylo worked only with a single gene (at least when I was using it), so that explains the first error (though it might be useful to specify in the documentation that it works on only one gene at a time, if that's still the case -- I know that @mschecht mentioned once that one day it would be possible to run it on multiple genes).

For the second issue, it was more annoying, since it means that some internal hmm profiles cannot be used. I guess one could extract the profiles from Anvi'o and use them as external with a different name, but I did not get that far.

That being said, there are already a number of internal hmm profiles that can be used in EcoPhylo if the goal is to compare results between genes. And this workflow is very powerful!

I don't have plans to use Ecophylo in the coming weeks, so feel free to close the issue if the Anvi'o team considers that these are not useful improvements.

Best,

hdore

ivagljiva commented 11 months ago

This is really good to know! Thanks for the context @hdore :)

I am not sure of all the recent improvements to EcoPhylo, so I will leave this issue open for now, and @mschecht can decide where to go from here.

mschecht commented 11 months ago

Thanks, @ivagljiva for bringing this issue to my attention, and apologies again @hdore for not seeing this earlier.

I'll watch issues more carefully for EcoPhylo-related topics in the future, but please tag me just in case :)

> At the time I wrote this issue, I had not realized that EcoPhylo worked only with a single gene (at least when I was using it), so that explains the first error (though it might be useful to specify in the documentation that it works on only one gene at a time, if that's still the case -- I know that @mschecht mentioned once that one day it would be possible to run it on multiple genes).

Thanks for pointing this out @hdore. If you have a chance, can you check the updated EcoPhylo documentation to see if that's clearly stated now? If you have suggestions I'd greatly appreciate it.

> For the second issue, it was more annoying, since it means that some internal hmm profiles cannot be used. I guess one could extract the profiles from Anvi'o and use them as external with a different name, but I did not get that far.
>
> That being said, there are already a number of internal hmm profiles that can be used in EcoPhylo if the goal is to compare results between genes. And this workflow is very powerful!

The only internal HMMs from Bacteria_71 and Archaea_76 that break the workflow are the ones that do not work seamlessly with anvi-estimate-scg-taxonomy (there are only a couple, and fixing them is on my TODO list). You should be able to run any of these internal HMMs in those collections just fine, separately and one at a time.

As @ivagljiva said, there have been many changes since this issue was posted so I think it's worth re-running your data. If you are still having issues please reach out and I'd be happy to help.