merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0
413 stars 142 forks source link

[BUG] `anvi-get-sequences-for-hmm-hits` alignment silently fails #2200

Closed Ge0rges closed 5 months ago

Ge0rges commented 5 months ago

Short description of the problem

anvi-get-sequences-for-hmm-hits fails saying the expected output from muscle is missing when large genes are present.

anvi'o version

Anvi'o .......................................: marie (v8-dev)
Python .......................................: 3.10.12

Profile database .............................: 40
Contigs database .............................: 22
Pan database .................................: 17
Genome data storage ..........................: 7
Auxiliary data storage .......................: 2
Structure database ...........................: 2
Metabolic modules database ...................: 4
tRNA-seq database ............................: 2

System info

Installed according to provided instructions on Rocky Linux.

Detailed description of the issue

Running anvi-get-sequences-for-hmm-hits --external-genomes external-genomes.txt --hmm-source RM_DefenseFinder --return-best-hit --get-aa-sequences --concatenate -o concatenated-rm-proteins.fa on around 85 000 databases. The command fails saying the expected output from muscle is missing. When I look at the referenced log, it is devoid of errors and just lists the command that was run muscle -quiet.

This was originally posted on Discord, and moved here following a suggestion by meren.

@meren stated:

It happens when there are extremely long 'genes'. Like over 50000 nucleotides due to poor gene calls, which makes muscle fail.

A conversation then followed regarding potential solutions the most relevant (and recent) exchange is quoted below.

Meren:

We could add a flag to the program that says --ignore-gene-calls-longer-than XXXnts. But in that case you wouldn't want to get gene calls for a given model in your HMM collection from all genomes except one that has the long gene call for one genome. Since there is a difference between a gene that is missing from a genome, and one that is discarded due to a problem. I'm not sure what is the best way to handle this explicitly.

Me:

I think such a flag would be acceptable, especially if the length of a gene is not compatible with the HMM model being used. I.e. if a model is compatible with a gene up to X length, but we're ignoring genes greater than Y, and Y >> X, then this doesn't affect the results presumably? Then if X ~= Y, we could print a warning?

meren commented 5 months ago

Thank you very much for taking the time to report this, @Ge0rges. The changes are in the master repository now. It seemed to work well during my tests, and I hope it will serve you well, too.

Please keep us posted regarding your experience with this.

Ge0rges commented 5 months ago

Hi @meren, writing back with an update. I ran the command with --ignore-genes-longer-than 40000 appended and have run into the same issue. I don't think this is a bug in your fix however based on the anvio output. Perhaps just an issue with my data?

Here's some of the last output from my command along with the contents of the log (identical to before):

WARNING
===============================================
SequencesForHMMHits class here. The current database (at /localdata/researchdriv
e/gkanaan/methylation_phylogeny/anvio_dbs/GCF_943192935_1_genomic.db) contains 0
HMM hits, at least within the HMM sources or splits that were requested. It
might not be a problem for your case, but we just thought you should know, in
case it is. So there you have it.

Sources ......................................: RM_DefenseFinder                                                                                                                                                                                                                          
Hits .........................................: 2109054 HMM hits for 1 source(s)
Genes of interest ............................: None

A MESSAGE FROM YOUR GENE LENGTH FILTER 📏
===============================================
You asked anvi'o to remove genes that are longer than 40000 nts from your HMM
hits. But none of the gene calls were longer than that value, so you get to keep
everything.

WARNING
===============================================
You requested only the best hits to be reported, which means, if, say, there are
more than one RecA hits in a bin for a given HMM source, only the one with the
lowest e-value will be kept, and others will be removed from your final results.

WARNING
===============================================
You requested to get only the best hits, but you did not provide a profile
database. At this point anvi'o just hopes you know what you are doing. Since
this is like the zone of 'potentially a terrible idea but it may be quite
relevant when done right'.

Filtered hits ................................: 1331928 hits remain after removing weak hits for multiple genes

CITATION
===============================================
The workflow you are using will likely use 'muscle' by Edgar,
doi:10.1093/nar/gkh340 (http://www.drive5.com/muscle) to align your sequences.
If you publish your findings, please do not forget to properly credit this tool.

WARNING
===============================================
You did not define any gene names. Bold move. Now anvi'o will attempt to report
a file with all genes defined in your HMM source(s). This will likely be quite
ugly, so please brace yourself.

[18 Jan 24 20:53:26 Aligning homolog gene sequences pre-concatenation] working on Type_II_MTases_FAM_3 (1 of 114) ...                                                                                                                                                                     

Config Error: Drivers::Muscle: Something went wrong with this alignment that was working on
              39448 sequences :/ You can find the output in this log file:
              /tmp/tmpdv7r35yp/00_log.txt

(anvio-dev) [gkanaan@deming methylation_phylogeny]$  cat /tmp/tmpdv7r35yp/00_log.txt
# DATE: 18 Jan 24 20:53:26
# CMD LINE: muscle -quiet
# THIS IS THE OUTPUT YOU ARE LOOKING FOR:

Do you think this is still due to a gene length issue or should I look into other causes?

meren commented 5 months ago

Why did you start with 40K and not something like 5K or 10K? If you don't have a specific reason, please take a more length-conservative step first.

If that fails too, please send me the FASTA file that is failing privately so I can see if there is anything funny I can find :)

Ge0rges commented 5 months ago

Hi @meren I lowered it to 25000 which is which still excluded 0 sequences, I looked at the logs and it seems like there are few longest sequences at around 24K so that makes sense. Is that still beyond the expected length for functionality? I could rerun it with a significantly lower filter as you suggested but that's going to start excluding quite a few sequences.

By FASTA file do you mean the genome DB?

meren commented 5 months ago

At this stage it is not about getting the final results from your analysis but to figure out if the gene lengths are playing a role in muscle's demise. If you have time, please consider trying it with 10K so we have an answer here. Keep in mind that the average gene length for bacteria and archae is 1K.

By the FASTA I meant the gene sequences file muscle is working on before crashing. It should be shown in the logs. If not, I should implement it so you can fetch the problematic FASTA.

Ge0rges commented 5 months ago

Good idea @meren, just a heads up that I don't see a FASTA sequence in the log file anvio points to or in anvio's output. I'm trying with 10K now.

meren commented 5 months ago

Hey @Ge0rges, today I realized that the way we have implemented sequence alignment step for phylogenomics was not necessarily most reliable one when working with extremely large numbers of sequences. Perhaps this issue will be completely resolved by this change that I just merged into the master repository. Would you also consider running your analysis exactly the same way before after a git pull?

I also included support for --debug flag for the alignment step. If you use that flag, every single FASTA file used for the alignment will be stored in a temp directory (which will not be deleted at the end of the run) for debugging purposes (but you will have to manually remove that directory later). You will see these kinds of messages in your terminal if you use that flag after this update:

[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpeujk60a4/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpeujk60a4/output.fa -quiet

[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpzsgrynh1/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpzsgrynh1/output.fa -quiet

[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpw6pz5p2q/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpw6pz5p2q/output.fa -quiet

[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmplyiqlqm8/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmplyiqlqm8/output.fa -quiet

[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpghyej_65/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpghyej_65/output.fa -quiet

[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpl78tx6z5/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpl78tx6z5/output.fa -quiet

And the last one before crashing will likely be the one that we will need to examine more carefully :)

Apologies for the inconvenience.

Ge0rges commented 5 months ago

Hi @meren, FYI before your latest changes a filter of 10000 produced the same result. I am now retuning with your new changes and the --debug flag with 10000 length filter. Will report back! Thank you for your continued help.

Ge0rges commented 5 months ago

Hey @meren the attached file fails with a filter of 10000. Here's the log:

WARNING
===============================================
You did not define any gene names. Bold move. Now anvi'o will attempt to report
a file with all genes defined in your HMM source(s). This will likely be quite
ugly, so please brace yourself.

[DEBUG] `run_command` is running .............: muscle -in /tmp/tmp708ev8ce/input.fa -out /tmp/tmp708ev8ce/output.fa -quiet                                                                     

Traceback for debugging
================================================================================
  File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 343, in <module>                                                                                              
    main(args)
  File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 263, in main
    s.store_hmm_sequences_into_FASTA(hmm_sequences_dict,
  File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 922, in store_hmm_sequences_into_FASTA                                                                                            
    self.__store_concatenated_hmm_sequences_into_FASTA(hmm_sequences_dict_for_splits, output_file_path, partition_file_path, wrap, separator, genes_order, align_with, just_do_it)              
  File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 812, in __store_concatenated_hmm_sequences_into_FASTA                                                                             
    genes_in_bins_dict[gene_name] = aligner(run=terminal.Run(verbose=False)).run_default(genes_list, debug=anvio.DEBUG)                                                                         
  File "/Accounts/gkanaan/github/anvio/anvio/drivers/muscle.py", line 120, in run_default
    output = utils.run_command(cmd_line, log_file_path)
  File "/Accounts/gkanaan/github/anvio/anvio/utils.py", line 514, in run_command
    raise ConfigError("Command failed to run. What command, you say? This: '%s'" % ' '.join(cmdline))                                                                                           
================================================================================

Config Error: Command failed to run. What command, you say? This: 'muscle -in
              /tmp/tmp708ev8ce/input.fa -out /tmp/tmp708ev8ce/output.fa -quiet'

input.txt

meren commented 5 months ago

Hey @Ge0rges, thanks for this. I'm looking into it. In the meantime, you should find a log file under the directory /tmp/tmp708ev8ce/. Do you see anything interesting in there? Can you please share that one too?

Ge0rges commented 5 months ago

Hi @meren thanks for your time this again, that logo ice is empty other than # CMD LINE: muscle -in /tmp/tmp708ev8ce/input.fa -out /tmp/tmp708ev8ce/output.fa -quiet

meren commented 5 months ago

Hey @Ge0rges, I just realized that the log is empty due to the -quiet flag anvi'o is passing to muscle :/ Here is a test and how it looks like on my end with that flag hardcoded in anvio/drivers/muscle.py:

cat /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp3hgfccxv/00_log.txt
# DATE: 31 Jan 24 19:38:13
# CMD LINE: muscle -in /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp3hgfccxv/input.fa -out /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp3hgfccxv/output.fa -quiet

And here is the log file for the same command after removing it:

cat /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp7p4qe4px/00_log.txt
# DATE: 31 Jan 24 19:37:44
# CMD LINE: muscle -in /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp7p4qe4px/input.fa -out /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp7p4qe4px/output.fa

MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

input 8 seqs, lengths min 114, max 135, avg 121
00:00:00      1 MB(0%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00      1 MB(0%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00      2 MB(0%)  Iter   1  100.00%  Align node
00:00:00      2 MB(0%)  Iter   1  100.00%  Root alignment
00:00:00      2 MB(0%)  Iter   2  100.00%  Refine tree
00:00:00      2 MB(0%)  Iter   2  100.00%  Root alignment
00:00:00      2 MB(0%)  Iter   2  100.00%  Root alignment
00:00:00      2 MB(0%)  Iter   3  100.00%  Refine biparts
00:00:00      2 MB(0%)  Iter   4  100.00%  Refine biparts
00:00:00      2 MB(0%)  Iter   5  100.00%  Refine biparts
00:00:00      2 MB(0%)  Iter   6  100.00%  Refine biparts

I just committed this change, so when you run your command again after a git pull you certainly will see something from muscle for failed alignments. Apologies for all the back and forth here and thank you for your patience!

Ge0rges commented 5 months ago

Here's anvio's output:

[DEBUG] `run_command` is running .............: muscle -in /tmp/tmprlsurk6m/input.fa -out /tmp/tmprlsurk6m/output.fa                                         

Traceback for debugging
================================================================================                                                                             
  File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 343, in <module>                                                           
    main(args)
  File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 263, in main                                                               
    s.store_hmm_sequences_into_FASTA(hmm_sequences_dict,
  File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 922, in store_hmm_sequences_into_FASTA                                                         
    self.__store_concatenated_hmm_sequences_into_FASTA(hmm_sequences_dict_for_splits, output_file_path, partition_file_path, wrap, separator, genes_order, align_with, just_do_it)
  File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 812, in __store_concatenated_hmm_sequences_into_FASTA                                          
    genes_in_bins_dict[gene_name] = aligner(run=terminal.Run(verbose=False)).run_default(genes_list, debug=anvio.DEBUG)                                      
  File "/Accounts/gkanaan/github/anvio/anvio/drivers/muscle.py", line 120, in run_default                                                                    
    output = utils.run_command(cmd_line, log_file_path)
  File "/Accounts/gkanaan/github/anvio/anvio/utils.py", line 514, in run_command                                                                             
    raise ConfigError("Command failed to run. What command, you say? This: '%s'" % ' '.join(cmdline))                                                        
================================================================================                                                                             

Config Error: Command failed to run. What command, you say? This: 'muscle -in
              /tmp/tmprlsurk6m//input.fa -out /tmp/tmprlsurk6m/output.fa

Attached is the input.fa (renamed .txt for GitHub upload) located in /tmp/tmprlsurk6m/. input.txt

The log file is huge because many lines are just copy of themselves with the timestamp updated. So here's the output instead, of cat 00_log.txt in that same directory. I manually confirmed the last line of the file resembled what cat printed in case it omitted errors for some reason.

# DATE: 31 Jan 24 18:50:12
# CMD LINE: muscle -in /tmp/tmprlsurk6m/input.fa -out /tmp/tmprlsurk6m/output.fa

MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

input 39448 seqs, lengths min 61, max 2119, avg 352
01:32:55  12555 MB(-786%)  Iter   1  100.00%  K-mer dist pass 1
01:34:45  12555 MB(-786%)  Iter   1  100.00%  K-mer dist pass 2
01:41:04  12555 MB(-786%)  Iter   1  100.00%  Align node       
01:41:20  12555 MB(-786%)  Iter   1  100.00%  Root alignment
28:46:02  12555 MB(-786%)  Iter   2  100.00%  Refine tree   
28:46:14  12555 MB(-786%)  Iter   2  100.00%  Root alignment
28:46:21  12555 MB(-786%)  Iter   2  100.00%  Root alignment

Seems to me like it's running fine, which is making me wonder if perhaps this isn't something ridiculous on my end like running out of memory (although the server has over 1TB of RAM...).

meren commented 5 months ago

I downloaded your input.txt and did the following to see what the output looks like:

$ mv input.txt input.orig
$ head -n 100 input.orig > input.txt
$ muscle -in input.txt -out input.aligned

The output for this looked like this:

MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

input 500 seqs, lengths min 124, max 695, avg 348
00:00:00      4 MB(0%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00      4 MB(0%)  Iter   1  100.00%  K-mer dist pass 2
00:00:02    152 MB(0%)  Iter   1  100.00%  Align node
00:00:02    153 MB(0%)  Iter   1  100.00%  Root alignment
00:00:06    160 MB(0%)  Iter   2  100.00%  Refine tree
00:00:06    162 MB(0%)  Iter   2  100.00%  Root alignment
00:00:06    162 MB(0%)  Iter   2  100.00%  Root alignment
00:01:39    203 MB(1%)  Iter   3  100.00%  Refine biparts
00:03:01    207 MB(1%)  Iter   4  100.00%  Refine biparts
00:04:24    212 MB(1%)  Iter   5  100.00%  Refine biparts
00:05:45    217 MB(1%)  Iter   6  100.00%  Refine biparts
00:05:50    217 MB(1%)  Iter   7  100.00%  Refine biparts
00:05:50    217 MB(1%)  Iter   7  100.00%  Refine biparts

Which is different than yours. I'm wondering about a few things:

image
export MUSCLE_PARAMS="-maxhours 6 -maxiters 2"

In addition to the last idea, can you imagine a way to improve things on anvi'o side?

meren commented 5 months ago

I now added this last idea to master if you like to try with -maxiters 2. It seems to be working as expected on my test case:

$ export MUSCLE_PARAMS="-maxiters 2"
$ anvi-get-sequences-for-hmm-hits -c CONTIGS.db --hmm-source Bacteria_71 --gene-names "Ribosomal_S13,RNA_pol_L" --concat --return-best-hit --get-aa-sequences --no-wrap --debug -p PROFILE.db -C default

(...)
[DEBUG] `run_command` is running .............: muscle -in /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpov2aji31/input.fa -out /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpov2aji31/output.fa -maxiters 2
(...)
$ cat /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpqs05cuat/00_log.txt
# DATE: 03 Feb 24 12:48:28
# CMD LINE: muscle -in /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpqs05cuat/input.fa -out /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpqs05cuat/output.fa -maxiters 2

MUSCLE v3.8.1551 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

input 8 seqs, lengths min 114, max 135, avg 121
00:00:00      1 MB(0%)  Iter   1  100.00%  K-mer dist pass 1
00:00:00      1 MB(0%)  Iter   1  100.00%  K-mer dist pass 2
00:00:00      2 MB(0%)  Iter   1  100.00%  Align node
00:00:00      2 MB(0%)  Iter   1  100.00%  Root alignment
00:00:00      2 MB(0%)  Iter   2  100.00%  Refine tree
00:00:00      2 MB(0%)  Iter   2  100.00%  Root alignment
00:00:00      2 MB(0%)  Iter   2  100.00%  Root alignment
Ge0rges commented 5 months ago

Hi @meren seems like this is not an issue with Anvi'o. I tried running the same command as you (I weirdly do get negative usage numbers) but eventually run into a segfault in muscle. Likely due to the size and type of the data. Guess that's what I get for trying to align all the viral defense genes in all the GTDB genomes! Sorry for wasting your time on this, I will continue looking for an alternative on my end.

meren commented 5 months ago

Thanks for looking into this further, @Ge0rges. I wonder if it is worthwhile trying it with famsa. If it works, anvi'o has a driver for it as well and you can change the aligner through a command-line option for a workflow.

Otherwise, it may be possible to use something like MMSeqs2 to cluster the sequences first at a level of 80% similarity or something, and then align individually the top 5-6 clusters which would likely make up over 90% of sequences.

As the number of genomes increase, we will continue running into issues like this so this was very helpful to prepare for those times and I am sorry anvi'o couldn't be more helpful to you :) But if you would like to investigate alternative approaches together and bounce ideas with someone, I'm here for that.

Best wishes,

Ge0rges commented 5 months ago

Hi @meren, I'm definitely willing to be your guinea pig here if that's more widely useful. I will say that with maxiters 2, I was able to get my task to complete. I'm unsure however how that has affected the alignment quality (since the better one doesn't complete and I do not have a ground truth). I will try famsa and report back.

Ge0rges commented 4 months ago

Hi @meren, sorry for the delay on this, but I just got to trying famsa and got this error:

Traceback for debugging
================================================================================                                                                             
  File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 343, in <module>                                                           
    main(args)
  File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 263, in main                                                               
    s.store_hmm_sequences_into_FASTA(hmm_sequences_dict,
  File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 922, in store_hmm_sequences_into_FASTA                                                         
    self.__store_concatenated_hmm_sequences_into_FASTA(hmm_sequences_dict_for_splits, output_file_path, partition_file_path, wrap, separator, genes_order, align_with, just_do_it)
  File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 812, in __store_concatenated_hmm_sequences_into_FASTA                                          
    genes_in_bins_dict[gene_name] = aligner(run=terminal.Run(verbose=False)).run_default(genes_list, debug=anvio.DEBUG)                                      
  File "/Accounts/gkanaan/github/anvio/anvio/drivers/famsa.py", line 107, in run_default                                                                     
    raise ConfigError("Default alignment option for Famsa is not implemented :/")                                                                            
================================================================================                                                                             

Config Error: Default alignment option for Famsa is not implemented :/
meren commented 4 months ago

image

I will try to implement this ASAP. It'll probably take a few days :(