Closed Ge0rges closed 5 months ago
Thank you very much for taking the time to report this, @Ge0rges. The changes are in the master repository now. It seemed to work well during my tests, and I hope it will serve you well, too.
Please keep us posted regarding your experience with this.
Hi @meren, writing back with an update. I ran the command with --ignore-genes-longer-than 40000
appended and have run into the same issue. I don't think this is a bug in your fix however based on the anvio output. Perhaps just an issue with my data?
Here's some of the last output from my command along with the contents of the log (identical to before):
WARNING
===============================================
SequencesForHMMHits class here. The current database (at /localdata/researchdriv
e/gkanaan/methylation_phylogeny/anvio_dbs/GCF_943192935_1_genomic.db) contains 0
HMM hits, at least within the HMM sources or splits that were requested. It
might not be a problem for your case, but we just thought you should know, in
case it is. So there you have it.
Sources ......................................: RM_DefenseFinder
Hits .........................................: 2109054 HMM hits for 1 source(s)
Genes of interest ............................: None
A MESSAGE FROM YOUR GENE LENGTH FILTER 📏
===============================================
You asked anvi'o to remove genes that are longer than 40000 nts from your HMM
hits. But none of the gene calls were longer than that value, so you get to keep
everything.
WARNING
===============================================
You requested only the best hits to be reported, which means, if, say, there are
more than one RecA hits in a bin for a given HMM source, only the one with the
lowest e-value will be kept, and others will be removed from your final results.
WARNING
===============================================
You requested to get only the best hits, but you did not provide a profile
database. At this point anvi'o just hopes you know what you are doing. Since
this is like the zone of 'potentially a terrible idea but it may be quite
relevant when done right'.
Filtered hits ................................: 1331928 hits remain after removing weak hits for multiple genes
CITATION
===============================================
The workflow you are using will likely use 'muscle' by Edgar,
doi:10.1093/nar/gkh340 (http://www.drive5.com/muscle) to align your sequences.
If you publish your findings, please do not forget to properly credit this tool.
WARNING
===============================================
You did not define any gene names. Bold move. Now anvi'o will attempt to report
a file with all genes defined in your HMM source(s). This will likely be quite
ugly, so please brace yourself.
[18 Jan 24 20:53:26 Aligning homolog gene sequences pre-concatenation] working on Type_II_MTases_FAM_3 (1 of 114) ...
Config Error: Drivers::Muscle: Something went wrong with this alignment that was working on
39448 sequences :/ You can find the output in this log file:
/tmp/tmpdv7r35yp/00_log.txt
(anvio-dev) [gkanaan@deming methylation_phylogeny]$ cat /tmp/tmpdv7r35yp/00_log.txt
# DATE: 18 Jan 24 20:53:26
# CMD LINE: muscle -quiet
# THIS IS THE OUTPUT YOU ARE LOOKING FOR:
Do you think this is still due to a gene length issue or should I look into other causes?
Why did you start with 40K and not something like 5K or 10K? If you don't have a specific reason, please take a more length-conservative step first.
If that fails too, please send me the FASTA file that is failing privately so I can see if there is anything funny I can find :)
Hi @meren I lowered it to 25000 which is which still excluded 0 sequences, I looked at the logs and it seems like there are few longest sequences at around 24K so that makes sense. Is that still beyond the expected length for functionality? I could rerun it with a significantly lower filter as you suggested but that's going to start excluding quite a few sequences.
By FASTA file do you mean the genome DB?
At this stage it is not about getting the final results from your analysis but to figure out if the gene lengths are playing a role in muscle's demise. If you have time, please consider trying it with 10K so we have an answer here. Keep in mind that the average gene length for bacteria and archae is 1K.
By the FASTA I meant the gene sequences file muscle is working on before crashing. It should be shown in the logs. If not, I should implement it so you can fetch the problematic FASTA.
Good idea @meren, just a heads up that I don't see a FASTA sequence in the log file anvio points to or in anvio's output. I'm trying with 10K now.
Hey @Ge0rges, today I realized that the way we have implemented sequence alignment step for phylogenomics was not necessarily most reliable one when working with extremely large numbers of sequences. Perhaps this issue will be completely resolved by this change that I just merged into the master
repository. Would you also consider running your analysis exactly the same way before after a git pull
?
I also included support for --debug
flag for the alignment step. If you use that flag, every single FASTA file used for the alignment will be stored in a temp directory (which will not be deleted at the end of the run) for debugging purposes (but you will have to manually remove that directory later). You will see these kinds of messages in your terminal if you use that flag after this update:
[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpeujk60a4/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpeujk60a4/output.fa -quiet
[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpzsgrynh1/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpzsgrynh1/output.fa -quiet
[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpw6pz5p2q/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpw6pz5p2q/output.fa -quiet
[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmplyiqlqm8/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmplyiqlqm8/output.fa -quiet
[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpghyej_65/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpghyej_65/output.fa -quiet
[DEBUG] `run_command` is running .............: muscle -in /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpl78tx6z5/input.fa -out /var/folders/gw/5mdblzs94gsb1ss44llgl3_h0000gn/T/tmpl78tx6z5/output.fa -quiet
And the last one before crashing will likely be the one that we will need to examine more carefully :)
Apologies for the inconvenience.
Hi @meren, FYI before your latest changes a filter of 10000
produced the same result. I am now retuning with your new changes and the --debug
flag with 10000
length filter. Will report back! Thank you for your continued help.
Hey @meren the attached file fails with a filter of 10000
. Here's the log:
WARNING
===============================================
You did not define any gene names. Bold move. Now anvi'o will attempt to report
a file with all genes defined in your HMM source(s). This will likely be quite
ugly, so please brace yourself.
[DEBUG] `run_command` is running .............: muscle -in /tmp/tmp708ev8ce/input.fa -out /tmp/tmp708ev8ce/output.fa -quiet
Traceback for debugging
================================================================================
File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 343, in <module>
main(args)
File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 263, in main
s.store_hmm_sequences_into_FASTA(hmm_sequences_dict,
File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 922, in store_hmm_sequences_into_FASTA
self.__store_concatenated_hmm_sequences_into_FASTA(hmm_sequences_dict_for_splits, output_file_path, partition_file_path, wrap, separator, genes_order, align_with, just_do_it)
File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 812, in __store_concatenated_hmm_sequences_into_FASTA
genes_in_bins_dict[gene_name] = aligner(run=terminal.Run(verbose=False)).run_default(genes_list, debug=anvio.DEBUG)
File "/Accounts/gkanaan/github/anvio/anvio/drivers/muscle.py", line 120, in run_default
output = utils.run_command(cmd_line, log_file_path)
File "/Accounts/gkanaan/github/anvio/anvio/utils.py", line 514, in run_command
raise ConfigError("Command failed to run. What command, you say? This: '%s'" % ' '.join(cmdline))
================================================================================
Config Error: Command failed to run. What command, you say? This: 'muscle -in
/tmp/tmp708ev8ce/input.fa -out /tmp/tmp708ev8ce/output.fa -quiet'
Hey @Ge0rges, thanks for this. I'm looking into it. In the meantime, you should find a log file under the directory /tmp/tmp708ev8ce/
. Do you see anything interesting in there? Can you please share that one too?
Hi @meren thanks for your time this again, that logo ice is empty other than # CMD LINE: muscle -in /tmp/tmp708ev8ce/input.fa -out /tmp/tmp708ev8ce/output.fa -quiet
Hey @Ge0rges, I just realized that the log is empty due to the -quiet
flag anvi'o is passing to muscle
:/ Here is a test and how it looks like on my end with that flag hardcoded in anvio/drivers/muscle.py
:
cat /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp3hgfccxv/00_log.txt
# DATE: 31 Jan 24 19:38:13
# CMD LINE: muscle -in /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp3hgfccxv/input.fa -out /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp3hgfccxv/output.fa -quiet
And here is the log file for the same command after removing it:
cat /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp7p4qe4px/00_log.txt
# DATE: 31 Jan 24 19:37:44
# CMD LINE: muscle -in /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp7p4qe4px/input.fa -out /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmp7p4qe4px/output.fa
MUSCLE v3.8.1551 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
input 8 seqs, lengths min 114, max 135, avg 121
00:00:00 1 MB(0%) Iter 1 100.00% K-mer dist pass 1
00:00:00 1 MB(0%) Iter 1 100.00% K-mer dist pass 2
00:00:00 2 MB(0%) Iter 1 100.00% Align node
00:00:00 2 MB(0%) Iter 1 100.00% Root alignment
00:00:00 2 MB(0%) Iter 2 100.00% Refine tree
00:00:00 2 MB(0%) Iter 2 100.00% Root alignment
00:00:00 2 MB(0%) Iter 2 100.00% Root alignment
00:00:00 2 MB(0%) Iter 3 100.00% Refine biparts
00:00:00 2 MB(0%) Iter 4 100.00% Refine biparts
00:00:00 2 MB(0%) Iter 5 100.00% Refine biparts
00:00:00 2 MB(0%) Iter 6 100.00% Refine biparts
I just committed this change, so when you run your command again after a git pull
you certainly will see something from muscle
for failed alignments. Apologies for all the back and forth here and thank you for your patience!
Here's anvio
's output:
[DEBUG] `run_command` is running .............: muscle -in /tmp/tmprlsurk6m/input.fa -out /tmp/tmprlsurk6m/output.fa
Traceback for debugging
================================================================================
File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 343, in <module>
main(args)
File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 263, in main
s.store_hmm_sequences_into_FASTA(hmm_sequences_dict,
File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 922, in store_hmm_sequences_into_FASTA
self.__store_concatenated_hmm_sequences_into_FASTA(hmm_sequences_dict_for_splits, output_file_path, partition_file_path, wrap, separator, genes_order, align_with, just_do_it)
File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 812, in __store_concatenated_hmm_sequences_into_FASTA
genes_in_bins_dict[gene_name] = aligner(run=terminal.Run(verbose=False)).run_default(genes_list, debug=anvio.DEBUG)
File "/Accounts/gkanaan/github/anvio/anvio/drivers/muscle.py", line 120, in run_default
output = utils.run_command(cmd_line, log_file_path)
File "/Accounts/gkanaan/github/anvio/anvio/utils.py", line 514, in run_command
raise ConfigError("Command failed to run. What command, you say? This: '%s'" % ' '.join(cmdline))
================================================================================
Config Error: Command failed to run. What command, you say? This: 'muscle -in
/tmp/tmprlsurk6m//input.fa -out /tmp/tmprlsurk6m/output.fa
Attached is the input.fa
(renamed .txt for GitHub upload) located in /tmp/tmprlsurk6m/
.
input.txt
The log file is huge because many lines are just copy of themselves with the timestamp updated. So here's the output instead, of cat 00_log.txt
in that same directory. I manually confirmed the last line of the file resembled what cat
printed in case it omitted errors for some reason.
# DATE: 31 Jan 24 18:50:12
# CMD LINE: muscle -in /tmp/tmprlsurk6m/input.fa -out /tmp/tmprlsurk6m/output.fa
MUSCLE v3.8.1551 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
input 39448 seqs, lengths min 61, max 2119, avg 352
01:32:55 12555 MB(-786%) Iter 1 100.00% K-mer dist pass 1
01:34:45 12555 MB(-786%) Iter 1 100.00% K-mer dist pass 2
01:41:04 12555 MB(-786%) Iter 1 100.00% Align node
01:41:20 12555 MB(-786%) Iter 1 100.00% Root alignment
28:46:02 12555 MB(-786%) Iter 2 100.00% Refine tree
28:46:14 12555 MB(-786%) Iter 2 100.00% Root alignment
28:46:21 12555 MB(-786%) Iter 2 100.00% Root alignment
Seems to me like it's running fine, which is making me wonder if perhaps this isn't something ridiculous on my end like running out of memory (although the server has over 1TB of RAM...).
I downloaded your input.txt
and did the following to see what the output looks like:
$ mv input.txt input.orig
$ head -n 100 input.orig > input.txt
$ muscle -in input.txt -out input.aligned
The output for this looked like this:
MUSCLE v3.8.1551 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
input 500 seqs, lengths min 124, max 695, avg 348
00:00:00 4 MB(0%) Iter 1 100.00% K-mer dist pass 1
00:00:00 4 MB(0%) Iter 1 100.00% K-mer dist pass 2
00:00:02 152 MB(0%) Iter 1 100.00% Align node
00:00:02 153 MB(0%) Iter 1 100.00% Root alignment
00:00:06 160 MB(0%) Iter 2 100.00% Refine tree
00:00:06 162 MB(0%) Iter 2 100.00% Root alignment
00:00:06 162 MB(0%) Iter 2 100.00% Root alignment
00:01:39 203 MB(1%) Iter 3 100.00% Refine biparts
00:03:01 207 MB(1%) Iter 4 100.00% Refine biparts
00:04:24 212 MB(1%) Iter 5 100.00% Refine biparts
00:05:45 217 MB(1%) Iter 6 100.00% Refine biparts
00:05:50 217 MB(1%) Iter 7 100.00% Refine biparts
00:05:50 217 MB(1%) Iter 7 100.00% Refine biparts
Which is different than yours. I'm wondering about a few things:
-786%
looks weird to me as I'm not sure why there would be a negative number there unless there was a misunderstanding about the actual memory and available memory somehow :/ In my output memory is calculated appropriately. There is also a big jum from Iter 2 to Iter 3 in my case. I wonder if that is why the job fails right at the end of Iter 2.muscle
has maxiters and maxhours parameters that can be set. Without adding new parameters to each program, I can add an environmental varaible that could be used like this, and anvi'o could append them to the final muscle
command line to ensure a successful finish:export MUSCLE_PARAMS="-maxhours 6 -maxiters 2"
In addition to the last idea, can you imagine a way to improve things on anvi'o side?
I now added this last idea to master
if you like to try with -maxiters 2
. It seems to be working as expected on my test case:
$ export MUSCLE_PARAMS="-maxiters 2"
$ anvi-get-sequences-for-hmm-hits -c CONTIGS.db --hmm-source Bacteria_71 --gene-names "Ribosomal_S13,RNA_pol_L" --concat --return-best-hit --get-aa-sequences --no-wrap --debug -p PROFILE.db -C default
(...)
[DEBUG] `run_command` is running .............: muscle -in /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpov2aji31/input.fa -out /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpov2aji31/output.fa -maxiters 2
(...)
$ cat /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpqs05cuat/00_log.txt
# DATE: 03 Feb 24 12:48:28
# CMD LINE: muscle -in /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpqs05cuat/input.fa -out /var/folders/0w/z3rt2nfs3d7dxf6jkl8s75v00000gn/T/tmpqs05cuat/output.fa -maxiters 2
MUSCLE v3.8.1551 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
input 8 seqs, lengths min 114, max 135, avg 121
00:00:00 1 MB(0%) Iter 1 100.00% K-mer dist pass 1
00:00:00 1 MB(0%) Iter 1 100.00% K-mer dist pass 2
00:00:00 2 MB(0%) Iter 1 100.00% Align node
00:00:00 2 MB(0%) Iter 1 100.00% Root alignment
00:00:00 2 MB(0%) Iter 2 100.00% Refine tree
00:00:00 2 MB(0%) Iter 2 100.00% Root alignment
00:00:00 2 MB(0%) Iter 2 100.00% Root alignment
Hi @meren seems like this is not an issue with Anvi'o. I tried running the same command as you (I weirdly do get negative usage numbers) but eventually run into a segfault in muscle. Likely due to the size and type of the data. Guess that's what I get for trying to align all the viral defense genes in all the GTDB genomes! Sorry for wasting your time on this, I will continue looking for an alternative on my end.
Thanks for looking into this further, @Ge0rges. I wonder if it is worthwhile trying it with famsa
. If it works, anvi'o has a driver for it as well and you can change the aligner through a command-line option for a workflow.
Otherwise, it may be possible to use something like MMSeqs2 to cluster the sequences first at a level of 80% similarity or something, and then align individually the top 5-6 clusters which would likely make up over 90% of sequences.
As the number of genomes increase, we will continue running into issues like this so this was very helpful to prepare for those times and I am sorry anvi'o couldn't be more helpful to you :) But if you would like to investigate alternative approaches together and bounce ideas with someone, I'm here for that.
Best wishes,
Hi @meren, I'm definitely willing to be your guinea pig here if that's more widely useful. I will say that with maxiters 2, I was able to get my task to complete. I'm unsure however how that has affected the alignment quality (since the better one doesn't complete and I do not have a ground truth). I will try famsa
and report back.
Hi @meren, sorry for the delay on this, but I just got to trying famsa
and got this error:
Traceback for debugging
================================================================================
File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 343, in <module>
main(args)
File "/Accounts/gkanaan/github/anvio/bin/anvi-get-sequences-for-hmm-hits", line 263, in main
s.store_hmm_sequences_into_FASTA(hmm_sequences_dict,
File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 922, in store_hmm_sequences_into_FASTA
self.__store_concatenated_hmm_sequences_into_FASTA(hmm_sequences_dict_for_splits, output_file_path, partition_file_path, wrap, separator, genes_order, align_with, just_do_it)
File "/Accounts/gkanaan/github/anvio/anvio/hmmops.py", line 812, in __store_concatenated_hmm_sequences_into_FASTA
genes_in_bins_dict[gene_name] = aligner(run=terminal.Run(verbose=False)).run_default(genes_list, debug=anvio.DEBUG)
File "/Accounts/gkanaan/github/anvio/anvio/drivers/famsa.py", line 107, in run_default
raise ConfigError("Default alignment option for Famsa is not implemented :/")
================================================================================
Config Error: Default alignment option for Famsa is not implemented :/
I will try to implement this ASAP. It'll probably take a few days :(
Short description of the problem
anvi-get-sequences-for-hmm-hits
fails saying the expected output from muscle is missing when large genes are present.anvi'o version
System info
Installed according to provided instructions on Rocky Linux.
Detailed description of the issue
Running
anvi-get-sequences-for-hmm-hits --external-genomes external-genomes.txt --hmm-source RM_DefenseFinder --return-best-hit --get-aa-sequences --concatenate -o concatenated-rm-proteins.fa
on around 85 000 databases. The command fails saying the expected output from muscle is missing. When I look at the referenced log, it is devoid of errors and just lists the command that was runmuscle -quiet
.This was originally posted on Discord, and moved here following a suggestion by meren.
@meren stated:
A conversation then followed regarding potential solutions the most relevant (and recent) exchange is quoted below.
Meren:
Me: