Closed jarrodscott closed 7 years ago
Hi Jarrod,
I am having hard time replicating this error :/ I wonder why. Here is an example. The first clean run:
$ anvi-pan-genome -g TEST-GENOMES.h5 -o TEST/ -J TEST
WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)
Genomes storage ..............................: Initialized (storage hash: 801e9551)
Num genomes in storage .......................: 3
Num genomes will be used .....................: 3
Pan database .................................: A new database, /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/TEST-PAN.db, has been created.
Exclude partial gene calls ...................: False
Unique protein sequences FASTA ...............: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/combined-proteins.fa
Num protein sequences reported ...............: 355
Num excluded gene calls ......................: 0
Diamond search db ............................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/combined-proteins.fa.dmnd
DIAMOND is set to be .........................: Fast
Diamond blastp results .......................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/diamond-search-results.daa
Diamond un-uniquedtabular output file ........: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/diamond-search-results.txt
Min percent identity .........................: 0.0
Maxbit .......................................: 0.5
Filtered search results ......................: 1,051 edges stored
MCL input ....................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/mcl-input.txt
MCL inflation ................................: 2.0
MCL output ...................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/mcl-clusters.txt
Number of protein clusters ...................: 120
protein clusters info ........................: 120 PCs stored in the database
New hierarchical clusetring ..................: "frequency:euclidean:ward" has been added to the database...
New hierarchical clusetring ..................: "presence-absence:euclidean:ward" has been added to the database...
Anvi'o samples information ...................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/TEST-samples-information.txt
Anvi'o samples order .........................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/TEST-samples-order.txt
log file .....................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/log.txt
And this is what I see when I run the following command:
$ anvi-display-pan -p TEST/TEST-PAN.db -s TEST/TEST-SAMPLES.db
Then I run a second one with a different --min-occurrence
param:
$ anvi-pan-genome -g TEST-GENOMES.h5 -o TEST/ -J PFT --min-occurrence 2
WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)
Genomes storage ..............................: Initialized (storage hash: 801e9551)
Num genomes in storage .......................: 3
Num genomes will be used .....................: 3
Pan database .................................: A new database, /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/PFT-PAN.db, has been created.
Exclude partial gene calls ...................: False
Unique protein sequences FASTA ...............: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/combined-proteins.fa
Num protein sequences reported ...............: 355
Num excluded gene calls ......................: 0
WARNING
===============================================
Notice: A diamond database is found in the output directory, and will be used!
WARNING
===============================================
Notice: A DIAMOND search result is found in the output directory: skipping
BLASTP!
WARNING
===============================================
Notice: A DIAMOND tabular output is found in the output directory. Anvi'o will
not generate another one!
Min percent identity .........................: 0.0
Maxbit .......................................: 0.5
Filtered search results ......................: 1,051 edges stored
MCL input ....................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/mcl-input.txt
MCL inflation ................................: 2.0
MCL output ...................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/mcl-clusters.txt
Number of protein clusters ...................: 120
protein clusters info ........................: 120 PCs stored in the database
PCs min occurrence ...........................: 2 (the filter removed 3 PCs)
New hierarchical clusetring ..................: "frequency:euclidean:ward" has been added to the database...
New hierarchical clusetring ..................: "presence-absence:euclidean:ward" has been added to the database...
Anvi'o samples information ...................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/PFT-samples-information.txt
Anvi'o samples order .........................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/PFT-samples-order.txt
log file .....................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/log.txt
It does find previous search results, and uses them, and the following command gives me the display underneath:
$ anvi-display-pan -p TEST/PFT-PAN.db -s TEST/PFT-SAMPLES.db --title PFT
I will continue to look into this, but this is where I am.
Is it possible that you may have changed the genomes storage? Is woese-GENOMES.h5
is identical to the one you used for the previous analysis?
Well, I changed the genomes storage and tried again, and this is what I got back:
Config Error: Something horrible happened. This can only happen if you started a new analysis
with additional genomes without cleaning the previous work directory. Sounds
familiar?
So probably it is not what you did. When is the last time did you git pull?
Hi Meren,
I have not updated since the last "stable" release of the unstable release 2.0.3. Looks like you have made changes in the codebase. I will do that next.
In the meantime, I did two things:
1) ran the pipeline as you did without --use-ncbi-blast
flag. Seemed to work great--see below
2) ran with --use-ncbi-blast
flag.
I am only including the terminal output in each case for the second command where we change the --min-occurrence
to 2
1)
$anvi-gen-genomes-storage -e external_genomes.txt -o woese-GENOMES.h5
$anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese --num-threads 8
and then...
$ anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese2 --num-threads 8 --min-occurrence 2
WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)
Genomes storage ..............................: Initialized (storage hash: 9d3210ab)
Num genomes in storage .......................: 33
Num genomes will be used .....................: 33
Pan database .................................: A new database, /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-PAN.db, has been created.
Exclude partial gene calls ...................: False
Unique protein sequences FASTA ...............: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/combined-proteins.fa
Num protein sequences reported ...............: 23,953
Num excluded gene calls ......................: 0
WARNING
===============================================
Notice: A diamond database is found in the output directory, and will be used!
WARNING
===============================================
Notice: A DIAMOND search result is found in the output directory: skipping
BLASTP!
WARNING
===============================================
Notice: A DIAMOND tabular output is found in the output directory. Anvi'o will
not generate another one!
Min percent identity .........................: 0.0
Maxbit .......................................: 0.5
Filtered search results ......................: 72,247 edges stored
MCL input ....................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/mcl-input.txt
MCL inflation ................................: 2.0
MCL output ...................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/mcl-clusters.txt
Number of protein clusters ...................: 16,781
protein clusters info ........................: 16781 PCs stored in the database
PCs min occurrence ...........................: 2 (the filter removed 14390 PCs)
New hierarchical clusetring ..................: "frequency:euclidean:ward" has been added to the database...
[20 Dec 16 21:15:05 Vectors from 2 matrices] Recovering the tree from the clustering result
WARNING
===============================================
Clustering has failed for "presence-absence": "Linkage 'Z' uses the same cluster
more than once."
Anvi'o samples information ...................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-samples-information.txt
Anvi'o samples order .........................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-samples-order.txt
log file .....................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/log.txt
looks like it worked fine--aside from the Linkage 'Z'
error which could be my scipy version (0.18.1)?
2)
Next I ran the whole process again but added the --use-ncbi-blast
flag
$ anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese --num-threads 8 --use-ncbi-blast
$ anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese2 --num-threads 8 --use-ncbi-blast --min-occurrence 2
WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)
Genomes storage ..............................: Initialized (storage hash: 9d3210ab)
Num genomes in storage .......................: 33
Num genomes will be used .....................: 33
Pan database .................................: A new database, /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-PAN.db, has been created.
Exclude partial gene calls ...................: False
Unique protein sequences FASTA ...............: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/combined-proteins.fa
Num protein sequences reported ...............: 23,953
Num excluded gene calls ......................: 0
WARNING
===============================================
You elected to use NCBI's blastp for protein search. Running blastp will be
significantly slower than DIAMOND (although, anvi'o developers are convinced
that you *are* doing the right thing, so, kudos to you).
WARNING
===============================================
Notice: A BLAST database is found in the output directory, and will be used!
WARNING
===============================================
Notice: A BLAST search result is found in the output directory: skipping BLASTP!
[20 Dec 16 21:53:12 BLAST] Un-uniqueing the tabular output ... Traceback (most recent call last):
File "/usr/local/bin/anvi-pan-genome", line 4, in <module>
__import__('pkg_resources').run_script('anvio==2.0.3', 'anvi-pan-genome')
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 719, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1504, in run_script
exec(code, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/anvi-pan-genome", line 99, in <module>
pan.process()
File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/panops.py", line 1033, in process
blastall_results = self.run_search(unique_proteins_FASTA_path, unique_proteins_names_dict)
File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/panops.py", line 560, in run_search
return self.run_blast(unique_proteins_fasta_path, unique_proteins_names_dict)
File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/panops.py", line 555, in run_blast
return blast.get_blastall_results()
File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/drivers/blast.py", line 76, in get_blastall_results
self.ununique_search_results()
File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/drivers/blast.py", line 134, in ununique_search_results
utils.ununique_BLAST_tabular_output(self.search_output_path, self.names_dict)
File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/utils.py", line 971, in ununique_BLAST_tabular_output
for subject_id in names_dict[fields[1]]:
KeyError: '8622f82b_991'
ok, so I run this command again without --use-ncbi-blast
but still changing --min-occurrence 2
$ anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese2 --num-threads 8 --min-occurrence 2
here is the output.
WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)
Genomes storage ..............................: Initialized (storage hash: 9d3210ab)
Num genomes in storage .......................: 33
Num genomes will be used .....................: 33
Pan database .................................: A new database, /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-PAN.db, has been created.
Exclude partial gene calls ...................: False
Unique protein sequences FASTA ...............: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/combined-proteins.fa
Num protein sequences reported ...............: 23,953
Num excluded gene calls ......................: 0
Diamond search db ............................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/combined-proteins.fa.dmnd
DIAMOND is set to be .........................: Fast
Diamond blastp results .......................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/diamond-search-results.daa
Diamond un-uniquedtabular output file ........: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/diamond-search-results.txt
Min percent identity .........................: 0.0
Maxbit .......................................: 0.5
Filtered search results ......................: 72,247 edges stored
MCL input ....................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/mcl-input.txt
MCL inflation ................................: 2.0
MCL output ...................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/mcl-clusters.txt
Number of protein clusters ...................: 16,781
protein clusters info ........................: 16781 PCs stored in the database
PCs min occurrence ...........................: 2 (the filter removed 14390 PCs)
New hierarchical clusetring ..................: "frequency:euclidean:ward" has been added to the database...
[20 Dec 16 22:01:50 Vectors from 2 matrices] Recovering the tree from the clustering result
WARNING
===============================================
Clustering has failed for "presence-absence": "Linkage 'Z' uses the same cluster
more than once."
Anvi'o samples information ...................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-samples-information.txt
Anvi'o samples order .........................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-samples-order.txt
log file .....................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/log.txt
which looks like it ran the whole process again just using DIAMOND instead?
I will update the code and see about downgrading scipy :)
This is very helpful! Thank you very much. I will look into this soon.
If you update your master
, this should work now, Jarrod. It took hours to find what was wrong, and ironically, this little change was enough to solve it: 3d198ba934608c42878b6b0f887c7cd6bb117f60.
Thank you very much for your help with this one.
No problem Meren. updated and re-running now. if you don't hear from me it means all went well.
And thank you for solving this problem!
Trying to re-run pangenome analysis using different parameters for --min-occurrence, mcl-inflation, and --maxbit.
:: Anvo'o version ...
Original command:
1) if I change nothing anvio yells at me about overwriting an existing database. Thanks anvio! 2) if I change the -J flag (project name) anvio throws the following error: