Re-running Pan-genome with different parameters

jarrodscott commented 7 years ago

Trying to re-run pangenome analysis using different parameters for --min-occurrence, mcl-inflation, and --maxbit.

:: Anvo'o version ...

Anvi'o version ...............................: 2.0.3
Profile DB version ...........................: 16
Contigs DB version ...........................: 7
Pan DB version ...............................: 4
Samples information DB version ...............: 2
Genome data storage version ..................: 1
Auxiliary data storage version ...............: 2
Anvi'server users data storage version .......: 1

Original command:

anvi-pan-genome -g woese-GENOMES.h5 -o woese --num-threads 8 --min-occurrence 1 --sensitive --mcl-inflation 2.0 --maxbit 0.5 --use-ncbi-blast -J woese

1) if I change nothing anvio yells at me about overwriting an existing database. Thanks anvio! 2) if I change the -J flag (project name) anvio throws the following error:

WARNING
===============================================
Notice: A BLAST database is found in the output directory, and will be used!

WARNING
===============================================
Notice: A BLAST search result is found in the output directory: skipping BLASTP!

[20 Dec 16 14:06:01 BLAST] Un-uniqueing the tabular output ...                                                                                                                    Traceback (most recent call last):
  File "/usr/local/bin/anvi-pan-genome", line 4, in <module>
    __import__('pkg_resources').run_script('anvio==2.0.3', 'anvi-pan-genome')
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/anvi-pan-genome", line 99, in <module>
    pan.process()
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/panops.py", line 1033, in process
    blastall_results = self.run_search(unique_proteins_FASTA_path, unique_proteins_names_dict)
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/panops.py", line 560, in run_search
    return self.run_blast(unique_proteins_fasta_path, unique_proteins_names_dict)
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/panops.py", line 555, in run_blast
    return blast.get_blastall_results()
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/drivers/blast.py", line 76, in get_blastall_results
    self.ununique_search_results()
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/drivers/blast.py", line 134, in ununique_search_results
    utils.ununique_BLAST_tabular_output(self.search_output_path, self.names_dict)
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/utils.py", line 971, in ununique_BLAST_tabular_output
    for subject_id in names_dict[fields[1]]:
KeyError: '8622f82b_991'

meren commented 7 years ago

Hi Jarrod,

I am having hard time replicating this error :/ I wonder why. Here is an example. The first clean run:

$ anvi-pan-genome -g TEST-GENOMES.h5 -o TEST/ -J TEST

WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)

Genomes storage ..............................: Initialized (storage hash: 801e9551)
Num genomes in storage .......................: 3
Num genomes will be used .....................: 3
Pan database .................................: A new database, /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/TEST-PAN.db, has been created.
Exclude partial gene calls ...................: False

Unique protein sequences FASTA ...............: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/combined-proteins.fa

Num protein sequences reported ...............: 355
Num excluded gene calls ......................: 0
Diamond search db ............................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/combined-proteins.fa.dmnd
DIAMOND is set to be .........................: Fast
Diamond blastp results .......................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/diamond-search-results.daa
Diamond un-uniquedtabular output file ........: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/diamond-search-results.txt
Min percent identity .........................: 0.0
Maxbit .......................................: 0.5
Filtered search results ......................: 1,051 edges stored
MCL input ....................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/mcl-input.txt
MCL inflation ................................: 2.0
MCL output ...................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/mcl-clusters.txt
Number of protein clusters ...................: 120
protein clusters info ........................: 120 PCs stored in the database
New hierarchical clusetring ..................: "frequency:euclidean:ward" has been added to the database...
New hierarchical clusetring ..................: "presence-absence:euclidean:ward" has been added to the database...
Anvi'o samples information ...................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/TEST-samples-information.txt
Anvi'o samples order .........................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/TEST-samples-order.txt
log file .....................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/log.txt

And this is what I see when I run the following command:

$ anvi-display-pan -p TEST/TEST-PAN.db -s TEST/TEST-SAMPLES.db

Then I run a second one with a different --min-occurrence param:

$ anvi-pan-genome -g TEST-GENOMES.h5 -o TEST/ -J PFT --min-occurrence 2

WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)

Genomes storage ..............................: Initialized (storage hash: 801e9551)
Num genomes in storage .......................: 3
Num genomes will be used .....................: 3
Pan database .................................: A new database, /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/PFT-PAN.db, has been created.
Exclude partial gene calls ...................: False

Unique protein sequences FASTA ...............: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/combined-proteins.fa

Num protein sequences reported ...............: 355
Num excluded gene calls ......................: 0

WARNING
===============================================
Notice: A diamond database is found in the output directory, and will be used!

WARNING
===============================================
Notice: A DIAMOND search result is found in the output directory: skipping
BLASTP!

WARNING
===============================================
Notice: A DIAMOND tabular output is found in the output directory. Anvi'o will
not generate another one!

Min percent identity .........................: 0.0
Maxbit .......................................: 0.5
Filtered search results ......................: 1,051 edges stored
MCL input ....................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/mcl-input.txt
MCL inflation ................................: 2.0
MCL output ...................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/mcl-clusters.txt
Number of protein clusters ...................: 120
protein clusters info ........................: 120 PCs stored in the database
PCs min occurrence ...........................: 2 (the filter removed 3 PCs)
New hierarchical clusetring ..................: "frequency:euclidean:ward" has been added to the database...
New hierarchical clusetring ..................: "presence-absence:euclidean:ward" has been added to the database...
Anvi'o samples information ...................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/PFT-samples-information.txt
Anvi'o samples order .........................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/PFT-samples-order.txt
log file .....................................: /Users/meren/github/anvio/tests/sandbox/test-output/pan_test/TEST/log.txt

It does find previous search results, and uses them, and the following command gives me the display underneath:

$ anvi-display-pan -p TEST/PFT-PAN.db -s TEST/PFT-SAMPLES.db --title PFT

I will continue to look into this, but this is where I am.

meren commented 7 years ago

Is it possible that you may have changed the genomes storage? Is woese-GENOMES.h5 is identical to the one you used for the previous analysis?

meren commented 7 years ago

Well, I changed the genomes storage and tried again, and this is what I got back:

Config Error: Something horrible happened. This can only happen if you started a new analysis
              with additional genomes without cleaning the previous work directory. Sounds
              familiar?

So probably it is not what you did. When is the last time did you git pull?

jarrodscott commented 7 years ago

Hi Meren,

I have not updated since the last "stable" release of the unstable release 2.0.3. Looks like you have made changes in the codebase. I will do that next.

In the meantime, I did two things: 1) ran the pipeline as you did without --use-ncbi-blast flag. Seemed to work great--see below 2) ran with --use-ncbi-blast flag.

I am only including the terminal output in each case for the second command where we change the --min-occurrence to 2

1) $anvi-gen-genomes-storage -e external_genomes.txt -o woese-GENOMES.h5 $anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese --num-threads 8 and then... $ anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese2 --num-threads 8 --min-occurrence 2

WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)

Genomes storage ..............................: Initialized (storage hash: 9d3210ab)
Num genomes in storage .......................: 33
Num genomes will be used .....................: 33
Pan database .................................: A new database, /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-PAN.db, has been created.
Exclude partial gene calls ...................: False

Unique protein sequences FASTA ...............: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/combined-proteins.fa

Num protein sequences reported ...............: 23,953
Num excluded gene calls ......................: 0

WARNING
===============================================
Notice: A diamond database is found in the output directory, and will be used!

WARNING
===============================================
Notice: A DIAMOND search result is found in the output directory: skipping
BLASTP!

WARNING
===============================================
Notice: A DIAMOND tabular output is found in the output directory. Anvi'o will
not generate another one!

Min percent identity .........................: 0.0                                                            
Maxbit .......................................: 0.5
Filtered search results ......................: 72,247 edges stored                                            
MCL input ....................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/mcl-input.txt
MCL inflation ................................: 2.0
MCL output ...................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/mcl-clusters.txt
Number of protein clusters ...................: 16,781
protein clusters info ........................: 16781 PCs stored in the database                               
PCs min occurrence ...........................: 2 (the filter removed 14390 PCs)                               
New hierarchical clusetring ..................: "frequency:euclidean:ward" has been added to the database...   
[20 Dec 16 21:15:05 Vectors from 2 matrices] Recovering the tree from the clustering result                    
WARNING
===============================================
Clustering has failed for "presence-absence": "Linkage 'Z' uses the same cluster
more than once."

Anvi'o samples information ...................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-samples-information.txt
Anvi'o samples order .........................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-samples-order.txt
log file .....................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/log.txt

looks like it worked fine--aside from the Linkage 'Z' error which could be my scipy version (0.18.1)?

2) Next I ran the whole process again but added the --use-ncbi-blast flag

$ anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese --num-threads 8 --use-ncbi-blast $ anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese2 --num-threads 8 --use-ncbi-blast --min-occurrence 2

WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)

Genomes storage ..............................: Initialized (storage hash: 9d3210ab)
Num genomes in storage .......................: 33
Num genomes will be used .....................: 33
Pan database .................................: A new database, /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-PAN.db, has been created.
Exclude partial gene calls ...................: False

Unique protein sequences FASTA ...............: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/combined-proteins.fa                                            

Num protein sequences reported ...............: 23,953
Num excluded gene calls ......................: 0

WARNING
===============================================
You elected to use NCBI's blastp for protein search. Running blastp will be
significantly slower than DIAMOND (although, anvi'o developers are convinced
that you *are* doing the right thing, so, kudos to you).

WARNING
===============================================
Notice: A BLAST database is found in the output directory, and will be used!

WARNING
===============================================
Notice: A BLAST search result is found in the output directory: skipping BLASTP!

[20 Dec 16 21:53:12 BLAST] Un-uniqueing the tabular output ...                                                                                                                 Traceback (most recent call last):
  File "/usr/local/bin/anvi-pan-genome", line 4, in <module>
    __import__('pkg_resources').run_script('anvio==2.0.3', 'anvi-pan-genome')
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 719, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 1504, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/EGG-INFO/scripts/anvi-pan-genome", line 99, in <module>
    pan.process()
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/panops.py", line 1033, in process
    blastall_results = self.run_search(unique_proteins_FASTA_path, unique_proteins_names_dict)
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/panops.py", line 560, in run_search
    return self.run_blast(unique_proteins_fasta_path, unique_proteins_names_dict)
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/panops.py", line 555, in run_blast
    return blast.get_blastall_results()
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/drivers/blast.py", line 76, in get_blastall_results
    self.ununique_search_results()
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/drivers/blast.py", line 134, in ununique_search_results
    utils.ununique_BLAST_tabular_output(self.search_output_path, self.names_dict)
  File "/usr/local/lib/python2.7/dist-packages/anvio-2.0.3-py2.7-linux-x86_64.egg/anvio/utils.py", line 971, in ununique_BLAST_tabular_output
    for subject_id in names_dict[fields[1]]:
KeyError: '8622f82b_991'

ok, so I run this command again without --use-ncbi-blast but still changing --min-occurrence 2

$ anvi-pan-genome -g woese-GENOMES.h5 -o woese/ -J woese2 --num-threads 8 --min-occurrence 2

here is the output.

WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)

Genomes storage ..............................: Initialized (storage hash: 9d3210ab)
Num genomes in storage .......................: 33
Num genomes will be used .....................: 33
Pan database .................................: A new database, /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-PAN.db, has been created.
Exclude partial gene calls ...................: False

Unique protein sequences FASTA ...............: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/combined-proteins.fa                                            

Num protein sequences reported ...............: 23,953
Num excluded gene calls ......................: 0
Diamond search db ............................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/combined-proteins.fa.dmnd                                       
DIAMOND is set to be .........................: Fast
Diamond blastp results .......................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/diamond-search-results.daa                                      
Diamond un-uniquedtabular output file ........: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/diamond-search-results.txt                                      
Min percent identity .........................: 0.0                                                                                                                             
Maxbit .......................................: 0.5
Filtered search results ......................: 72,247 edges stored                                                                                                             
MCL input ....................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/mcl-input.txt
MCL inflation ................................: 2.0
MCL output ...................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/mcl-clusters.txt                                                
Number of protein clusters ...................: 16,781
protein clusters info ........................: 16781 PCs stored in the database                                                                                                
PCs min occurrence ...........................: 2 (the filter removed 14390 PCs)                                                                                                
New hierarchical clusetring ..................: "frequency:euclidean:ward" has been added to the database...                                                                    
[20 Dec 16 22:01:50 Vectors from 2 matrices] Recovering the tree from the clustering result                                                                                     
WARNING
===============================================
Clustering has failed for "presence-absence": "Linkage 'Z' uses the same cluster
more than once."

Anvi'o samples information ...................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-samples-information.txt                                  
Anvi'o samples order .........................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/woese2-samples-order.txt                                        
log file .....................................: /home/j/Documents/Analytics/anvio_processing/Eric_becraft/woese/log.txt

which looks like it ran the whole process again just using DIAMOND instead?

I will update the code and see about downgrading scipy :)

meren commented 7 years ago

This is very helpful! Thank you very much. I will look into this soon.

meren commented 7 years ago

If you update your master, this should work now, Jarrod. It took hours to find what was wrong, and ironically, this little change was enough to solve it: 3d198ba934608c42878b6b0f887c7cd6bb117f60.

Thank you very much for your help with this one.

jarrodscott commented 7 years ago

No problem Meren. updated and re-running now. if you don't hear from me it means all went well.

And thank you for solving this problem!

merenlab / anvio

Re-running Pan-genome with different parameters #430