merenlab / anvio

An analysis and visualization platform for 'omics data
http://merenlab.org/software/anvio
GNU General Public License v3.0

Anvi'o is not compatible with SciPy 0.18.0 (the previous title was: Strange behavior in --min-occurrence in anvi-pan-genome) #395

Closed jbird9 closed 7 years ago

jbird9 commented 8 years ago

Hi Meren,

I was attempting to use anvi-pan-genome for a set of external genomes. With the default --min-occurrence 1 and with --min-occurrence 3, the program seems to run perfectly. However, when I tried to remove just the singletons using --min-occurrence 2, I got the following error:

[30 Aug 16 14:35:23 Hierarchical clustering] ..
Traceback (most recent call last):
  File "/usr/local/bin/anvi-pan-genome", line 99, in <module>
    pan.process()
  File "/usr/local/lib/python2.7/dist-packages/anvio/panops.py", line 680, in process
    self.gen_ad_hoc_anvio_run(view_data_presence_absence_file_path, experimental_data_file_path, additional_view_data_file_path, samples_info_file_path)
  File "/usr/local/lib/python2.7/dist-packages/anvio/panops.py", line 592, in gen_ad_hoc_anvio_run
    ad_hoc_run.generate()
  File "/usr/local/lib/python2.7/dist-packages/anvio/summarizer.py", line 853, in generate
    self.gen_clustering_of_view_data()
  File "/usr/local/lib/python2.7/dist-packages/anvio/summarizer.py", line 869, in gen_clustering_of_view_data
    clustering.get_newick_tree_data(self.matrix_data_for_clustering, self.tree_file_path)
  File "/usr/local/lib/python2.7/dist-packages/anvio/clustering.py", line 118, in get_newick_tree_data
    tree = get_clustering_as_tree(vectors, linkage, distance, progress)
  File "/usr/local/lib/python2.7/dist-packages/anvio/clustering.py", line 168, in get_clustering_as_tree
    tree = hierarchy.to_tree(linkage, rd=False)
  File "/usr/local/lib/python2.7/dist-packages/scipy/cluster/hierarchy.py", line 1007, in to_tree
    is_valid_linkage(Z, throw=True, name='Z')
  File "/usr/local/lib/python2.7/dist-packages/scipy/cluster/hierarchy.py", line 1421, in is_valid_linkage
    % name_str)
ValueError: Linkage 'Z' uses the same cluster more than once.
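
For readers unfamiliar with this message: a linkage matrix has one row per merge, and each row names the two clusters being joined. The sketch below is a simplified, pure-Python illustration of the kind of consistency check that could produce such an error. It is not SciPy's actual implementation, and the function name and example matrices are hypothetical.

```python
def uses_cluster_more_than_once(linkage_rows):
    """Return True if any cluster index is merged twice (or with itself).

    Each row is assumed to look like [left, right, distance, size],
    which mirrors the usual layout of a hierarchical-clustering linkage.
    """
    seen = set()
    for left, right, _distance, _size in linkage_rows:
        left, right = int(left), int(right)
        # A cluster may take part in exactly one merge as a child.
        if left == right or left in seen or right in seen:
            return True
        seen.update((left, right))
    return False


# Four observations (0..3): 0+1 -> cluster 4, 2+3 -> cluster 5, 4+5 -> root.
valid_linkage = [[0, 1, 1.0, 2], [2, 3, 1.5, 2], [4, 5, 2.0, 4]]

# Observation 0 is merged in the first row and again in the second row.
broken_linkage = [[0, 1, 1.0, 2], [0, 2, 1.5, 3], [4, 5, 2.0, 4]]

print(uses_cluster_more_than_once(valid_linkage))
print(uses_cluster_more_than_once(broken_linkage))
```

A tree cannot be built from a linkage in which the same child appears in two merges, which is why the conversion to a tree stops at validation.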

The dataset includes a number of fragmented SAGs and anvi-pan-genome -v yields:

Anvi'o version ...............................: 2.0.2
Profile DB version ...........................: 16
Contigs DB version ...........................: 6
Samples information DB version ...............: 2
Auxiliary HDF5 DB version ....................: 1
Users DB version (for anvi-server) ...........: 1

Thanks,

Jordan

meren commented 8 years ago

Hi Jordan,

This is very interesting. I have been using that flag quite often recently and I didn't really run into any issues with it. I am curious whether this could be something specific to your data. I would be happy to take a look if you were to share it with me, or find a way to help me replicate the error :)

Best,

meren commented 8 years ago

Although it will take some time for me to get back to it as I am traveling quite extensively :(

jbird9 commented 8 years ago

Hi Meren,

I hope your travels are going well. I have attached a link to the SAG fastas I have been working with. I had two external genome files: one has just the bacteria, while the other includes two tiny archaeal genomes. I understand traveling can suck the time away, so I am not expecting a quick solution.

Thanks for your help,

Jordan Bird

(Meren's note: I edited the content to remove the attachment)

meren commented 8 years ago

Hi Jordan,

I downloaded them to have them with me to investigate this whenever I can find some time.

(Meanwhile I removed the link from your message so the data link is not archived).

Thank you!

meren commented 8 years ago

Hi Jordan,

I ran the pangenomic analysis using the files you sent with the parameter --min-occurrence 2, and it seems everything worked nicely and produced the following output:

meren ~/Downloads/SAG_FASTAs $ anvi-pan-genome -e external_genomes.csv -o pan --num-threads 6 --min-occurrence 2

WARNING
===============================================
If you publish results from this workflow, please do not forget to cite DIAMOND
(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL
(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)

External genomes .............................: 45 have been initialized.
Internal genomes .............................: 0 have been initialized.
Exclude partial gene calls ...................: False

* JS1_60B_A09 is initialized with 1,499 genes (0 were excluded)
* JS1_59E_13H_C14 is initialized with 542 genes (0 were excluded)
* NT-B2-AD-617-P19 is initialized with 1,168 genes (0 were excluded)
* JS1_60B_N06 is initialized with 2,875 genes (0 were excluded)
* OPB41_60B_13H_C09 is initialized with 1,067 genes (0 were excluded)
* JS1_60B_M10 is initialized with 1,880 genes (0 were excluded)
* OPB41_60B_13H_B07 is initialized with 963 genes (0 were excluded)
* JS1_59E_13H_F07 is initialized with 640 genes (0 were excluded)
* MG2 is initialized with 383 genes (0 were excluded)
* OPB41_59E_21H_M23 is initialized with 555 genes (0 were excluded)
* JS1_59E_13H_O21 is initialized with 922 genes (0 were excluded)
* JS1_60B_I07 is initialized with 862 genes (0 were excluded)
* OP8_59E_13H_E21 is initialized with 524 genes (0 were excluded)
* Chl_60B_28H_A21 is initialized with 910 genes (0 were excluded)
* OP8_59E_13H_M21 is initialized with 599 genes (0 were excluded)
* D-anilini-AD-619-D02 is initialized with 696 genes (0 were excluded)
* Chl_60B_28H_C14 is initialized with 1,067 genes (0 were excluded)
* OPB41_59E_21H_O21 is initialized with 934 genes (0 were excluded)
* JS1_60B_E13 is initialized with 1,398 genes (0 were excluded)
* JS1_59E_13H_L23 is initialized with 588 genes (0 were excluded)
* OPB41-AD-617-I09 is initialized with 672 genes (0 were excluded)
* OP8-AD-619-P22 is initialized with 488 genes (0 were excluded)
* OP8-AD-617-C16 is initialized with 1,045 genes (0 were excluded)
* JS1_59E_13H_K04 is initialized with 1,520 genes (0 were excluded)
* Unk_60B_28H_C08 is initialized with 563 genes (0 were excluded)
* OP8_59E_13H_F13 is initialized with 1,005 genes (0 were excluded)
* Chl_60B_13H_A19 is initialized with 636 genes (0 were excluded)
* Chloroflexi-AD-619-B06 is initialized with 37 genes (0 were excluded)
* JS1_59E_13H_E20 is initialized with 861 genes (0 were excluded)
* OPB41-AD-617-M19 is initialized with 1,015 genes (0 were excluded)
* Chloroflexi-AD-619-N02 is initialized with 429 genes (0 were excluded)
* NT-B2-AD-619-E05 is initialized with 1,517 genes (0 were excluded)
* JS1_60B_M21 is initialized with 1,479 genes (0 were excluded)
* Chloroflexi-AD-619-G11 is initialized with 109 genes (0 were excluded)
* OPB41_59E_21H_M06 is initialized with 132 genes (0 were excluded)
* OPB41_60B_13H_O22 is initialized with 665 genes (0 were excluded)
* NT-B2-AD-619-P03 is initialized with 522 genes (0 were excluded)
* first_spades_MCG is initialized with 676 genes (0 were excluded)
* JS1_59E_13H_E15 is initialized with 309 genes (0 were excluded)
* OPB41_60B_13H_A10 is initialized with 1,054 genes (0 were excluded)
* JS1_59E_13H_L14 is initialized with 238 genes (0 were excluded)
* OPB41_59E_21H_B05 is initialized with 611 genes (0 were excluded)
* JS1_60B_D03 is initialized with 1,785 genes (0 were excluded)
* D-anilini-AD-619-E09 is initialized with 1,040 genes (0 were excluded)
* OP8_59E_13H_M19 is initialized with 676 genes (0 were excluded)

Num protein sequences ........................: 39,156
Num excluded gene calls ......................: 0
Num unique protein sequences .................: 32,816
Combined protein sequences FASTA .............: /Users/meren/Downloads/SAG_FASTAs/pan/combined-proteins.fa
Unique protein sequences FASTA ...............: /Users/meren/Downloads/SAG_FASTAs/pan/combined-proteins.fa.unique

WARNING
===============================================
Notice: A diamond database is found in the output directory, and will be used!

WARNING
===============================================
Notice: A DIAMOND search result is found in the output directory: skipping
BLASTP!

WARNING
===============================================
Notice: A DIAMOND tabular output is found in the output directory. Anvi'o will
not generate another one!

Min percent identity .........................: 0.0
Maxbit .......................................: 0.5
Filtered search results ......................: 308,113 edges stored
MCL input ....................................: /Users/meren/Downloads/SAG_FASTAs/pan/mcl-input.txt
MCL inflation ................................: 2.0
MCL output ...................................: /Users/meren/Downloads/SAG_FASTAs/pan/mcl-clusters.txt
Number of protein clusters ...................: 14,941
protein clusters info ........................: /Users/meren/Downloads/SAG_FASTAs/pan/protein-clusters.txt
PCs min occurrence ...........................: 2 (the filter removed 9032 PCs)
Anvi'o view data for protein clusters ........: /Users/meren/Downloads/SAG_FASTAs/pan/anvio-view-data.txt
Anvi'o additional view data ..................: /Users/meren/Downloads/SAG_FASTAs/pan/anvio-additional-view-data.txt
Anvi'o samples information ...................: /Users/meren/Downloads/SAG_FASTAs/pan/anvio-samples-information.txt

WARNING
===============================================
filesnpaths::gen_output_directory: the client asked the existing directory
"/Users/meren/Downloads/SAG_FASTAs/pan/pan" to be removed.. Just so you know :/
(You have 5 seconds to press CTRL + C).

Tree .........................................: /Users/meren/Downloads/SAG_FASTAs/pan/pan/tree.txt
Anvi'o samples order .........................: /Users/meren/Downloads/SAG_FASTAs/pan/pan/anvio-samples-order.txt
Ad hoc anvi'o run files ......................: /Users/meren/Downloads/SAG_FASTAs/pan/pan
log file .....................................: /Users/meren/Downloads/SAG_FASTAs/pan/log.txt
meren ~/Downloads/SAG_FASTAs $ cd pan/pan
meren ~/Downloads/SAG_FASTAs/pan/pan $ anvi-interactive -p profile.db -s samples.db -t tree.txt -d view_data.txt -A additional_view_data.txt --manual

I have a feeling that if you remove the previous output directory completely, and re-run the same command everything will run this time. Can you please try?

Thank you,

jbird9 commented 8 years ago

I have now attempted this on two separate computers and I am getting the same error with linkage Z. Perhaps this is a distribution-specific bug.

meren commented 8 years ago

Hi Jordan,

What is your version of scipy? Here is mine:

meren ~ $ python -c 'import scipy; print scipy.__version__'
0.17.1

Thanks,

jbird9 commented 8 years ago

Mine is 0.18.0

meren commented 8 years ago

Crap. I see 0.18.0 is only 2 weeks old. I wonder if this is something about their new release.

If you can downgrade your version to 0.17.1, it will probably fix the problem. In the meantime, I will look into this as soon as possible.
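
Until that happens, a small guard like the one below could warn before a long run starts. This is only a sketch under stated assumptions: it flags the single release reported as failing in this thread (0.18.0), the helper names are hypothetical, and it is not part of anvi'o.

```python
import re

# Assumption: only the exact release reported in this thread is flagged.
KNOWN_BAD_SCIPY_VERSIONS = {(0, 18, 0)}


def parse_version(version_string):
    """Turn a string such as '0.18.0' into the tuple (0, 18, 0)."""
    numbers = [int(n) for n in re.findall(r"\d+", version_string)]
    # Pad short versions such as '1.2' so every tuple has three parts.
    return tuple((numbers + [0, 0, 0])[:3])


def scipy_is_known_bad(version_string):
    """Return True if the version matches a release reported as failing."""
    return parse_version(version_string) in KNOWN_BAD_SCIPY_VERSIONS


if __name__ == "__main__":
    try:
        import scipy
    except ImportError:
        print("SciPy is not installed.")
    else:
        if scipy_is_known_bad(scipy.__version__):
            print("SciPy %s was reported to break clustering; 0.17.1 worked."
                  % scipy.__version__)
        else:
            print("SciPy %s is not on the known-bad list." % scipy.__version__)
```

Comparing parsed tuples rather than raw strings avoids surprises such as '0.9.0' sorting after '0.18.0'.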

Sorry about this.

jbird9 commented 8 years ago

Hi Meren,

Just to let you know, I installed scipy 0.17.1 and reran the analysis and it ran through without a hitch. Clearly, there was a change in scipy 0.18.0 that is causing the bug.

Thanks,

Jordan

meren commented 8 years ago

Thank you for looking into this. Upgrades that break things that worked perfectly indeed have a special place in hell.