lskatz / mashtree

:deciduous_tree: Create a tree using Mash distances
GNU General Public License v3.0

Mashtree::createTreeFromPhylip: Can't call method "as_text" on an undefined value - using .msh as input #86

Open Lumimar opened 8 months ago

Lumimar commented 8 months ago

Hello there, I can't get a .dnd output when running mashtree on *.msh files, although I do get one when I use the fastq files as input. See below.

I ran mashtree --mindepth 0 --numcpus 4 --outmatrix mashmatrix.txt ERR*/*msh > mashtree.dnd

on the output generated by mash sketch -k 21 -s 10000 - -o ${out_dir}"/"${id}

with the following log

mashtree: main: Found mash version 2 - /home/xxx/.local/bin/mash
mashtree: main: Temporary directory will be /tmp/xxx_16154334/MASHTREE.EUJ6ti
mashtree: main: mashtree on 448 files

mashtree: mashSketch(TID2): This thread will work on 112 sketches
mashtree: mashSketch(TID2): Working on file 1 out of 112
mashtree: mashSketch(TID2): Input file is a sketch file itself and will be used as such: ERRxxx/ERRxxx.msh
mashtree: mashSketch(TID2): WARNING: ERRxxx.msh was already mashed.
.....
mashtree: mashDist(TID6): Distances for /tmp/xxx_16154334/MASHTREE.EUJ6ti/ERRxxx1.msh
mashtree: mashDist(TID7): Distances for /tmp/xxx_16154334/MASHTREE.EUJ6ti/ERRxxx2.msh
mashtree: mashDist(TID5): Distances for /tmp/xxx_16154334/MASHTREE.EUJ6ti/ERRxxx3.msh
mashtree: mashDist(TID6): Distances for /tmp/xxx_16154334/MASHTREE.EUJ6ti/ERRxxx4.msh
mashtree: mashDist(TID5): Distances for /tmp/xxx_16154334/MASHTREE.EUJ6ti/ERRxxx5.msh
mashtree: mashDistance: Databasing distances (1/4, TID5)
mashtree: mashDistance: Waiting to join thread (2/4, TID6)
mashtree: mashDistance: Databasing distances (2/4, TID6)
mashtree: mashDistance: Waiting to join thread (3/4, TID7)
mashtree: mashDistance: Databasing distances (3/4, TID7)
mashtree: mashDistance: Waiting to join thread (4/4, TID8)
mashtree: mashDistance: Databasing distances (4/4, TID8)
mashtree: mashDistance: Converting to phylip format into /tmp/xxxx_16154334/MASHTREE.EUJ6ti/distances.phylip
mashtree: mashDistance: Writing a distance matrix to mashmatrix.txt
mashtree: Mashtree::createTreeFromPhylip: Can't call method "as_text" on an undefined value 
Stopped at ...Mashtree.pm line 339.

The outmatrix was generated, but not the .dnd output. Looking at Mashtree.pm, it seems that $outdir/tree.dnd.tmp was never created (I removed the unlink() on line 343, but no .tmp file appeared).

Mashtree version 1.4.6, installed with conda on a Linux cluster with the following configuration: 4.18.0-513.9.1.el8_9.x86_64. I could not install it via cpanm because some dependencies were missing and I could not install them without sudo. Now, if I run cpanm -l ~ Mashtree, I get "Mashtree is up to date. (1.4.6)".

Is it advisable to use fastq rather than msh as input? Many thanks!

ohdongha commented 6 months ago

Sorry for piggybacking on this issue. I get the exact same error message when running mashtree (1.4.6) with *.msh files as input.

Lumimar wrote: "looking at Mashtree.pm it seems that $outdir/tree.dnd.tmp was not created"

It seems like the issue is with distancesToPhylip, because the distances.phylip file is almost empty in my case. I assume it should contain the matrix of distances in PHYLIP format so that quicktree can build the tree in the next step:

$ cat temp_mashtree/distances.phylip 
    0
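
For anyone who wants to confirm the same symptom on their own temporary directory, here is a minimal sketch of a sanity check. It assumes the relaxed PHYLIP square-matrix layout (first line is the taxon count, then one row per taxon consisting of a name without whitespace followed by its distances). `check_phylip` is a hypothetical helper written for this thread, not part of Mashtree.

```python
def check_phylip(text):
    """Return (declared_count, row_count, problems) for a PHYLIP distance matrix.

    Assumes the relaxed square-matrix layout: a taxon count on the first
    line, then one whitespace-separated row per taxon (name, distances...).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0, 0, ["file is empty"]
    try:
        declared = int(lines[0].split()[0])
    except ValueError:
        return 0, len(lines) - 1, ["first line is not a taxon count"]

    rows = lines[1:]
    problems = []
    if declared == 0:
        problems.append("header declares 0 taxa")
    if len(rows) != declared:
        problems.append(
            f"header declares {declared} taxa but {len(rows)} rows follow"
        )
    for row in rows:
        fields = row.split()
        if len(fields) - 1 != declared:
            problems.append(
                f"row {fields[0]!r} has {len(fields) - 1} distances, "
                f"expected {declared}"
            )
    return declared, len(rows), problems
```

Calling it on the contents of the file above, for example `check_phylip(open("temp_mashtree/distances.phylip").read())`, reports a header of 0 taxa with no rows, which is consistent with quicktree having nothing to build a tree from.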

I wonder if the issue arises when parsing the genome names to create the PHYLIP distance matrix. In my case, the first several lines of the distances.db.tsv file look like this:

$ head temp_mashtree/distances.db.tsv | column -t -s$'\t'
genome1                                      genome2                                      distance
GCA_022405125.1_ASM2240512v1_genomic.fna.gz  GCA_022405125.1_ASM2240512v1_genomic.fna.gz  0
GCA_022405125.1_ASM2240512v1_genomic.fna.gz  GCA_028453695.1_APUR_v2.2.0_genomic.fna.gz   0.203582
GCA_022405125.1_ASM2240512v1_genomic.fna.gz  GCA_028454255.1_HLIG_v2.2.0_genomic.fna.gz   0.20308
GCA_022405125.1_ASM2240512v1_genomic.fna.gz  GCA_029448645.1_ASM2944864v1_genomic.fna.gz  0.19694
GCA_022405125.1_ASM2240512v1_genomic.fna.gz  GCA_913789895.3_iySelTumu1.3_genomic.fna.gz  0.202798
GCA_022405125.1_ASM2240512v1_genomic.fna.gz  GCA_913789915.3_iySphMoni1.3_genomic.fna.gz  0.209807
GCA_022405125.1_ASM2240512v1_genomic.fna.gz  GCA_916610135.2_iyMacEuro1.2_genomic.fna.gz  0.197773
GCA_022405125.1_ASM2240512v1_genomic.fna.gz  GCA_916610235.2_iyLasMori1.2_genomic.fna.gz  0.203062
GCA_022405125.1_ASM2240512v1_genomic.fna.gz  GCA_916610255.1_iyLasLatv2.1_genomic.fna.gz  0.202186

Is there a rule that the genome file names need to follow (a length limit, etc.)?
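
To make the expected result concrete, here is a rough sketch of how a three-column table like the one above could be turned into a square PHYLIP matrix. This is not Mashtree's distancesToPhylip code; `tsv_to_phylip` is a hypothetical function, and it assumes the genome1/genome2/distance header shown above, symmetric distances, and a self-distance of 0. It writes names at full length (relaxed PHYLIP), whereas strict PHYLIP limits names to 10 characters; whether either convention matters to Mashtree or quicktree is not something this sketch settles.

```python
import csv
import io


def tsv_to_phylip(tsv_text):
    """Convert genome1/genome2/distance rows into a relaxed PHYLIP matrix."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    names = []  # taxa in order of first appearance
    dist = {}   # dist[a][b] = distance between a and b
    for row in reader:
        a, b = row["genome1"], row["genome2"]
        for name in (a, b):
            if name not in dist:
                dist[name] = {name: 0.0}  # assumption: self-distance is 0
                names.append(name)
        d = float(row["distance"])
        dist[a][b] = d
        dist[b][a] = d  # assumption: distances are symmetric

    lines = [str(len(names))]
    for a in names:
        cells = []
        for b in names:
            if b not in dist[a]:
                raise ValueError(f"no distance recorded for {a} vs {b}")
            cells.append(f"{dist[a][b]:.6f}")
        lines.append(a + " " + " ".join(cells))
    return "\n".join(lines) + "\n"
```

With only the header line and no data rows, this function returns a lone `0`, which looks like the distances.phylip file shown earlier. That is only a resemblance; it does not establish what actually happens inside Mashtree when .msh files are given as input.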

...

[Edit:] When I tried the same set with the fasta sequence files as input, mashtree ran successfully. It would be good to be able to work with msh files as well, though.

PCas95 commented 1 month ago

Same issue here. I thought another behaviour I observed was related, but that doesn't seem to be the case.

Is the issue being worked on? It would be nice to be able to run mashtree on pre-calculated mash sketches.