gjospin / PhyloSift

Phylogenetic and taxonomic analysis for genomes and metagenomes

build_marker taxids invalid #468

Open KevinAMeyer opened 9 years ago

KevinAMeyer commented 9 years ago

Hello,

I'm trying to create a custom database to run in PhyloSift and keep getting error messages when building the database using The Monkey. I can index the created database, but the taxa file (taxa.csv) is empty, almost all of my tax_ids have been removed, and when I run phylosift all --custom using the database created by The Monkey it returns no hits whatsoever, not even to the reference genomes that I used to build the database.

Is this a file formatting issue? I'm not sure why I'm getting these error messages, as all of my taxon IDs are in the NCBI database files with the correct taxonomy.

Any assistance would be appreciated.

Cheers, Kevin A. Meyer

Here is the command and output from one such run in the terminal.

phylosift build_marker --force --alignment gene_alignment.fasta --taxonmap gi_mapping_file.txt --tree_pd 1

Checking integrity on File gene_alignment.fasta
Using /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.aln
ID_map is 28 long
FastTree Version 2.1.3 SSE3
Alignment: /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.clean
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Initial topology in 0.01 seconds
Refining topology: 18 rounds ME-NNIs, 2 rounds ME-SPRs, 9 rounds ML-NNIs
Total branch-length 1.798 after 0.07 sec
ML-NNI round 1: LogLk = -3459.471 NNIs 4 max delta 3.18 Time 0.27
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.839 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -3079.023 NNIs 1 max delta 0.03 Time 0.55
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3: LogLk = -3078.406 NNIs 1 max delta 0.02 Time 0.80 (final)
Optimize all lengths: LogLk = -3078.386 Time 0.86
Total time: 1.08 seconds Unique: 24/28 Bad splits: 0/21
PDA - Phylogenetic Diversity Analyzer version 0.5.2
Copyright (c) 2006-2007 Bui Quang Minh, Steffen Klaere and Arndt von Haeseler.

Reading tree file /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.tree ... Tree contains 28 taxa and 53 branches

Running PD algorithm on UNROOTED tree...

Greedy Algorithm... Time used: 0.000000 seconds.

Results are summarized in /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.tree.pruning.log

FastTree Version 2.1.3 SSE3
Alignment: /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.pruned.fasta
Amino acid distances: BLOSUM45 Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Jones-Taylor-Thorton, CAT approximation with 20 rate categories
Initial topology in 0.00 seconds
Refining topology: 6 rounds ME-NNIs, 2 rounds ME-SPRs, 3 rounds ML-NNIs
Total branch-length 0.598 after 0.00 sec
ML-NNI round 1: LogLk = -1231.497 NNIs 0 max delta 0.00 Time 0.00
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 0.735 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -1180.745 NNIs 0 max delta 0.00 Time 0.01
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 3: LogLk = -1180.745 NNIs 0 max delta 0.00 Time 0.01 (final)
Optimize all lengths: LogLk = -1180.368 Time 0.02
Total time: 0.02 seconds Unique: 3/3 Bad splits: 0/0
CLEAN_ALN : /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.clean
Looking for representatives
Reading NCBI taxonomy at /geomicro/data21/kevmey/share/phylosift/ncbi
Reading merged ncbi nodes
Done reading merged
Reading deleted ncbi nodes
Done reading deleted
Running taxit taxtable -d /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/../ncbi_taxonomy.db -t /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/tax_ids.txt -o /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/taxa.csv
Taxid 74547 not found in taxonomy.
Taxid 449447 not found in taxonomy.
Taxid 316278 not found in taxonomy.
Taxid 103690 not found in taxonomy.
Taxid 32051 not found in taxonomy.
Taxid 84588 not found in taxonomy.
Taxid 1148 not found in taxonomy.
Taxid 64471 not found in taxonomy.
Taxid 59919 not found in taxonomy.
Taxid 59922 not found in taxonomy.
Taxid 43989 not found in taxonomy.
Taxid 329726 not found in taxonomy.
Taxid 321327 not found in taxonomy.
Taxid 251221 not found in taxonomy.
Taxid 180281 not found in taxonomy.
Taxid 696747 not found in taxonomy.
Taxid 93059 not found in taxonomy.
Taxid 197221 not found in taxonomy.
Taxid 93060 not found in taxonomy.
Taxid 167539 not found in taxonomy.
Taxid 167555 not found in taxonomy.
Taxid 146891 not found in taxonomy.
Taxid 167546 not found in taxonomy.
Taxid 321332 not found in taxonomy.
Taxid 167542 not found in taxonomy.
Some taxids were invalid. Exiting.
Using the mapping stuff
taxit create -a "Guillaume Jospin" -d "simple package for reconciliation only" -l temp -f /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.clean -t /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.pruned.tree -s /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.pruned.log -P /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/temp_ref
creating tmpread /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.tmpread.fasta
MAPFILE : gi_mapping_file.txt
Return_hash has : 24 taxons
TAXON_ARRAY 24
ncbi tree has lots of nodes
Read a bunch of merged nodes
NCBI COUNT : 24
Making 53 Nodes
AFTER ncbi_subtree
/opt/packages/PhyloSift/1.0.1/bin/readconciler /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.subtree /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.tmpread.mangled /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.gene_map /geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.taxonmap
The reference tree has 53 nodes
The read tree has 4 nodes
Done with gene tree splits
Making gene tree map
3 genes mapped
Making species to gene tree map
Error no mapping found for 25 species mapped rs.size() 25
Finding best edges
AFTER readconciler
Running cd "/geomicro/data21/kevmey/share/phylosift/markers/gene_alignment";taxit create -c -d "Creating a reference package for PhyloSift for the gene_alignment marker" -l "gene_alignment" -f "/geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.clean" -t "/geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.pruned.tree" -s "/geomicro/data21/kevmey/share/phylosift/markers/gene_alignment/gene_alignment.pruned.log" -P "gene_alignment"
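
One way to double-check the claim that every taxon ID is present in the NCBI files is a small script along these lines. This is only a sketch: it assumes the standard NCBI taxdump files (nodes.dmp, merged.dmp, delnodes.dmp), whose first "|"-separated field is a tax_id, and it uses made-up toy IDs rather than the ones from the log. Note that the log shows taxit querying ncbi_taxonomy.db, which is a separate file, so a clean result here would not by itself explain the "not found in taxonomy" messages.

```python
def read_dmp_ids(lines):
    """Collect the first field (a tax_id) from NCBI taxdump-style lines."""
    ids = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Fields are separated by "|" with surrounding tabs.
        first = line.split("|")[0].strip()
        if first.isdigit():
            ids.add(int(first))
    return ids


def classify_taxids(taxids, nodes_lines, merged_lines=(), deleted_lines=()):
    """Label each tax_id as current, merged, deleted, or missing."""
    nodes = read_dmp_ids(nodes_lines)
    merged = read_dmp_ids(merged_lines)
    deleted = read_dmp_ids(deleted_lines)
    report = {}
    for taxid in taxids:
        if taxid in nodes:
            report[taxid] = "current"
        elif taxid in merged:
            report[taxid] = "merged"
        elif taxid in deleted:
            report[taxid] = "deleted"
        else:
            report[taxid] = "missing"
    return report


# Toy data only (hypothetical IDs, not real NCBI records).
toy_nodes = ["10\t|\t1\t|\tspecies\t|"]
toy_merged = ["20\t|\t10\t|"]
toy_deleted = ["30\t|"]

for taxid, status in classify_taxids([10, 20, 30, 40], toy_nodes, toy_merged, toy_deleted).items():
    print(taxid, status)
```

With real data, the three line sources would be open file handles on the dump files, and the ID list would come from the second column of whatever mapping file was passed to --taxonmap (the exact layout of that file is not shown in this thread).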

gjospin commented 9 years ago

Hello, First off, the taxa.csv file was part of a feature that did not work properly and that we ended up not fully developing, so you can ignore that part. I should spend some time to either remove or skip that part of the code so people don't get confused.

The PD number seems rather high, but I am not a tree expert, so maybe it will work for your purpose. It is intended to trim the resulting tree so that there are no branches shorter than that number. By default we use 0.01, I think. This ends up removing leaves that are very similar to each other, which reduces the ambiguity of the placements.

You will want to look at the reps file after the marker build. The representatives in it are picked to maximize PD, and they are the sequences used in the Lastal matching before aligning to the HMM.

I would suggest dialing down the tree_pd value and seeing how it goes. Your output seems fairly normal otherwise.
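
To illustrate why a large threshold can leave very few sequences (in the log above, the 28-taxon tree ends up with "Unique: 3/3" after pruning with --tree_pd 1), here is a deliberately simplified sketch. It is not the PDA algorithm that PhyloSift calls; it is a stand-in that greedily keeps a sequence only if it is at least `threshold` away from everything already kept, using hypothetical toy distances.

```python
def thin_by_distance(names, dist, threshold):
    """Greedily keep names whose distance to every kept name is >= threshold.

    `dist` maps frozenset({a, b}) to a pairwise distance.
    """
    kept = []
    for name in names:
        if all(dist[frozenset((name, other))] >= threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical pairwise distances between four sequences.
toy_dist = {
    frozenset(("A", "B")): 0.005,  # A and B are near-identical
    frozenset(("A", "C")): 0.4,
    frozenset(("B", "C")): 0.4,
    frozenset(("A", "D")): 1.2,
    frozenset(("B", "D")): 1.2,
    frozenset(("C", "D")): 1.1,
}

names = ["A", "B", "C", "D"]
print(thin_by_distance(names, toy_dist, 0.01))  # ['A', 'C', 'D']
print(thin_by_distance(names, toy_dist, 1.0))   # ['A', 'D']
```

With the small threshold only the near-duplicate is dropped, while the large one discards everything except the most divergent sequences.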
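
A quick way to see which sequences did not make it into the representatives is to compare FASTA headers between the input alignment and the reps file. The sketch below assumes both files are plain FASTA and that sequence IDs are comparable between them; the log mentions an ID_map, so PhyloSift may rename sequences internally, in which case the IDs would need translating first. The in-memory "files" are hypothetical.

```python
import io


def fasta_ids(handle):
    """Return the ID (first word of each '>' header) for every record."""
    ids = []
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def dropped_ids(alignment_handle, reps_handle):
    """List IDs present in the alignment but absent from the reps file."""
    reps = set(fasta_ids(reps_handle))
    return [seq_id for seq_id in fasta_ids(alignment_handle) if seq_id not in reps]


# Hypothetical stand-ins for the alignment and the reps file.
alignment = io.StringIO(">seq1 some description\nACGT\n>seq2\nAC-T\n>seq3\nACGA\n")
reps = io.StringIO(">seq1\nACGT\n>seq3\nACGA\n")
print(dropped_ids(alignment, reps))  # ['seq2']
```

For real files, replace the StringIO objects with open file handles on the alignment and on the reps file produced by the marker build.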

Let us know if you have more questions.

On Tue, Mar 31, 2015 at 12:52 PM, KevinAMeyer notifications@github.com wrote:


KevinAMeyer commented 9 years ago

Hello again,

Thanks for your response, but I'm still having issues with the custom marker build. Specifically, I build a marker package and then test it against the genome I extracted one of the genes from (a positive control), and I get back this error message:

rm: cannot remove `/geomicro/data21/kevmey/Research/cHABs/bioinformatics/Phylosift/Custom_marker_read/recA_genemarker/blastDir/.aa.1_': No such file or directory

This is perplexing because I know that one of the genes in the alignment I provided to the "phylosift build_marker" came from the very genome (.fna file) that I'm running "phylosift all --custom" on. Is this a problem with the build_marker process, the alignment file I provide it, or another issue entirely?

Cheers, Kevin

phylosift build_marker --force --alignment img_recA_CYANOS_muscle_edited2.fasta --tree_pd 0.01

Checking integrity on File img_recA_CYANOS_muscle_edited2.fasta
Using /geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2/img_recA_CYANOS_muscle_edited2.aln
ID_map is 101 long
DNA alignment detected
FastTree Version 2.1.3 SSE3
Alignment: /geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2/img_recA_CYANOS_muscle_edited2.clean
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
Initial topology in 0.13 seconds
Refining topology: 26 rounds ME-NNIs, 2 rounds ME-SPRs, 13 rounds ML-NNIs
Total branch-length 9.218 after 1.29 sec
ML-NNI round 1: LogLk = -46132.320 NNIs 13 max delta 9.08 Time 1.92
GTR Frequencies: 0.2700 0.2173 0.2742 0.2384
GTR rates(ac ag at cg ct gt) 2.0984 2.9900 1.4591 2.2669 6.5212 1.0000
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 1.032 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -37261.652 NNIs 11 max delta 12.56 Time 4.22
ML-NNI round 3: LogLk = -37249.123 NNIs 2 max delta 1.81 Time 4.63
ML-NNI round 4: LogLk = -37245.205 NNIs 1 max delta 3.80 Time 4.85
ML-NNI round 5: LogLk = -37234.476 NNIs 0 max delta 0.00 Time 4.97
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 6: LogLk = -37183.101 NNIs 5 max delta 6.98 Time 5.77 (final)
Optimize all lengths: LogLk = -37179.566 Time 5.96
Total time: 7.08 seconds Unique: 95/101 Bad splits: 6/92 Worst delta-LogLk 10.509
PDA - Phylogenetic Diversity Analyzer version 0.5.2
Copyright (c) 2006-2007 Bui Quang Minh, Steffen Klaere and Arndt von Haeseler.

Reading tree file /geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2/img_recA_CYANOS_muscle_edited2.tree ... Tree contains 101 taxa and 196 branches

Running PD algorithm on UNROOTED tree...

Greedy Algorithm... Time used: 0.000000 seconds.

Results are summarized in /geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2/img_recA_CYANOS_muscle_edited2.tree.pruning.log

DNA alignment detected
FastTree Version 2.1.3 SSE3
Alignment: /geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2/img_recA_CYANOS_muscle_edited2.pruned.fasta
Nucleotide distances: Jukes-Cantor Joins: balanced Support: SH-like 1000
Search: Normal +NNI +SPR (2 rounds range 10) +ML-NNI opt-each=1
TopHits: 1.00*sqrtN close=default refresh=0.80
ML Model: Generalized Time-Reversible, CAT approximation with 20 rate categories
Initial topology in 0.11 seconds
Refining topology: 26 rounds ME-NNIs, 2 rounds ME-SPRs, 13 rounds ML-NNIs
Total branch-length 9.159 after 1.12 sec
ML-NNI round 1: LogLk = -45551.964 NNIs 12 max delta 8.15 Time 1.68
GTR Frequencies: 0.2707 0.2196 0.2733 0.2364
GTR rates(ac ag at cg ct gt) 2.0782 2.9609 1.4628 2.2334 6.4271 1.0000
Switched to using 20 rate categories (CAT approximation)
Rate categories were divided by 1.030 so that average rate = 1.0
CAT-based log-likelihoods may not be comparable across runs
Use -gamma for approximate but comparable Gamma(20) log-likelihoods
ML-NNI round 2: LogLk = -36755.230 NNIs 14 max delta 12.54 Time 3.72
ML-NNI round 3: LogLk = -36732.922 NNIs 2 max delta 1.91 Time 4.12
ML-NNI round 4: LogLk = -36732.407 NNIs 2 max delta 0.38 Time 4.34
ML-NNI round 5: LogLk = -36730.823 NNIs 1 max delta 1.11 Time 4.43
ML-NNI round 6: LogLk = -36726.799 NNIs 1 max delta 3.97 Time 4.50
ML-NNI round 7: LogLk = -36726.798 NNIs 0 max delta 0.00 Time 4.55
Turning off heuristics for final round of ML NNIs (converged)
ML-NNI round 8: LogLk = -36687.740 NNIs 4 max delta 6.58 Time 5.24 (final)
Optimize all lengths: LogLk = -36685.322 Time 5.41
Total time: 6.38 seconds Unique: 83/83 Bad splits: 5/80 Worst delta-LogLk 8.361
CLEAN_ALN : /geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2/img_recA_CYANOS_muscle_edited2.clean
Looking for representatives
Running cd "/geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2";taxit create -c -d "Creating a reference package for PhyloSift for the img_recA_CYANOS_muscle_edited2 marker" -l "img_recA_CYANOS_muscle_edited2" -f "/geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2/img_recA_CYANOS_muscle_edited2.clean" -t "/geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2/img_recA_CYANOS_muscle_edited2.pruned.tree" -s "/geomicro/data21/kevmey/share/phylosift/markers/img_recA_CYANOS_muscle_edited2/img_recA_CYANOS_muscle_edited2.pruned.log" -P "img_recA_CYANOS_muscle_edited2"
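
Before digging into the build itself, it may be worth confirming the positive control independently of PhyloSift: does the gene from the alignment actually occur, ungapped, in the genome? The sketch below is a plain exact-match check on either strand, written under the assumption that both inputs are nucleotide strings (the log reports "DNA alignment detected"). The sequences are toy examples, and an exact match can fail for harmless reasons, for instance if the alignment was hand-edited.

```python
def reverse_complement(seq):
    """Reverse-complement an unambiguous DNA string."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def find_gene(genome, aligned_gene):
    """Return (strand, 0-based offset) of the ungapped gene in the genome, or None."""
    # Drop alignment gap characters before searching.
    gene = aligned_gene.replace("-", "").replace(".", "").upper()
    genome = genome.upper()
    if not gene:
        return None
    pos = genome.find(gene)
    if pos != -1:
        return ("+", pos)
    pos = genome.find(reverse_complement(gene))
    if pos != -1:
        return ("-", pos)
    return None


# Toy sequences, not taken from the real data.
toy_genome = "TTTACGGATCCAAA"
print(find_gene(toy_genome, "AC-GGA--TCC"))  # ('+', 3)
print(find_gene(toy_genome, "GGATCCGT"))     # ('-', 3)
print(find_gene(toy_genome, "GGGGGG"))       # None
```

If the gene is found this way but phylosift all --custom still reports nothing, that points away from the input sequences and toward the marker build or the search step.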