smirarab / sepp

Ensemble of HMM methods (SEPP, TIPP, UPP)
GNU General Public License v3.0
87 stars 38 forks source link

fragment-insertion via sept in latest QIIME release requires a single SeppReferenceDatabase artifact, this tutorial outputs 3 files #78

Closed jgrembi closed 3 months ago

jgrembi commented 4 years ago

Hi @smirarab, some of my labmates have followed the instructions here to make a reference database of our own, which resulted in the 3 files you describe: original alignment, final tree and final info file. The previous version of fragment-insertion allowed the user to provide these three files, but the new version (2019.10 qiime release) only requests a single file, SeppReferenceDatabase. How can I make these 3 files into the single database artifact?
Thanks, jess

sjanssen2 commented 4 years ago

Hi @jgrembi, the Qiime2 core team decided to merge those three files into one artefact to avoid situations where users might combine mismatching versions. Could you maybe ask your question in the qiime2 forum: https://forum.qiime2.org/ as this regards only the q2 plugin, not Siavash's original code base. Or maybe the github repo page for the q2 plugin: https://github.com/qiime2/q2-fragment-insertion I am not 100% sure if you already exposed the functionality to import the three reference files into one artifact.

sjanssen2 commented 4 years ago

there is the SeppReferenceDirFmt if you ask for qiime tools import --show-importable-formats and SeppReferenceDatabase for qiime tools import --show-importable-types, but I don't see documentation about how to actually compose the command that would import all three files. Ping @thermokarst and @ebolyen

thermokarst commented 4 years ago

but I don't see documentation about how to actually compose the command that would import all three files

Check out this script for an example: https://github.com/qiime2/q2-fragment-insertion/blob/master/bin/import_references.py

FranckLejzerowicz commented 4 years ago

Hi @thermokarst, I noticed that the import command you provided is not working for raxml info outputs that - I think - are not properly formatted. Could it be that only those raxml_info files generated by a specific version of raxml are possibly usable for the making of a SeppReferenceDatabase?

I am working with qiime2-2020.8 and had success in re-generating the artifact for the distributed sepp-refs-silva-128.2.qza file, as follows:

qiime tools export --input-path sepp-refs-silva-128.qza --output-path sepp-refs-silva-128

and then:

qiime tools import --input-path sepp-refs-silva-128 --output-path sepp-refs-silva-128.2.qza --type SeppReferenceDatabase

Note the content of raxml-info.txt:

This is a RAxML_info file from a -f e run, automatically reformatted

IMPORTANT WARNING: Sequences AYWZ01000003.289952.291440 and AYWM01000003.317110.318724 are exactly identical

IMPORTANT WARNING: Sequences AYWZ01000003.289952.291440 and AYWY01000007.130624.132186 are exactly identical

IMPORTANT WARNING: Sequences AOQL01000111.3210.4799 and AOQI01000050.40615.42166 are exactly identical

IMPORTANT WARNING: Sequences AOQL01000111.3210.4799 and AOQF01000068.44124.45733 are exactly identical

IMPORTANT WARNING: Sequences KJ819948.1.2035 and EU295654.1.2035 are exactly identical

IMPORTANT WARNING: Sequences CIJV01000053.71.1639 and GBKB01000906.322.1853 are exactly identical

IMPORTANT WARNING: Sequences KM209255.204.1909 and JXRS01000009.94998.96519 are exactly identical

IMPORTANT WARNING: Sequences CP003679.279736.281336 and HM165228.2594.4130 are exactly identical

IMPORTANT WARNING: Sequences AB013180.4.2322 and AB293490.1.1835 are exactly identical

IMPORTANT WARNING: Sequences AOFL01019584.5105.6960 and AJSU01018419.1658.3439 are exactly identical

IMPORTANT WARNING: Sequences EF136918.1.1816 and EF136916.1.1766 are exactly identical

IMPORTANT WARNING: Sequences AMRP01000748.63.1859 and JMKK01009071.20191.22014 are exactly identical

IMPORTANT WARNING: Sequences FR717538.2.2163 and FR717537.2.2311 are exactly identical

IMPORTANT WARNING: Sequences U59066.1.2135 and AF026601.1.1720 are exactly identical

IMPORTANT WARNING: Sequences LVMQ01000015.1.1360 and LVPF01000010.3538.4915 are exactly identical

IMPORTANT WARNING: Sequences AWOK01009499.3538.5414 and KR780036.108404.110282 are exactly identical

IMPORTANT WARNING: Sequences CCUE01000001.2265584.2267283 and CCVB01000003.315854.317534 are exactly identical

IMPORTANT WARNING: Sequences AOQJ01000190.175.1743 and AOQK01000068.8142.9693 are exactly identical

IMPORTANT WARNING: Sequences KJ867127.1.1346 and KJ867148.1.1346 are exactly identical

IMPORTANT WARNING: Sequences JQ925246.1.1036 and JQ925220.1.1056 are exactly identical

IMPORTANT WARNING: Sequences AJSU01002193.6412.8209 and AJSU01020019.6785.8656 are exactly identical

IMPORTANT WARNING
Found 21 sequences that are exactly identical to other sequences in the alignment.
Normally they should be excluded from the analysis.

An alignment file with sequence duplicates removed has already
been printed to file 99_otus_aligned_masked1977.fasta.reduced

Using BFGS method to optimize GTR rate parameters, to disable this specify "--no-bfgs" 

This is RAxML version 8.2.3 released by Alexandros Stamatakis on August 12 2015.

With greatly appreciated code contributions by:
Andre Aberer      (HITS)
Simon Berger      (HITS)
Alexey Kozlov     (HITS)
Kassian Kobert    (HITS)
David Dao         (KIT and HITS)
Nick Pattengale   (Sandia)
Wayne Pfeiffer    (SDSC)
Akifumi S. Tanabe (NRIFS)

Alignment has 2349 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 39.17%

RAxML Model Optimization up to an accuracy of 0.100000 log likelihood units

Using 1 distinct models/data partitions with joint branch length optimization

All free model parameters will be estimated by RAxML
ML estimate of 25 per site rate categories

Likelihood of final tree will be evaluated and optimized under GAMMA

GAMMA Model parameters will be estimated up to an accuracy of 0.1000000000 Log Likelihood units

Partition: 0
Alignment Patterns: 2349
Name: No Name Provided
DataType: DNA
Substitution Matrix: GTR

RAxML was called as follows:

raxmlHPC-PTHREADS -s 99_otus_aligned_masked1977.fasta -m GTRCAT -n score-bl-99_otus_aligned_masked1977.fasta -F -f e -t RAxML_result.scoreF-99_otus_aligned_masked1977.fasta -T 24 -p 10625 

Base frequencies: 0.253383 0.227847 0.308181 0.210589

Inference[0]: Time 11515.273362 CAT-based likelihood -0000, best rearrangement setting 5
alpha[0]: 1.000000 rates[0] ac ag at cg ct gt: 0.909407 1.884930 1.277639 0.833205 3.120449 1.000000

NOT conducting any final model optimizations on all 1 trees under CAT-based model ....

Final GAMMA  likelihood: -74694172.668249

The very first line states that This is a RAxML_info file from a -f e run, automatically reformatted. I think this reformatting is critical for qiime2 to import a SeppReferenceDatabase artifact, right?

You got me: instead of using the python API as in the example you kindly shared above (https://github.com/qiime2/q2-fragment-insertion/blob/master/bin/import_references.py), I wanted to try my chance using qiime tools import on a home-brewed tree 😺

So, I have generated a raxml tree based on a pre-computed alignment using (note that it is a pity that qiime phylogeny raxml does not output the info file, but that is for another issue) using RAxML version 8.2.3, as for the sepp-refs-silva-128.2.qza above:

raxmlHPC-PTHREADS -s myalignment.phy  -m GTRGAMMA -n result -F -f e -t myguidetree.tre -T 6 -p 666

I obtain:

Using BFGS method to optimize GTR rate parameters, to disable this specify "--no-bfgs" 

This is the RAxML Master Pthread

This is RAxML Worker Pthread Number: 1

This is RAxML Worker Pthread Number: 4

This is RAxML Worker Pthread Number: 3

This is RAxML Worker Pthread Number: 2

This is RAxML Worker Pthread Number: 5

This is RAxML version 8.2.3 released by Alexandros Stamatakis on August 12 2015.

With greatly appreciated code contributions by:
Andre Aberer      (HITS)
Simon Berger      (HITS)
Alexey Kozlov     (HITS)
Kassian Kobert    (HITS)
David Dao         (KIT and HITS)
Nick Pattengale   (Sandia)
Wayne Pfeiffer    (SDSC)
Akifumi S. Tanabe (NRIFS)

Alignment has 738 distinct alignment patterns

Proportion of gaps and completely undetermined characters in this alignment: 51.55%

RAxML Model Optimization up to an accuracy of 0.100000 log likelihood units

Using 1 distinct models/data partitions with joint branch length optimization

All free model parameters will be estimated by RAxML
GAMMA model of rate heteorgeneity, ML estimate of alpha-parameter

GAMMA Model parameters will be estimated up to an accuracy of 0.1000000000 Log Likelihood units

Partition: 0
Alignment Patterns: 738
Name: No Name Provided
DataType: DNA
Substitution Matrix: GTR

RAxML was called as follows:

raxmlHPC-PTHREADS -s myalignment.phy  -m GTRGAMMA -n result -F -f e -t myguidetree.tre -T 6 -p 666

Model parameters (binary file format) written to: 
RAxML_binaryModelParameters.result
15.087412 -225550.719596

Overall Time for Tree Evaluation 15.114338
Final GAMMA  likelihood: -225550.719596

Number of free parameters for AIC-TEST(BR-LEN): 4374
Number of free parameters for AIC-TEST(NO-BR-LEN): 9

Model Parameters of Partition 0, Name: No Name Provided, Type of Data: DNA
alpha: 0.906663
Tree-Length: 253.809239
rate A <-> C: 1.119242
rate A <-> G: 1.540015
rate A <-> T: 1.146196
rate C <-> G: 1.028338
rate C <-> T: 2.398896
rate G <-> T: 1.000000

freq pi(A): 0.267846
freq pi(C): 0.223972
freq pi(G): 0.244260
freq pi(T): 0.263922

Final tree written to:                 RAxML_result.result
Execution Log File written to:         RAxML_log.result

As such, the error shows up:

$ qiime tools import --input-path SEPP --output-path SEPP.qza --type SeppReferenceDatabase
There was a problem importing SEPP:

  SEPP/raxml-info.txt is not a(n) RAxMLinfoFormat file:

  Missing structured content: "Base frequencies".

However, if I edit the raxml-info.txt fil, i.e. replacing:

Model parameters (binary file format) written to: 
RAxML_binaryModelParameters.result
15.087412 -225550.719596

Overall Time for Tree Evaluation 15.114338
Final GAMMA  likelihood: -225550.719596

Number of free parameters for AIC-TEST(BR-LEN): 4374
Number of free parameters for AIC-TEST(NO-BR-LEN): 9

Model Parameters of Partition 0, Name: No Name Provided, Type of Data: DNA
alpha: 0.906663
Tree-Length: 253.809239
rate A <-> C: 1.119242
rate A <-> G: 1.540015
rate A <-> T: 1.146196
rate C <-> G: 1.028338
rate C <-> T: 2.398896
rate G <-> T: 1.000000

freq pi(A): 0.267846
freq pi(C): 0.223972
freq pi(G): 0.244260
freq pi(T): 0.263922

Final tree written to:                 RAxML_result.result
Execution Log File written to:         RAxML_log.result

by

Final GAMMA  likelihood: -225550.719596

Base frequencies: 0.267846 0.223972 0.244260 0.263922

alpha[0]: 1.000000 rates[0] ac ag at cg ct gt: 1.119242 1.540015 1.146196 1.028338 2.398896 1.000000

then, it works!

However, I am not 100% sure that what I am doing is right. Could you confirm that only the Base frequencies from the error are used, and that reformatting these on the ac ag at cg ct gt one liner is an acceptable solution?

Otherwise, I'm interested by another solution!

Sorry for the long post, and for this in Siavash's code as this is more of a Q2 issue.. Thanks!

smirarab commented 3 years ago

This is the script I used to reformat RAxML files: https://github.com/smirarab/sepp/blob/master/sepp-package/buildref/reformat-info.py

More generally, this page may be of use: https://github.com/smirarab/sepp/tree/master/sepp-package/buildref