liberjul / CONSTAXv2

MIT License

Problems at CombineTaxonomy step #2

Closed andnischneider closed 3 years ago

andnischneider commented 3 years ago

Hi, I have been trying to run CONSTAX on the server at our institute (Ubuntu 18.04.5). I set up a conda environment using miniconda3 and, after manually specifying the pathfile, got it to run. It seems to do fine until it hits an error at the classification step using vsearch, where a file ("genus_wordConditionalProbList.txt") cannot be found (this might propagate from the previous step; see the attached file). This leads to a fatal error in the following step (combining taxonomy). I have attached my stdout and stderr in a txt file; let me know if you need any other info! out.txt

Best Andreas

Gian77 commented 3 years ago

Hello @andnischneider

Thanks for using CONSTAX. As I can see from your out.txt file, the problem is related to the original UNITE DB and the RDP classifier rather than CONSTAX itself. See here:

Training RDP Classifier
edu.msu.cme.rdp.classifier.train.NameRankDupException: Error: duplicate taxon name and rank in the taxonomy file.
cylindrium  genus   2
cenangiopsis    genus   2
aleurina    genus   2
brevicollum genus   2
cryptococcus    genus   2

Basically, RDP does not allow a genus to be placed in more than one family (convergent evolution, in evolutionary terms), and UNITE contains a few such cases. For example, some SH representative sequences of Cryptococcus are in the Tremellaceae and others in the Cryptococcaceae. Since RDP cannot build a taxonomy from this, the consensus cannot be generated and CONSTAX runs into an error.

The solution is to correct those entries in the UNITE db before running CONSTAX again. You can do this easily with gedit, since you are on Ubuntu (or with another text editor): make sure all the SH representative sequences for each of those genera have the same hierarchy above the genus level.

Hope it helps.

Gian
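The duplicate check described above can be sketched as a small shell function. This is a minimal sketch and not part of CONSTAX: the function name `find_split_genera` is hypothetical, and it assumes UNITE-style headers whose lineage is semicolon-separated with `f__` and `g__` prefixes (for example `...;f__Tremellaceae;g__Cryptococcus;s__...`). Skipping genera labelled `unidentified` is also an assumption.

```shell
#!/bin/bash

# Hypothetical helper: list genera that appear under more than one family.
# Assumes headers carry a lineage such as
#   >name|accession|SH|refs|k__Fungi;p__...;f__Family;g__Genus;s__Species
find_split_genera() {
    grep '^>' "$1" | awk '
        {
            fam = ""; gen = ""
            n = split($0, ranks, ";")
            for (i = 1; i <= n; i++) {
                if (ranks[i] ~ /^f__/) fam = substr(ranks[i], 4)
                if (ranks[i] ~ /^g__/) gen = substr(ranks[i], 4)
            }
            # Skip headers without both ranks, and placeholder genus names (assumption).
            if (gen == "" || fam == "" || gen == "unidentified") next
            key = gen SUBSEP fam
            if (!(key in seen)) {
                seen[key] = 1
                # Record each distinct family once per genus, in order of appearance.
                if (count[gen]++ == 0) families[gen] = fam
                else families[gen] = families[gen] "," fam
            }
        }
        END {
            for (g in count)
                if (count[g] > 1) print g "\t" families[g]
        }
    ' | sort
}
```

Calling `find_split_genera` on the UNITE FASTA before training should print one line per affected genus, followed by the families it is listed under, which are the entries that need to be harmonised by hand.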

andnischneider commented 3 years ago

Changing the UNITE entries from the listed genera to be from the same family seems to have solved the issue, thanks!

andnischneider commented 3 years ago

Hello again,

The training seems to work okay after fixing the UNITE db issue, but I now get stuck at the classification and/or combining taxonomy stages (see attached file again). From what I can tell there seems to be a problem with the formatting of the sintax output file?

Cheers Andreas out.txt

Gian77 commented 3 years ago

Hello @andnischneider

Can you please try to change the CONSTAX PATH you added in your pathfile.txt to

CONSTAX: /mnt/picea/home/aschneider/miniconda3/envs/constax/opt/constax-2.0.6/

and rerun?

Cheers,

Gian

andnischneider commented 3 years ago

Hi @Gian77 ,

Thanks for your response. I had already changed the CONSTAX PATH in the pathfile before my first message, because of an earlier error message about the CONSTAX path. I just double-checked, and it is set to exactly what you specified. I am also confused about where the "SINTAX executable does not match the executable used to generate the training files" message comes from. Maybe I need to purge the installation completely and rerun with a cleaned db?

Best Andreas

Gian77 commented 3 years ago

Try re-training UNITE (-t) first. Also, can you send me the command you are using to run it? Thanks,

Gian

andnischneider commented 3 years ago

I have attached my out file including the training step and the pathfile, and here is the shell script I used to run constax:

```
#!/bin/bash

home=/mnt/picea/home/aschneider

constax \
  --num_threads=8 \
  --mem=32000 \
  --db=${home}/constax/db/unite_clean.fasta \
  --input=${home}/constax/tutorial/ITS2_soil_500_otu.fasta \
  --trainfile=${home}/constax/tutorial/training_files/ \
  --tax=${home}/constax/tutorial/taxonomy/ \
  --output=${home}/constax/tutorial/taxonomy/ \
  --pathfile=${home}/miniconda3/envs/constax/opt/constax-2.0.6/pathfile.txt \
  --conf=0.8 \
  --blast \
  --train
```

Cheers, Andreas

pathfile.txt

out.txt

liberjul commented 3 years ago

Hi Andreas,

Could you upload the file at ${home}/constax/tutorial/taxonomy/otu_taxonomy.sintax? For some reason it is failing the formatting check. It may still be valid but the formatting check could be failing unnecessarily.

As for the initial duplicate taxa problem, the training approach for SILVA databases already accounts for that and I will implement the same approach with UNITE databases if duplicate taxa are detected.

Thanks, Julian

andnischneider commented 3 years ago

Hi Julian

Thanks!

I have attached the file you requested. For some reason, only one of the OTUs (these are from one of the CONSTAX example files) got any assignment at all. Is that normal?

Best Andreas

otu_taxonomy.sintax.zip

liberjul commented 3 years ago

It appears that only 1 OTU was classified. Some possibilities:

Where was the training database from (do you have the link)? Can you check if the size is as expected based on the one you downloaded?

I will need to fix the formatting check to allow for an unclassified first OTU.

Julian
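One way to compare a local database against a fresh download is to summarise both files and compare the numbers. The function below is a minimal sketch under stated assumptions: the name `fasta_stats` is hypothetical, and it only reports the number of records and the minimum, maximum, and total sequence length of a FASTA file, treating every non-header line as sequence.

```shell
#!/bin/bash

# Hypothetical helper: print record count and sequence length summary for a FASTA file.
fasta_stats() {
    awk '
        # Fold the length of the record just finished into the running summary.
        function tally() {
            total += len
            if (min == "" || len < min) min = len
            if (len > max) max = len
        }
        /^>/ { if (n) tally(); n++; len = 0; next }
        # Sequence line: ignore whitespace and carriage returns, then add its length.
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END {
            if (n) tally()
            printf "sequences=%d min=%d max=%d total=%d\n", n, min, max, total
        }
    ' "$1"
}
```

If the edited database and the original download give different sequence counts or totals, the edit changed more than the header lines. A summary like this works on whole records, so it would not reveal line wrapping on its own.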

andnischneider commented 3 years ago

Hello again,

I don't believe it is the database: RDP assigned taxonomy to most of the OTUs, and I have attached my BLAST and RDP output files. I used the UNITE fasta release (without singletons) available for download here: https://plutof.ut.ee/#/doi/10.15156/BIO/786368 , and then changed only a few of the sequence names to account for the issue that some genera are listed in 2 different families. The sequence length distribution looks normal to me, and I did not tamper with the sequences. How many of the OTUs from your test file (ITS2_soil_500_otus.fasta) get classified with the three methods when you run CONSTAX?

Best Andreas

Archive.zip

liberjul commented 3 years ago

Unfortunately, I could not replicate your error. When I classified with the given database and OTUs, I had identification for most of them. I would recommend trying:

1) Update CONSTAX with `conda install constax -c liberjul` (the bioconda update is being approved).
2) Redownload the same database you shared the link to.
3) Run CONSTAX with the same settings as before.

I may need to change how the pathfile is handled when the installation is within an environment, so you should probably supply the path.

andnischneider commented 3 years ago

Hello again,

I have finally had time to look at this again, and it seems you were more or less right earlier about the truncated sequence lengths. I used R to correct the "double" taxonomies in the database, and when it wrote the corrected sequence set to a fasta file, it set the line width to 80 by default. This worked fine for the RDP and BLAST parts, but SINTAX then seems to take only the first line of every entry when reformatting the database (in this case the first 80 bases), which leads to the problems I saw previously. It now makes sense that nothing gets assigned when the database used to assign taxonomy to ITS2 region OTUs contains only the first 80 bases of each sequence! After making sure the sequences all end up on a single line, it now runs all the way through and I get meaningful results to work with :) I have tried it on some of my own data and it seems to perform equally well there. Thanks again for your support!

Cheers Andreas
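For anyone hitting the same wrapped-FASTA problem, the fix described above can be sketched in shell. This is a minimal sketch, not part of CONSTAX: the function names `unwrap_fasta` and `has_wrapped_sequences` are hypothetical, and the code assumes a plain FASTA file in which every line that does not start with `>` is sequence.

```shell
#!/bin/bash

# Hypothetical helper: write each sequence on a single line (to stdout).
unwrap_fasta() {
    awk '
        # Header: flush the sequence collected so far, then print the header.
        /^>/ { if (seq != "") print seq; seq = ""; print; next }
        # Sequence line: strip trailing whitespace and append to the current record.
        { gsub(/[ \t\r]+$/, ""); seq = seq $0 }
        END { if (seq != "") print seq }
    ' "$1"
}

# Hypothetical helper: exit status 0 if any record spans more than one sequence line.
has_wrapped_sequences() {
    awk '
        /^>/ { lines = 0; next }
        NF   { if (++lines > 1) { found = 1; exit } }
        END  { exit !found }
    ' "$1"
}
```

A possible use is to run `has_wrapped_sequences` on the database after editing it, and, if it succeeds, redirect the output of `unwrap_fasta` to a new file and train on that file instead.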