Closed nicolazarte closed 10 months ago
After reviewing the genomes using NCBI Blast, I observed the presence of some eukaryotic sequences. It appears that these sequences were misclassified in prokaryotic bins, causing GTDB-tk to be unable to classify them.
To address this issue, I managed to proceed by excluding all lines corresponding to the misidentified genomes from the following files:
veba_output/annotation/output/annotations.proteins.tsv.gz
veba_output/cluster/output/global/identifier_mapping.proteins.tsv.gz
I achieved this by running the following command with both files and then replacing them:
zcat annotations.proteins.tsv.gz | awk '!/P5_S2_R__METABAT2__P.1__bin.20/ && !/M7_S10_R__CONCOCT__P.1__168/ && !/M10_1_S12_R__CONCOCT__P.1__46_sub/ && !/P10_S5_R__METABAT2__P.1__bin.28_sub/' > annotations.proteins.tsv_newfile
I acknowledge that this approach may result in some data loss, but I opted to continue with the pipeline to maintain progress. Any suggestions or observations are greatly appreciated.
Just seeing this now. I don't think those are eukaryotic organisms and I think those are archaea. Can you check this directory:
ls -lhS veba_output/classify/prokaryotic/intermediate/1__gtdbtk/
If so, I just fixed that error and testing the 1.5.0 update. Should be pushing this update Feb 1st.
Indeed, I checked the dir and inside the file gtdbtk.ar53.summary.tsv I found the problematic genomes.
1__gtdbtk]$ ls -lhS
total 44K
-rw-r--r--. 1 ec2-user ec2-user 11K Jan 22 17:13 gtdbtk.log
-rw-r--r--. 1 ec2-user ec2-user 6.4K Jan 22 17:13 gtdbtk.json
drwxr-xr-x. 2 ec2-user ec2-user 6.0K Jan 22 17:13 align
drwxr-xr-x. 3 ec2-user ec2-user 6.0K Jan 22 17:13 classify
drwxr-xr-x. 2 ec2-user ec2-user 6.0K Jan 22 17:13 identify
lrwxrwxrwx. 1 ec2-user ec2-user 34 Jan 22 17:13 gtdbtk.bac120.summary.tsv -> classify/gtdbtk.bac120.summary.tsv
lrwxrwxrwx. 1 ec2-user ec2-user 32 Jan 22 16:09 gtdbtk.ar53.summary.tsv -> classify/gtdbtk.ar53.summary.tsv
-rw-r--r--. 1 ec2-user ec2-user 0 Jan 22 15:33 gtdbtk.warnings.log
user_genome classification fastani_reference fastani_reference_radius fastani_taxonomy fastani_ani fastani_af closest_placement_reference closest_placement_radius >
M10_1_S12_R__CONCOCT__P.1__46_sub d__Archaea;p__Thermoproteota;c__Nitrososphaeria;o__Nitrososphaerales;f__Nitrosopumilaceae;g__Nitrosotenuis;s__ N/A N/A N/A N/A N/A GCA_005240>
M7_S10_R__CONCOCT__P.1__168 d__Archaea;p__Thermoproteota;c__Nitrososphaeria;o__Nitrososphaerales;f__Nitrosopumilaceae;g__Nitrosopumilus;s__ N/A N/A N/A N/A N/A GCF_003175215.1 95>
P10_S5_R__METABAT2__P.1__bin.28_sub d__Archaea;p__Halobacteriota;c__Halobacteria;o__Halobacteriales;f__Halalkalicoccaceae;g__Halalkalicoccus;s__ N/A N/A N/A N/A N/A N/A N/>
P5_S2_R__METABAT2__P.1__bin.20 d__Archaea;p__Thermoplasmatota;c__EX4484-6;o__EX4484-6;f__EX4484-6;g__EX4484-6;s__ N/A N/A N/A N/A N/A GCA_002254385.1 95.0 d__Archaea;p__Ther>
I'll wait for the update then, thanks again!
In the meantime, just concatenate those two together as the new taxonomy.tsv
for the prokaryotes. What ended up happening was that I hardcoded the bacteria and archaea summary file names from GTDB-Tk. In one of the updates, they changed the names up for the archaea and I didn't catch it. The new version should be able to handle any naming schemes they use as long as it follows [domain][marker_number].summary.tsv
. Apologies for any inconvenience but luckily VEBA is pretty modular so this change shouldn't have affected much. Though, you'll need to redo the HUMAnN db creation.
Closing this issue but feel free to open it up if you need more help.
Describe the bug: When trying to perform pathway profiling I can't complete it because the number of genomes in taxonomy does not match the number of genomes in "identifier_mapping". On further review I found that the missing genomes are binned on prokaryotes but don't get classified by GTDB-tk. How can I continue the pathway profiling?
Versions
Command used to produce error:
Please provide the following files:
1__gtdbtk.e.txt 1__gtdbtk.o.txt 1__gtdbtk.returncode.txt P5_S2_R_bins.list.txt taxonomy.tsv.txt
Thanks in advance