jolespin / veba

A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes
GNU Affero General Public License v3.0
77 stars 9 forks source link

[Bug] classify-prokaryotic.py excludes archaeal genomes (Downstream Issue in building custom HUMAnN db) #48

Closed nicolazarte closed 10 months ago

nicolazarte commented 10 months ago

Describe the bug: When trying to perform pathway profiling I can't complete it because the number of genomes in taxonomy does not match the number of genomes in "identifier_mapping". On further review I found that the missing genomes are binned on prokaryotes but don't get classified by GTDB-tk. How can I continue the pathway profiling?

Versions

(VEBA-profile_env) [ec2-user@ip-172-31-82-237 workdir]$ profile-pathway.py -v
profile-pathway.py v2023.10.16

Command used to produce error:

zcat veba_output/cluster/output/global/identifier_mapping.proteins.tsv.gz  | cut -f1,3 | tail -n +2 | compile_custom_humann_database_from_annotations.py -a veba_output/annotation/output/annotations.proteins.tsv.gz -s veba_output/misc/all_genomes.all_proteins.lt100k.faa -o veba_output/misc/humann_uniref_annotations.tsv -t veba_output/classify/taxonomy_classifications.tsv
--identifier_mapping <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
 * 2034593 proteins
 * 663 genomes
--annotations veba_output/annotation/output/annotations.proteins.tsv.gz
 * 1890856 proteins
 * 1007210 UniRef hits
--taxonomy veba_output/classify/taxonomy_classifications.tsv
 * 659 genomes
 * 411 taxonomic classifications
Calculating length of proteins: veba_output/misc/all_genomes.all_proteins.lt100k.faa: 2034593 Proteins [00:06, 323460.54 Proteins/s]
Traceback (most recent call last):
  File "/home/ec2-user/veba_efs/mambaforge/envs/VEBA-profile_env/bin/compile_custom_humann_database_from_annotations.py", line 119, in <module>
    main()
  File "/home/ec2-user/veba_efs/mambaforge/envs/VEBA-profile_env/bin/compile_custom_humann_database_from_annotations.py", line 95, in main
    assert A2 == C1, "Genomes in --identifier_mapping do not match genomes in --taxonomy.\n\nThe following genomes are specific to --identifier_mapping: {}\n\nThe following genomes are specific to --taxonomy".format("\n".join(A2 - C1), "\n".join(C1 - A2))
AssertionError: Genomes in --identifier_mapping do not match genomes in --taxonomy.

The following genomes are specific to --identifier_mapping: P5_S2_R__METABAT2__P.1__bin.20
M7_S10_R__CONCOCT__P.1__168
M10_1_S12_R__CONCOCT__P.1__46_sub
P10_S5_R__METABAT2__P.1__bin.28_sub

The following genomes are specific to --taxonomy

Please provide the following files:

1__gtdbtk.e.txt 1__gtdbtk.o.txt 1__gtdbtk.returncode.txt P5_S2_R_bins.list.txt taxonomy.tsv.txt

Thanks in advance

nicolazarte commented 10 months ago

After reviewing the genomes using NCBI Blast, I observed the presence of some eukaryotic sequences. It appears that these sequences were misclassified in prokaryotic bins, causing GTDB-tk to be unable to classify them.

To address this issue, I managed to proceed by excluding all lines corresponding to the misidentified genomes from the following files:

  1. veba_output/annotation/output/annotations.proteins.tsv.gz
  2. veba_output/cluster/output/global/identifier_mapping.proteins.tsv.gz

I achieved this by running the following command with both files and then replacing them:

zcat annotations.proteins.tsv.gz | awk '!/P5_S2_R__METABAT2__P.1__bin.20/ && !/M7_S10_R__CONCOCT__P.1__168/ && !/M10_1_S12_R__CONCOCT__P.1__46_sub/ && !/P10_S5_R__METABAT2__P.1__bin.28_sub/' > annotations.proteins.tsv_newfile

I acknowledge that this approach may result in some data loss, but I opted to continue with the pipeline to maintain progress. Any suggestions or observations are greatly appreciated.

jolespin commented 10 months ago

Just seeing this now. I don't think those are eukaryotic organisms and I think those are archaea. Can you check this directory:

ls -lhS veba_output/classify/prokaryotic/intermediate/1__gtdbtk/

If so, I just fixed that error and testing the 1.5.0 update. Should be pushing this update Feb 1st.

nicolazarte commented 10 months ago

Indeed, I checked the dir and inside the file gtdbtk.ar53.summary.tsv I found the problematic genomes.

1__gtdbtk]$ ls -lhS
total 44K
-rw-r--r--. 1 ec2-user ec2-user  11K Jan 22 17:13 gtdbtk.log
-rw-r--r--. 1 ec2-user ec2-user 6.4K Jan 22 17:13 gtdbtk.json
drwxr-xr-x. 2 ec2-user ec2-user 6.0K Jan 22 17:13 align
drwxr-xr-x. 3 ec2-user ec2-user 6.0K Jan 22 17:13 classify
drwxr-xr-x. 2 ec2-user ec2-user 6.0K Jan 22 17:13 identify
lrwxrwxrwx. 1 ec2-user ec2-user   34 Jan 22 17:13 gtdbtk.bac120.summary.tsv -> classify/gtdbtk.bac120.summary.tsv
lrwxrwxrwx. 1 ec2-user ec2-user   32 Jan 22 16:09 gtdbtk.ar53.summary.tsv -> classify/gtdbtk.ar53.summary.tsv
-rw-r--r--. 1 ec2-user ec2-user    0 Jan 22 15:33 gtdbtk.warnings.log
user_genome     classification  fastani_reference       fastani_reference_radius        fastani_taxonomy        fastani_ani     fastani_af      closest_placement_reference     closest_placement_radius  >
M10_1_S12_R__CONCOCT__P.1__46_sub       d__Archaea;p__Thermoproteota;c__Nitrososphaeria;o__Nitrososphaerales;f__Nitrosopumilaceae;g__Nitrosotenuis;s__  N/A     N/A     N/A     N/A     N/A     GCA_005240>
M7_S10_R__CONCOCT__P.1__168     d__Archaea;p__Thermoproteota;c__Nitrososphaeria;o__Nitrososphaerales;f__Nitrosopumilaceae;g__Nitrosopumilus;s__ N/A     N/A     N/A     N/A     N/A     GCF_003175215.1 95>
P10_S5_R__METABAT2__P.1__bin.28_sub     d__Archaea;p__Halobacteriota;c__Halobacteria;o__Halobacteriales;f__Halalkalicoccaceae;g__Halalkalicoccus;s__    N/A     N/A     N/A     N/A     N/A     N/A     N/>
P5_S2_R__METABAT2__P.1__bin.20  d__Archaea;p__Thermoplasmatota;c__EX4484-6;o__EX4484-6;f__EX4484-6;g__EX4484-6;s__      N/A     N/A     N/A     N/A     N/A     GCA_002254385.1 95.0    d__Archaea;p__Ther>

I'll wait for the update then, thanks again!

jolespin commented 10 months ago

In the meantime, just concatenate those two together as the new taxonomy.tsv for the prokaryotes. What ended up happening was that I hardcoded the bacteria and archaea summary file names from GTDB-Tk. In one of the updates, they changed the names up for the archaea and I didn't catch it. The new version should be able to handle any naming schemes they use as long as it follows [domain][marker_number].summary.tsv. Apologies for any inconvenience but luckily VEBA is pretty modular so this change shouldn't have affected much. Though, you'll need to redo the HUMAnN db creation.

jolespin commented 10 months ago

Closing this issue but feel free to open it up if you need more help.