[REPORT] Updating anvi'o SCG taxonomy databases with new genes

So we recently decided to change the collection of SCGs in anvi'o that are used to infer taxonomy of genomes rapidly. The SCG data are coming from the GTDB, and the purpose and the utility of this functionality is explained here. The bottom-line is that it helps a lot during binning to have that real time taxonomy estimation:

But then we found other uses for the SCG taxonomy framework, including its use in @mschecht's EcoPhylo workflow. But every good thing leads to more pain, so of course Matt needed a DIFFERENT kinds of SCGs to be used in anvi'o SCG framework, because he found out that not all ribosomal proteins were created equal and their rate of assembly in metagenomes and distributions across MAGs were not identical (as also eloquently covered by Matt Olm and others). So he asked me to change the list of ribosomal proteins we have been using in anvi'o to infer taxonomy from the existing list of things that included these genes:

Ribosomal_S2
Ribosomal_S3_C
Ribosomal_S6
Ribosomal_S7
Ribosomal_S8
Ribosomal_S9
Ribosomal_S11
Ribosomal_S20p
Ribosomal_L1
Ribosomal_L2
Ribosomal_L3
Ribosomal_L4
Ribosomal_L6
Ribosomal_L9_C
Ribosomal_L13
Ribosomal_L16
Ribosomal_L17
Ribosomal_L20
Ribosomal_L21p
Ribosomal_L22
ribosomal_L24
Ribosomal_L27A

But Matt asked for the removal of the following ones from this list:

Ribosomal_L6: domain duplication, detected twice in genomic datasets
Ribosomal_S3_C: C terminus domain only, not full length
Ribosomal_L9_C: C terminus domain only, not full length
Ribosomal_S20p: low detection of GTDB bacteria
ribosomal_L24: low detection of GTDB bacteria

And the inclusion of these:

Ribosomal_L5: Covers Bacteria GTDB very well
Ribosomal_L19: well detected in human gut, human oral, GTDB Bacteria
Ribosomal_S15: well detected in human gut, human oral
Ribosomal_S16: well detected in human gut
Ribosomal_S2: well detected in human oral
Ribosomal_S11: well detected in oceans
Ribosomal_L14: well detected in oceans
Ribosomal_S8: well detected in oceans, GTDB bacteria AND archaea

With this request multiple changes needed to be made in anvi'o codebase, including the way we generate anvi'o search databases for each SCG so we can benefit from GTDB taxonomy.

How were we doing this?

To generate search databases for these genes, we were using GTDB's multiple sequencing alignments. Essentially GTDB team would use a set of HMMs to find homologous genes for generally single copy-core genes, align them all, and share them publicly in their FTP servers. And we would find which HMMs in our SCG collections would correspond to the HMMs they used to identify their genes, and associate the aligned sequences we obtain from GTDB with those HMMs.

For this ardous task, we had this YAML file that would be populated by anvi'o developers, and used by the program anvi-setup-scg-taxonomy:

(...)
v214.1:
  base_url: https://data.ace.uq.edu.au/public/gtdb/data/releases/release214/214.1
  files:
    VERSION: VERSION.txt
    MSA_ARCHAEA.tar.gz: genomic_files_reps/ar53_msa_marker_genes_reps_r214.tar.gz
    MSA_BACTERIA.tar.gz: genomic_files_reps/bac120_msa_marker_genes_reps_r214.tar.gz
    TAX_ARCHAEA.tsv: ar53_taxonomy_r214.tsv
    TAX_BACTERIA.tsv: bac120_taxonomy_r214.tsv
  genes:
    Ribosomal_S2:
      - ar53_r214_reps_TIGR01012.faa
      - bac120_r214_reps_TIGR01011.faa
    Ribosomal_S3_C:
      - ar53_r214_reps_TIGR01008.faa
      - bac120_r214_reps_TIGR01009.faa
    Ribosomal_S6:
      - bac120_r214_reps_TIGR00166.faa
    Ribosomal_S7:
      - ar53_r214_reps_TIGR01028.faa
      - bac120_r214_reps_TIGR01029.faa
    Ribosomal_S8:
      - ar53_r214_reps_PF00410.20.faa
      - bac120_r214_reps_PF00410.20.faa
    Ribosomal_S9:
      - ar53_r214_reps_TIGR03627.faa
      - bac120_r214_reps_PF00380.20.faa
    Ribosomal_S11:
      - ar53_r214_reps_TIGR03628.faa
      - bac120_r214_reps_TIGR03632.faa
    Ribosomal_S20p:
      - bac120_r214_reps_TIGR00029.faa
    Ribosomal_L1:
      - ar53_r214_reps_PF00687.22.faa
      - bac120_r214_reps_TIGR01169.faa
    Ribosomal_L2:
      - ar53_r214_reps_TIGR01171.faa
      - bac120_r214_reps_TIGR01171.faa
    Ribosomal_L3:
      - ar53_r214_reps_TIGR03626.faa
      - bac120_r214_reps_TIGR03625.faa
    Ribosomal_L4:
      - ar53_r214_reps_TIGR03672.faa
      - bac120_r214_reps_TIGR03953.faa
    Ribosomal_L6:
      - bac120_r214_reps_TIGR03654.faa
    Ribosomal_L9_C:
      - bac120_r214_reps_TIGR00158.faa
    Ribosomal_L13:
      - bac120_r214_reps_TIGR01066.faa
    Ribosomal_L16:
      - ar53_r214_reps_TIGR00279.faa
      - bac120_r214_reps_TIGR01164.faa
    Ribosomal_L17:
      - bac120_r214_reps_TIGR00059.faa
    Ribosomal_L20:
      - bac120_r214_reps_TIGR01032.faa
    Ribosomal_L21p:
      - bac120_r214_reps_TIGR00061.faa
    Ribosomal_L22:
      - bac120_r214_reps_TIGR01044.faa
    ribosomal_L24:
      - bac120_r214_reps_TIGR01079.faa
    Ribosomal_L27A:
      - bac120_r214_reps_TIGR01071.faa

v207.0:
  base_url: https://data.ace.uq.edu.au/public/gtdb/data/releases/release207/207.0
  files:
    VERSION: VERSION.txt
    MSA_ARCHAEA.tar.gz: genomic_files_reps/ar53_msa_marker_genes_reps_r207.tar.gz
    MSA_BACTERIA.tar.gz: genomic_files_reps/bac120_msa_marker_genes_reps_r207.tar.gz
    TAX_ARCHAEA.tsv: ar53_taxonomy_r207.tsv
    TAX_BACTERIA.tsv: bac120_taxonomy_r207.tsv
  genes:
    Ribosomal_S2:
      - ar53_r207_reps_TIGR01012.faa
(...)

This file described not only the exact names and locations of files on the GTDB FTP, but also the genes that will match to Ribosomal_S2 in our HMM collection were stored in files ar53_r214_reps_TIGR01012.faa and bac120_r214_reps_TIGR01011.faa because we figured out that PFAM model for Ribisomal S2 was related to TIGR01012 and TIGR01011.

The current version of this YAML file, GTDB-RELEASES.yaml, is here.

But there were significant problems with this approach. First, there were not MSA for every model we were interested in. Second, there were a lot of truly poor sequences in those aggregated data and while it was understandable to have them in a public resource (since those sequences did occur in genomes GTDB served), it was not appropriate for us to blindly include in our workflows that had to have higher standards for the accuracy of downstream analyses.

How are we going to do it going forward?

We came to the realization that instead of using GTDB MSA files for sequences that match to an arbitrary set of marker genes, we should identify sequences that match to our models from GTDB genomes directly. This idea (brilliance of which comes from its simplicity) emerged and was executed by @mschecht and @FlorianTrigodet, and we decided that the following workflow was much better than what we were doing with over-engineered code that included YAML files, explicit associations between models across different sources, user-side downloads, filtering of short or messy sequences, removal of gaps, etc:

Download all GTDB genomes on your server.
Generate anvi'o contigs-db files for each one of them using anvi-gen-contigs-database (during which identify genes and all).
Run anvi-run-hmms on each contigs-db.
Run anvi-get-seqeunces-for-hmm-hits to get all the sequences of interest for any given model.
In parallel, generate a two-column, TAB-delimited ACCESSION_TO_TAXONOMY.txt.gz file from the GTDB data where the first column is the accession ID and the second column is the semi-colon separated list of taxon names from phylum to species.
Manually move the compressed FASTA files (named identically to HMM models) as well as the ACCESSION_TO_TAXONOMY.txt.gz file under anvio/data/misc/SCG_TAXONOMY/GTDB.

Of course, now there are all the other things to consider. Such as the removal of all the code and files for the previous way of doing the SCG taxonomy, keeping up with versions, and making sure this significant change in the SCG collections will not ruin our existing contigs-db files with SCGs run.

There will be a PR to address these things that will be citing this issue, but the purpose of this issue is to take some notes so we can go back to later or discuss to improve in the long run.

Getting files prepared

Florian turned all GTDB genomes into contigs-db files. We have all the anvi'o recognized HMMs for bacterial single-copy core genes here:

$ awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt
Adenylsucc_synt
ADK
AICARFT_IMPCHas
ATP-synt
ATP-synt_A
Chorismate_synt
EF_TS
eIF-1a
Exonuc_VII_L
GrpE
Ham1p_like
IPPT
OSCP
Pept_tRNA_hydro
PGK
RBFA
RecO_C
Ribonuclease_P
Ribosomal_L1
Ribosomal_L2
Ribosomal_L13
Ribosomal_L14
Ribosomal_L16
Ribosomal_L17
Ribosomal_L18p
Ribosomal_L19
Ribosomal_L20
Ribosomal_L21p
Ribosomal_L22
Ribosomal_L23
ribosomal_L24
Ribosomal_L27
Ribosomal_L27A
Ribosomal_L28
Ribosomal_L29
Ribosomal_L3
Ribosomal_L32p
Ribosomal_L35p
Ribosomal_L4
Ribosomal_L5
Ribosomal_L6
Ribosomal_L9_C
Ribosomal_S10
Ribosomal_S11
Ribosomal_S13
Ribosomal_S15
Ribosomal_S16
Ribosomal_S17
Ribosomal_S19
Ribosomal_S2
Ribosomal_S3_C
Ribosomal_S6
Ribosomal_S7
Ribosomal_S8
Ribosomal_S9
Ribosomal_S20p
Ribosom_S12_S23
RNA_pol_L
RNA_pol_Rpb6
RRF
RsfS
RuvX
SecE
SecG
SecY
SmpB
tRNA-synt_1d
tRNA_m1G_MT
TsaE
UPF0054
YajC

So now, we get all the sequences from our contigs-db files from GTDB:


for model in `awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt`
do
    # ofc it is better to 'clusterize' this step and run them on an HPC
    anvi-get-sequences-for-hmm-hits -e external-genomes.txt \
                                    --hmm-source Bacteria_71 \
                                    --gene-name $model \
                                    --get-aa-sequences \
                                    --return-best-hit \
                                    --output-file $model
done

Where external-genomes.txt is an external-genomes-txt artifact to show where GTDB genomes are. This essentially gives the following FASTA files for each model.

And important step is to simplify the headers of the sequences in these FASTA files, which look like this since anvi-get-seqeunces-for-hmm-hits reports everything:

$ head Ribosomal_L14

>Ribosomal_L14___Bacteria_71___7ec547eee5474fa1eb0a6a2b788917d802e1b68043360e9281ba9441 bin_id:GCA_000007325_1|source:Bacteria_71|e_value:1.9e-55|contig:hash3ec68ff8_GCA_000007325_1_000000000001|gene_callers_id:hash3ec68ff8_126|start:136293|stop:136662|length:369
MVQQQTILNVADNSGAKKLMVIRVLGGSRKRFGKIGDIVVASVKEAIPGGNVKKGDIVKAVIVRTRKETRRDDGSYIKFDDNAGVVINNNNEPRATRIFGPVARELRARNFMKILSLAIE
VI
>Ribosomal_L14___Bacteria_71___e50700fcd71c23c1458317feed301b0be14f7a33521441b651d2aee8 bin_id:GCA_000008085_1|source:Bacteria_71|e_value:3.2e-34|contig:hash8955e19d_GCA_000008085_1_000000000001|gene_callers_id:hash8955e19d_97|start:87068|stop:87470|length:402
MKPISASIVRALPVGAYLNVADNSGAKVVKLIAVKGYKGRKRRLAKAGIADLVIVSVRDGKPDMIGQIFKAVVVRMKKEWRRRDGTRIKFEDNAVALLKDDYGTPKGTIIKTPIAKEVAE
RWPDLAKIARIIV
>Ribosomal_L14___Bacteria_71___567266c9e9cf5b6a6eaef01b1cfd8c5023edcd4a662b9feb45917cdb bin_id:GCA_000008885_1|source:Bacteria_71|e_value:6e-55|contig:hashd0fb4fc1_GCA_000008885_1_000000000001|gene_callers_id:hashd0fb4fc1_582|start:630920|stop:631292|length:372
MIQVQTILSVADNSGARLVMCIKVLGGSRRRYAKISDIIKIAVKEAIPRAKVKKGEVLKAVIVRTKKNLHRSDGSVLRFDKNACVLLNDSTEQPIGTRIFGPVTRELRTEKFMKIISLAP
EVL

But what we want is a defline that ONLY shows the bin_id information, so the sequence has an exact match in our accession ID to taxonomy file. The next step is a clean up for the deflines:

for model in `awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt`
do
    # this shouldn't take more than a few seconds
    sed -i '' 's/>.*bin_id:/>/g' $model
    sed -i '' 's/|.*$//g' $model
done

Now things look much better (and the deflines match to the first column entries in ACCESSION_TO_TAXONOMY.txt.gz):

>GCA_000007325_1
MVQQQTILNVADNSGAKKLMVIRVLGGSRKRFGKIGDIVVASVKEAIPGGNVKKGDIVKAVIVRTRKETRRDDGSYIKFDDNAGVVINNNNEPRATRIFGPVARELRARNFMKILSLAIE
VI
>GCA_000008085_1
MKPISASIVRALPVGAYLNVADNSGAKVVKLIAVKGYKGRKRRLAKAGIADLVIVSVRDGKPDMIGQIFKAVVVRMKKEWRRRDGTRIKFEDNAVALLKDDYGTPKGTIIKTPIAKEVAE
RWPDLAKIARIIV
>GCA_000008885_1
MIQVQTILSVADNSGARLVMCIKVLGGSRRRYAKISDIIKIAVKEAIPRAKVKKGEVLKAVIVRTKKNLHRSDGSVLRFDKNACVLLNDSTEQPIGTRIFGPVTRELRTEKFMKIISLAP
EVL

Next, we simply compress them:

for model in `awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt`
do
    gzip $model
done

At the end of which we have this directory with all the files:


$ ls

ADK.gz              GrpE.gz             RNA_pol_Rpb6.gz     Ribosomal_L16.gz   Ribosomal_L23.gz   Ribosomal_L4.gz    Ribosomal_S16.gz   Ribosomal_S8.gz  TsaE.gz
AICARFT_IMPCHas.gz  Ham1p_like.gz       RRF.gz              Ribosomal_L17.gz   Ribosomal_L27.gz   Ribosomal_L5.gz    Ribosomal_S17.gz   Ribosomal_S9.gz  UPF0054.gz
ATP-synt.gz         IPPT.gz             RecO_C.gz           Ribosomal_L18p.gz  Ribosomal_L27A.gz  Ribosomal_L6.gz    Ribosomal_S19.gz   RsfS.gz          YajC.gz
ATP-synt_A.gz       OSCP.gz             Ribonuclease_P.gz   Ribosomal_L19.gz   Ribosomal_L28.gz   Ribosomal_L9_C.gz  Ribosomal_S2.gz    RuvX.gz          eIF-1a.gz
Adenylsucc_synt.gz  PGK.gz              Ribosom_S12_S23.gz  Ribosomal_L2.gz    Ribosomal_L29.gz   Ribosomal_S10.gz   Ribosomal_S20p.gz  SecE.gz          ribosomal_L24.gz
Chorismate_synt.gz  Pept_tRNA_hydro.gz  Ribosomal_L1.gz     Ribosomal_L20.gz   Ribosomal_L3.gz    Ribosomal_S11.gz   Ribosomal_S3_C.gz  SecG.gz          tRNA-synt_1d.gz
EF_TS.gz            RBFA.gz             Ribosomal_L13.gz    Ribosomal_L21p.gz  Ribosomal_L32p.gz  Ribosomal_S13.gz   Ribosomal_S6.gz    SecY.gz          tRNA_m1G_MT.gz
Exonuc_VII_L.gz     RNA_pol_L.gz        Ribosomal_L14.gz    Ribosomal_L22.gz   Ribosomal_L35p.gz  Ribosomal_S15.gz   Ribosomal_S7.gz    SmpB.gz

Of course we only need some of these.

To update the FASTA files under anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/, I first removed EVERYTHING there:

git rm anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/*gz

And then started thinking about how to merge what we had with what is requested to be added. In anvio/constants.py we have the list of default SCGs for taxonomy:

default_scgs_for_taxonomy = ['Ribosomal_S2',
                             'Ribosomal_S3_C',
                             'Ribosomal_S6',
                             'Ribosomal_S7',
                             'Ribosomal_S8',
                             'Ribosomal_S9',
                             'Ribosomal_S11',
                             'Ribosomal_S20p',
                             'Ribosomal_L1',
                             'Ribosomal_L2',
                             'Ribosomal_L3',
                             'Ribosomal_L4',
                             'Ribosomal_L6',
                             'Ribosomal_L9_C',
                             'Ribosomal_L13',
                             'Ribosomal_L16',
                             'Ribosomal_L17',
                             'Ribosomal_L20',
                             'Ribosomal_L21p',
                             'Ribosomal_L22',
                             'ribosomal_L24',
                             'Ribosomal_L27A']

Now we want to exclude the following from that list:

exclude = ['Ribosomal_L6', 'Ribosomal_S3_C', 'Ribosomal_L9_C', 'Ribosomal_S20p', 'ribosomal_L24']

And the inclusion of these:

include = ['Ribosomal_L5', 'Ribosomal_L19', 'Ribosomal_S15', 'Ribosomal_S16', 'Ribosomal_S2', 'Ribosomal_S11', 'Ribosomal_L14', 'Ribosomal_S8']

Let's do it right in Python (becasue I just realized Matt asked models to be included in the final list that are already among the default models):

models = set(default_scgs_for_taxonomy) - set(exclude)
models.update()

So our final list of models are the following:

print(' '.join(models))

Ribosomal_L22 Ribosomal_L19 Ribosomal_L2 Ribosomal_L21p Ribosomal_S7 Ribosomal_S8 Ribosomal_L14 Ribosomal_L1 Ribosomal_L4 Ribosomal_L5 Ribosomal_S16 Ribosomal_L3 Ribosomal_L16 Ribosomal_S6 Ribosomal_S15 Ribosomal_L20 Ribosomal_L17 Ribosomal_S11 Ribosomal_L13 Ribosomal_S9 Ribosomal_S2 Ribosomal_L27A

Easy to copy pasta and export into a variable in BASH to minimize typos:

export models="Ribosomal_L22 Ribosomal_L19 Ribosomal_L2 Ribosomal_L21p Ribosomal_S7 Ribosomal_S8 Ribosomal_L14 Ribosomal_L1 Ribosomal_L4 Ribosomal_L5 Ribosomal_S16 Ribosomal_L3 Ribosomal_L16 Ribosomal_S6 Ribosomal_S15 Ribosomal_L20 Ribosomal_L17 Ribosomal_S11 Ribosomal_L13 Ribosomal_S9 Ribosomal_S2 Ribosomal_L27A"

And copy all the relevant FASTA files from GTDB genomes, wherever they are, to their new location:

for model in $models
do
    cp $model.gz ~/github/anvio/anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/
done

The rest of the changes for this big update are in the codebase, and described in the PR #2210. Future updates will only require dealing with the GTDB files, and updating the contents of the VERSION file.

merenlab / anvio