So we recently decided to change the collection of SCGs in anvi'o that are used to infer taxonomy of genomes rapidly. The SCG data are coming from the GTDB, and the purpose and the utility of this functionality is explained here. The bottom-line is that it helps a lot during binning to have that real time taxonomy estimation:
But then we found other uses for the SCG taxonomy framework, including its use in @mschecht's EcoPhylo workflow. But every good thing leads to more pain, so of course Matt needed a DIFFERENT kinds of SCGs to be used in anvi'o SCG framework, because he found out that not all ribosomal proteins were created equal and their rate of assembly in metagenomes and distributions across MAGs were not identical (as also eloquently covered by Matt Olm and others). So he asked me to change the list of ribosomal proteins we have been using in anvi'o to infer taxonomy from the existing list of things that included these genes:
Ribosomal_S2
Ribosomal_S3_C
Ribosomal_S6
Ribosomal_S7
Ribosomal_S8
Ribosomal_S9
Ribosomal_S11
Ribosomal_S20p
Ribosomal_L1
Ribosomal_L2
Ribosomal_L3
Ribosomal_L4
Ribosomal_L6
Ribosomal_L9_C
Ribosomal_L13
Ribosomal_L16
Ribosomal_L17
Ribosomal_L20
Ribosomal_L21p
Ribosomal_L22
ribosomal_L24
Ribosomal_L27A
But Matt asked for the removal of the following ones from this list:
Ribosomal_L6: domain duplication, detected twice in genomic datasets
Ribosomal_S3_C: C terminus domain only, not full length
Ribosomal_L9_C: C terminus domain only, not full length
Ribosomal_S20p: low detection of GTDB bacteria
ribosomal_L24: low detection of GTDB bacteria
And the inclusion of these:
Ribosomal_L5: Covers Bacteria GTDB very well
Ribosomal_L19: well detected in human gut, human oral, GTDB Bacteria
Ribosomal_S15: well detected in human gut, human oral
Ribosomal_S16: well detected in human gut
Ribosomal_S2: well detected in human oral
Ribosomal_S11: well detected in oceans
Ribosomal_L14: well detected in oceans
Ribosomal_S8: well detected in oceans, GTDB bacteria AND archaea
With this request multiple changes needed to be made in anvi'o codebase, including the way we generate anvi'o search databases for each SCG so we can benefit from GTDB taxonomy.
How were we doing this?
To generate search databases for these genes, we were using GTDB's multiple sequencing alignments. Essentially GTDB team would use a set of HMMs to find homologous genes for generally single copy-core genes, align them all, and share them publicly in their FTP servers. And we would find which HMMs in our SCG collections would correspond to the HMMs they used to identify their genes, and associate the aligned sequences we obtain from GTDB with those HMMs.
For this ardous task, we had this YAML file that would be populated by anvi'o developers, and used by the program anvi-setup-scg-taxonomy:
This file described not only the exact names and locations of files on the GTDB FTP, but also the genes that will match to Ribosomal_S2 in our HMM collection were stored in files ar53_r214_reps_TIGR01012.faa and bac120_r214_reps_TIGR01011.faa because we figured out that PFAM model for Ribisomal S2 was related to TIGR01012 and TIGR01011.
The current version of this YAML file, GTDB-RELEASES.yaml, is here.
But there were significant problems with this approach. First, there were not MSA for every model we were interested in. Second, there were a lot of truly poor sequences in those aggregated data and while it was understandable to have them in a public resource (since those sequences did occur in genomes GTDB served), it was not appropriate for us to blindly include in our workflows that had to have higher standards for the accuracy of downstream analyses.
How are we going to do it going forward?
We came to the realization that instead of using GTDB MSA files for sequences that match to an arbitrary set of marker genes, we should identify sequences that match to our models from GTDB genomes directly. This idea (brilliance of which comes from its simplicity) emerged and was executed by @mschecht and @FlorianTrigodet, and we decided that the following workflow was much better than what we were doing with over-engineered code that included YAML files, explicit associations between models across different sources, user-side downloads, filtering of short or messy sequences, removal of gaps, etc:
In parallel, generate a two-column, TAB-delimited ACCESSION_TO_TAXONOMY.txt.gz file from the GTDB data where the first column is the accession ID and the second column is the semi-colon separated list of taxon names from phylum to species.
Manually move the compressed FASTA files (named identically to HMM models) as well as the ACCESSION_TO_TAXONOMY.txt.gz file under anvio/data/misc/SCG_TAXONOMY/GTDB.
Of course, now there are all the other things to consider. Such as the removal of all the code and files for the previous way of doing the SCG taxonomy, keeping up with versions, and making sure this significant change in the SCG collections will not ruin our existing contigs-db files with SCGs run.
There will be a PR to address these things that will be citing this issue, but the purpose of this issue is to take some notes so we can go back to later or discuss to improve in the long run.
Getting files prepared
Florian turned all GTDB genomes into contigs-db files. We have all the anvi'o recognized HMMs for bacterial single-copy core genes here:
So now, we get all the sequences from our contigs-db files from GTDB:
for model in `awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt`
do
# ofc it is better to 'clusterize' this step and run them on an HPC
anvi-get-sequences-for-hmm-hits -e external-genomes.txt \
--hmm-source Bacteria_71 \
--gene-name $model \
--get-aa-sequences \
--return-best-hit \
--output-file $model
done
Where external-genomes.txt is an external-genomes-txt artifact to show where GTDB genomes are. This essentially gives the following FASTA files for each model.
And important step is to simplify the headers of the sequences in these FASTA files, which look like this since anvi-get-seqeunces-for-hmm-hits reports everything:
$ head Ribosomal_L14
>Ribosomal_L14___Bacteria_71___7ec547eee5474fa1eb0a6a2b788917d802e1b68043360e9281ba9441 bin_id:GCA_000007325_1|source:Bacteria_71|e_value:1.9e-55|contig:hash3ec68ff8_GCA_000007325_1_000000000001|gene_callers_id:hash3ec68ff8_126|start:136293|stop:136662|length:369
MVQQQTILNVADNSGAKKLMVIRVLGGSRKRFGKIGDIVVASVKEAIPGGNVKKGDIVKAVIVRTRKETRRDDGSYIKFDDNAGVVINNNNEPRATRIFGPVARELRARNFMKILSLAIE
VI
>Ribosomal_L14___Bacteria_71___e50700fcd71c23c1458317feed301b0be14f7a33521441b651d2aee8 bin_id:GCA_000008085_1|source:Bacteria_71|e_value:3.2e-34|contig:hash8955e19d_GCA_000008085_1_000000000001|gene_callers_id:hash8955e19d_97|start:87068|stop:87470|length:402
MKPISASIVRALPVGAYLNVADNSGAKVVKLIAVKGYKGRKRRLAKAGIADLVIVSVRDGKPDMIGQIFKAVVVRMKKEWRRRDGTRIKFEDNAVALLKDDYGTPKGTIIKTPIAKEVAE
RWPDLAKIARIIV
>Ribosomal_L14___Bacteria_71___567266c9e9cf5b6a6eaef01b1cfd8c5023edcd4a662b9feb45917cdb bin_id:GCA_000008885_1|source:Bacteria_71|e_value:6e-55|contig:hashd0fb4fc1_GCA_000008885_1_000000000001|gene_callers_id:hashd0fb4fc1_582|start:630920|stop:631292|length:372
MIQVQTILSVADNSGARLVMCIKVLGGSRRRYAKISDIIKIAVKEAIPRAKVKKGEVLKAVIVRTKKNLHRSDGSVLRFDKNACVLLNDSTEQPIGTRIFGPVTRELRTEKFMKIISLAP
EVL
But what we want is a defline that ONLY shows the bin_id information, so the sequence has an exact match in our accession ID to taxonomy file. The next step is a clean up for the deflines:
for model in `awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt`
do
# this shouldn't take more than a few seconds
sed -i '' 's/>.*bin_id:/>/g' $model
sed -i '' 's/|.*$//g' $model
done
Now things look much better (and the deflines match to the first column entries in ACCESSION_TO_TAXONOMY.txt.gz):
>GCA_000007325_1
MVQQQTILNVADNSGAKKLMVIRVLGGSRKRFGKIGDIVVASVKEAIPGGNVKKGDIVKAVIVRTRKETRRDDGSYIKFDDNAGVVINNNNEPRATRIFGPVARELRARNFMKILSLAIE
VI
>GCA_000008085_1
MKPISASIVRALPVGAYLNVADNSGAKVVKLIAVKGYKGRKRRLAKAGIADLVIVSVRDGKPDMIGQIFKAVVVRMKKEWRRRDGTRIKFEDNAVALLKDDYGTPKGTIIKTPIAKEVAE
RWPDLAKIARIIV
>GCA_000008885_1
MIQVQTILSVADNSGARLVMCIKVLGGSRRRYAKISDIIKIAVKEAIPRAKVKKGEVLKAVIVRTKKNLHRSDGSVLRFDKNACVLLNDSTEQPIGTRIFGPVTRELRTEKFMKIISLAP
EVL
Next, we simply compress them:
for model in `awk '{if(NR>1)print $1}' anvio/data/hmm/Bacteria_71/genes.txt`
do
gzip $model
done
At the end of which we have this directory with all the files:
And then started thinking about how to merge what we had with what is requested to be added. In anvio/constants.py we have the list of default SCGs for taxonomy:
And copy all the relevant FASTA files from GTDB genomes, wherever they are, to their new location:
for model in $models
do
cp $model.gz ~/github/anvio/anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/
done
The rest of the changes for this big update are in the codebase, and described in the PR #2210. Future updates will only require dealing with the GTDB files, and updating the contents of the VERSION file.
So we recently decided to change the collection of SCGs in anvi'o that are used to infer taxonomy of genomes rapidly. The SCG data are coming from the GTDB, and the purpose and the utility of this functionality is explained here. The bottom-line is that it helps a lot during binning to have that real time taxonomy estimation:
But then we found other uses for the SCG taxonomy framework, including its use in @mschecht's EcoPhylo workflow. But every good thing leads to more pain, so of course Matt needed a DIFFERENT kinds of SCGs to be used in anvi'o SCG framework, because he found out that not all ribosomal proteins were created equal and their rate of assembly in metagenomes and distributions across MAGs were not identical (as also eloquently covered by Matt Olm and others). So he asked me to change the list of ribosomal proteins we have been using in anvi'o to infer taxonomy from the existing list of things that included these genes:
But Matt asked for the removal of the following ones from this list:
And the inclusion of these:
With this request multiple changes needed to be made in anvi'o codebase, including the way we generate anvi'o search databases for each SCG so we can benefit from GTDB taxonomy.
How were we doing this?
To generate search databases for these genes, we were using GTDB's multiple sequencing alignments. Essentially GTDB team would use a set of HMMs to find homologous genes for generally single copy-core genes, align them all, and share them publicly in their FTP servers. And we would find which HMMs in our SCG collections would correspond to the HMMs they used to identify their genes, and associate the aligned sequences we obtain from GTDB with those HMMs.
For this ardous task, we had this YAML file that would be populated by anvi'o developers, and used by the program
anvi-setup-scg-taxonomy
:This file described not only the exact names and locations of files on the GTDB FTP, but also the genes that will match to
Ribosomal_S2
in our HMM collection were stored in filesar53_r214_reps_TIGR01012.faa
andbac120_r214_reps_TIGR01011.faa
because we figured out that PFAM model for Ribisomal S2 was related toTIGR01012
andTIGR01011
.The current version of this YAML file,
GTDB-RELEASES.yaml
, is here.But there were significant problems with this approach. First, there were not MSA for every model we were interested in. Second, there were a lot of truly poor sequences in those aggregated data and while it was understandable to have them in a public resource (since those sequences did occur in genomes GTDB served), it was not appropriate for us to blindly include in our workflows that had to have higher standards for the accuracy of downstream analyses.
How are we going to do it going forward?
We came to the realization that instead of using GTDB MSA files for sequences that match to an arbitrary set of marker genes, we should identify sequences that match to our models from GTDB genomes directly. This idea (brilliance of which comes from its simplicity) emerged and was executed by @mschecht and @FlorianTrigodet, and we decided that the following workflow was much better than what we were doing with over-engineered code that included YAML files, explicit associations between models across different sources, user-side downloads, filtering of short or messy sequences, removal of gaps, etc:
ACCESSION_TO_TAXONOMY.txt.gz
file from the GTDB data where the first column is the accession ID and the second column is the semi-colon separated list of taxon names from phylum to species.ACCESSION_TO_TAXONOMY.txt.gz
file underanvio/data/misc/SCG_TAXONOMY/GTDB
.Of course, now there are all the other things to consider. Such as the removal of all the code and files for the previous way of doing the SCG taxonomy, keeping up with versions, and making sure this significant change in the SCG collections will not ruin our existing contigs-db files with SCGs run.
There will be a PR to address these things that will be citing this issue, but the purpose of this issue is to take some notes so we can go back to later or discuss to improve in the long run.
Getting files prepared
Florian turned all GTDB genomes into contigs-db files. We have all the anvi'o recognized HMMs for bacterial single-copy core genes here:
So now, we get all the sequences from our contigs-db files from GTDB:
Where
external-genomes.txt
is an external-genomes-txt artifact to show where GTDB genomes are. This essentially gives the following FASTA files for each model.And important step is to simplify the headers of the sequences in these FASTA files, which look like this since anvi-get-seqeunces-for-hmm-hits reports everything:
But what we want is a defline that ONLY shows the
bin_id
information, so the sequence has an exact match in our accession ID to taxonomy file. The next step is a clean up for the deflines:Now things look much better (and the deflines match to the first column entries in
ACCESSION_TO_TAXONOMY.txt.gz
):Next, we simply compress them:
At the end of which we have this directory with all the files:
Of course we only need some of these.
To update the FASTA files under
anvio/data/misc/SCG_TAXONOMY/GTDB/SCG_SEARCH_DATABASES/
, I first removed EVERYTHING there:And then started thinking about how to merge what we had with what is requested to be added. In
anvio/constants.py
we have the list of default SCGs for taxonomy:Now we want to exclude the following from that list:
And the inclusion of these:
Let's do it right in Python (becasue I just realized Matt asked models to be included in the final list that are already among the default models):
So our final list of models are the following:
Easy to copy pasta and export into a variable in BASH to minimize typos:
And copy all the relevant FASTA files from GTDB genomes, wherever they are, to their new location:
The rest of the changes for this big update are in the codebase, and described in the PR #2210. Future updates will only require dealing with the GTDB files, and updating the contents of the
VERSION
file.