Silva database? - Githubissues

davidvilanova commented 1 year ago

Hi, Thanks for the gtdb-taxdump. I´m working on Silva , is there any available tax dump ?

shenwei356 commented 1 year ago

No existing one, but you can create it by yourself following the example, using the binaries here.

davidvilanova commented 1 year ago

Great,

I have tried so far, see shot below.

The taxonomy is two columns , fist column is accession and seconds columns are taxonomy as follows:

129138 Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali

What is wrong ? there are two columns in the taxonomy file.

shenwei356 commented 1 year ago

Taxa should be tab-separated, as shown in Example 4.

Prepare lineages

$ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
    | sed 's/;/\t/g' \
    | cut -f 2-8
Bacteria        Proteobacteria  Gammaproteobacteria     Pseudomonadales Pseudomonadaceae        Pseudomonas     Pseudomonas amygdali

Create taxdump

$ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
    | sed 's/;/\t/g' \
    | cut -f 2-8 \
    | taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump

Test

$ taxonkit list -n -r --ids 1 --data-dir taxdump/
1 [no rank] root
  609216830 [superkingdom] Bacteria
    1641076285 [phylum] Proteobacteria
      329474883 [class] Gammaproteobacteria
        86398254 [order] Pseudomonadales
          1478401337 [famil] Pseudomonadaceae
            1616653803 [genus] Pseudomonas
              691281667 [species] Pseudomonas amygdali

shenwei356 commented 1 year ago

If you need to map the Silver TaxIDs to the created ones.

$ echo -e "129138\tBacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas amygdali" \
    | sed 's/;/\t/g' \
    | taxonkit create-taxdump -R "superkingdom,phylum,class,order,famil,genus,species" -O taxdump2 -A 1

$ tree taxdump2/
taxdump2/
├── delnodes.dmp
├── merged.dmp
├── names.dmp
├── nodes.dmp
└── taxid.map

$ cat taxdump2/taxid.map
129138  691281667

davidvilanova commented 1 year ago

Great !!! I have generated the database, can i use now with taxonkit -lca ??

shenwei356 commented 1 year ago

Sure, please follow the usage and examples.

davidvilanova commented 1 year ago

I have the following problem. I have added a new database using the workflow above and generated the Silva to TaxIDs.

For a particular id (1824050977) i don´t have a corresponding mapping (see shot). However it is in names.dmp.

The taxid map was generated the above post with option "-A 1"

Why it is missing ?

shenwei356 commented 1 year ago

It just maps custom IDs to TaxIDs of the taxa of the lowest rank, e.g. species.

If you need to map to taxa of other ranks, like the genus Staphylococcus. Here's the way, csvtk is needed.

cd taxdmp;

cat taxid.map \
    | taxonkit lineage --data-dir . -i 2 -t -d , \
    | cut -f 1,4 \
    | csvtk unfold -Ht -f 2 -s ,

129138  609216830
129138  1641076285
129138  329474883
129138  86398254
129138  1478401337
129138  1616653803
129138  691281667

davidvilanova commented 1 year ago

Great, One more comment. When using all ranks i will have a problem to get a proper taxonkit lca assignment because by default the upper levels will be picked up. That mean for a particular list of ids all with get a Bacteria assigment. Don´t know if that can be fixed ?

In this case for example the 609216830 that is a Streptoccus is matched to Bacteria. I would expect an assignation to the genus family/genus level at least.

shenwei356 commented 1 year ago

So, mapping only IDs to taxa of species rank is reasonable.

Back to the previous concern, why did you want map IDs to the genus Staphylococcus?

For a particular id (1824050977) i don´t have a corresponding mapping (see shot)

Mapping to species is enough, cause the complete lineage of the species can be retrieved with the taxid of the species.

Could you please show how you query the LCA? What were the TaxIDs used? What's the direct purpose?

davidvilanova commented 1 year ago

Yes sure, What i´m trying to achieve is to annotate with Silva a set of nanopore reads. For that purpose i´m mapping one metagenomic sample reads to the Silva 16S database with minimap allowing for error tolerance as nanopore reads tend to give an error rate. For each read mapped to silva i get the first 10 hits.

So for each read i use LCA to assign the upper common hit from the 10 hits, so for one read i get one hit.

Once i get that hit i would like to resolve the taxonomy (phylum,class...order.....species)

For some reads the resolution (because of the error rate) cannot be achieved at the species level, that´s why in some cases(like the Staphyloccocus case above) the best hit (output from LCA) is a genus. But could be family or order or class or phyla.

Does this makes sense?

shenwei356 commented 1 year ago

So for each read i use LCA to assign the upper common hit from the 10 hits, so for one read i get one hit.

Good, here you've assigned LCA to each read.

Once i get that hit i would like to resolve the taxonomy (phylum,class...order.....species)

No, you can just use taxonkit lineage or taxonkit reformat -I to retrieve lineage via the LCAs, no matter what the rank they are. There's no need to query with the taxid.map file.

davidvilanova commented 1 year ago

Thanks for the update, there is one strange behaviour for lca. I was expecting LCA to go up to the staphylococcus genus level but it went up to bacilli. I assume something is wrong with my database.

Here is the pipeline to build the database:

wget https://www.arb-silva.de/fileadmin/silva_databases/release_138.1/Exports/SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz

gunzip  SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz 
infile="SILVA_138.1_SSURef_NR99_tax_silva.fasta"

#Remove unwanted taxa
zcat  ${infile} | grep ">" |grep  -v 'Archaea\Eukaryota\|unidentified\|metagenomes\|environmental samples' >tab_silva_taxonomy

sed -i 's/>//g' tab_silva_taxonomy #replace ">"
sed -i 's/ /@@/' tab_silva_taxonomy # replace first space with @@
sed -i 's/ /_/g' tab_silva_taxonomy # replace all spaces with "_" mainly on species names
sed -i 's/@@/'$'\t''/' tab_silva_taxonomy # restore tab after accession
sed -i 's/;/'$'\t''/g' tab_silva_taxonomy # replace commas with semicolons
awk 'NF>7' tab_silva_taxonomy | cut -f 1-8 > tmp
mv tmp tab_silva_taxonomy
taxonkit create-taxdump tab_silva_taxonomy -O taxdump --force -RR "superkingdom,phylum,class,order,family,genus,species" -A 1

Don´t what is wrong...

shenwei356 commented 1 year ago

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | taxonkit lineage  --data-dir taxdump/ -t \
    | cut -f 3
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;813944714;95949142;368069282;1202746109
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;1956460793
609216830;1494978361;1845768359;813944714;671290804;690796498;883645126
609216830;1494978361;1845768359;422354816;1997712377;1824050977;485431882

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | taxonkit lineage  --data-dir taxdump/ -t \
    | cut -f 2
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Bacillales;Planococcaceae;Sporosarcina;Staphylococcus_saprophyticus
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sciuri
Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus;Streptococcus_pneumoniae
Bacteria;Firmicutes;Bacilli;Staphylococcales;Staphylococcaceae;Staphylococcus;Staphylococcus_sp._mixed_culture_J3-

$ echo 883645126 1202746109 883645126 1956460793 883645126 1956460793 883645126 485431882 \
    | sed -E 's/\s+/\n/g' \
    | sort | uniq  \
    | axonkit reformat --data-dir taxdump/ -I 1 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' \
    | csvtk add-header -t -n "taxid,kingdom,phylum,class,order,family,genus,species"  \
    | csvtk pretty -t
taxid        kingdom    phylum       class     order              family              genus            species
----------   --------   ----------   -------   ----------------   -----------------   --------------   -------------------------------------
1202746109   Bacteria   Firmicutes   Bacilli   Bacillales         Planococcaceae      Sporosarcina     Staphylococcus_saprophyticus
1956460793   Bacteria   Firmicutes   Bacilli   Staphylococcales   Staphylococcaceae   Staphylococcus   Staphylococcus_sciuri
485431882    Bacteria   Firmicutes   Bacilli   Staphylococcales   Staphylococcaceae   Staphylococcus   Staphylococcus_sp._mixed_culture_J3-3
883645126    Bacteria   Firmicutes   Bacilli   Bacillales         Bacillaceae         Bacillus         Streptococcus_pneumoniae

The LCA seems to be 1845768359 (Bacilli), without a doubt.

BTW, the process could be simplified.

    seqkit seq -n SILVA_138.1_SSURef_NR99_tax_silva.fasta.gz \
        | grep -v 'Archaea\|Eukaryota\|unidentified\|metagenomes\|environmental samples' \
        | sed 's/ /\t/' \
        | sed 's/;/\t/g' \
        | awk -F '\t' 'NF > 7' \
        > silva_taxonomy.tsv

    taxonkit create-taxdump silva_taxonomy.tsv -O taxdump \
        --force -R "superkingdom,phylum,class,order,family,genus,species" -A 1

shenwei356 / gtdb-taxdump

Silva database? #5