erikrikarddaniel commented 3 years ago

Phylogenetic placement, using Pplacer, is a better way of getting a phylogenetic tree of short amplicons than making a phylogeny from an alignment of the ASVs on their own.

There's a QIIME2 module called q2-fragment-insertion for the purpose, or we can roll our own (see below). The builtin phylogeny in the QIIME2 module is "the 99% OTU Greengenes 13.8 tree with 203,452 tips", but I think one can specify ones own tree. Greengenes is, as you know, outdated.

Rolling our own would allow users to provide an alignment or hmm plus a phylogeny. The builtin would, I suggest, use GTDB species representative genome data. A problem here is that GTDB provides SSU sequences, but not a phylogeny built on them. We would therefore either have to estimate the phylogeny or assume that the protein alignment-based phylogeny is correct and build a reference SSU alignment from the sequences they provide.

Processing outline for GTDB:

Decide which ASVs are archaeal and bacterial respectively. Can be done based on the taxonomic classification I think.
Fetch SSU representative genome SSU sequences for archaea and bacteria respectively.
Fetch the GTDB phylogenies. As far as I can see, only the full phylos are available, so we need to subset to the genomes present in the SSU sequences; this can be done with R.
Align GTDB SSU sequences and ASVs. I suggest we use hmmalign to align to the hmms used e.g. by Barrnap and output only positions that match the profile. *)
Run Pplacer on the subset GTDB phylo and the alignment. *)
Extract a phylogeny and subset to only ASVs

*) Candidates for nf-core modules.

erikrikarddaniel commented 3 years ago

I've done some testing that I'm summarizing here.

Stage the following files: https://raw.githubusercontent.com/tseemann/barrnap/master/db/arc.hmm https://raw.githubusercontent.com/tseemann/barrnap/master/db/bac.hmm https://data.gtdb.ecogenomic.org/releases/latest/genomic_files_all/ssu_all.tar.gz (or the representatives, though quite few archaeal SSU sequences from representative genomes are complete enough). https://data.gtdb.ecogenomic.org/releases/latest/ar122.tree.gz https://data.gtdb.ecogenomic.org/releases/latest/bac120.tree.gz

Extract the 16S hmm from the barrnap multi-hmms and extract archaeal and bacterial respectively from the ssu_all.fna

hmmfetch arc.hmm 16S_rRNA > arc.16S_rRNA.hmm
hmmfetch bac.hmm 16S_rRNA > bac.16S_rRNA.hmm
grep -A 1 'd__Archaea' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > ar122_full.fna
grep -A 1 'd__Bacteria' ssu_all_r95.fna | grep -v '^--' | sed 's/ .*//' > bac120_full.fna

Extract archaeal and bacterial sequences from sequences.fasta using taxonomy.tsv to arcq.fna and bacq.fna respectively (R)

Align the GTDB sequences and the query sequences to the Barrnap hmms:

hmmalign arc.16S_rRNA.hmm ar122_full.fna  | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > ar122_full.alnfna
hmmalign bac.16S_rRNA.hmm bac120_full.fna | esl-alimask --rf-is-mask - | esl-alimanip --rffract 0.5 - | esl-reformat -n afa - > bac120_full.alnfna
hmmalign arc.16S_rRNA.hmm arcq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > arcq.alnfna
hmmalign bac.16S_rRNA.hmm bacq.fna | esl-alimask --rf-is-mask - | esl-reformat -n afa - > bacq.alnfna

Subset trees to the taxa present in alignments
```
#!/usr/bin/env Rscript
```

library(Biostrings) library(ape)

s <- readDNAStringSet('ar122_full.alnfna') t <- read.tree('ar122.tree.gz')

dl <- t$tip.label[!t$tip.label %in% names(s)]

write.tree(drop.tip(t, dl), 'ar122_full.newick')

Repeat for bacteria (separate process)

6. Run [EPA-ng](https://github.com/Pbdas/epa-ng) to place queries. (Supposedly faster, and I could get it to work.)

epa-ng -t ar122_full.newick -s ar122_full.alnfna -q arcq.alnfna -m GTR+G

Repeat for bacteria (separate process)

The above results in a epa_result.jplace file

7. [Gappa](https://github.com/lczech/gappa) can be used to create a tree with both reference and query sequences:

gappa examine graft --jplace-path epa_result.jplace



Gappa can also be used to create Unifrac, aka Kantorovich-Rubinstein (KR), distances and many other things, so the `epa_result.jplace` should be output to `results`.

d4straub commented 1 year ago

Phylogenetic placement is now implemented in nf-core subworkflow fasta_newick_epang_gappa and in https://nf-co.re/phyloplace. Would you like to attempt to integrate it into ampliseq?

erikrikarddaniel commented 1 year ago

Absolutely. There are a couple of things we should discuss regarding the implementation, particularly with regards to reference trees and alignments. I thought we could discuss that when we meet next month.

d4straub commented 1 year ago

Alright!

d4straub commented 1 year ago

This issue was broken into https://github.com/nf-core/ampliseq/issues/561 & https://github.com/nf-core/ampliseq/issues/562, so I close this one here.

nf-core / ampliseq

Phylogenetic placement of ASVs #236

Repeat for bacteria (separate process)

Repeat for bacteria (separate process)

The above results in a epa_result.jplace file