multiple sequence alignment

joey711 / phyloseq

phyloseq is a set of classes, wrappers, and tools (in R) to make it easier to import, store, and analyze phylogenetic sequencing data; and to reproducibly share that data and analysis with others. See the phyloseq front page:


Hello everyone,

This is my first commit and although I have been searching for answers around the web, I do apologize if this is redundant!

From preliminary 16S microbiome relative abundance data, I find that the microbiomes of some samples are ~90% dominated by the same Family. I am curious whether the presence of specific SSVs of this Family is correlated with certain environmental characteristics, and further whether there is a specific pattern in the 16S V4 region that follows such a correlation.

To do so, I would like to subset this Family from all of my samples, align all of its 250 bp sequences, and look for SNPs that may correlate with factors in my environmental metadata (e.g. host genus, habitat type, etc.). Also, using DADA2 I have been struggling to get sequences of equal length after filterAndTrim, but is it even necessary to have sequences of equal length?

Does anyone have any suggestions on workflows/packages? Or comments/criticisms on this idea? I know that full genomes are probably more telling for this sort of question, but 16S is what I've got!

Cheers and many thanks

ASVs from DADA2 will have length variation due to insertions and deletions through evolution of the region, so you don't need them to be of equal length. How much length variation there is depends on the primer region. I've mostly worked with V4V5, where there is usually fairly little variation but still some (e.g., most ASVs are 252 bp, but some are 253 bp).

There are lots of programs and R packages that will do alignment. I find DECIPHER pretty easy, especially with its vignette "The Art of Multiple Sequence Alignment in R". I've only used it as a step toward building a phylogenetic tree, following the F1000 workflow. Note, trees you build from short amplicons might not be very accurate, but might be sufficient for your purposes.
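If you do want to go the SNP route after aligning, the core step is just finding alignment columns where the sequences disagree. Here is a rough sketch in plain Python (not DECIPHER or phyloseq code). The ASV names and sequences are made up, `variable_sites` is a hypothetical helper, and it assumes the input has already been aligned so every sequence has the same length, with `-` marking gaps.

```python
def variable_sites(aligned):
    """Return {column_index: {name: character}} for columns with more than one state.

    `aligned` maps a sequence name to its aligned sequence. A gap ('-') is
    counted as its own state, so indel columns are reported too.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("sequences must be aligned to the same length first")
    width = lengths.pop()

    sites = {}
    for i in range(width):
        column = {name: seq[i] for name, seq in aligned.items()}
        if len(set(column.values())) > 1:
            sites[i] = column
    return sites


# Made-up toy alignment, for illustration only.
aligned = {
    "asv_a": "ACGT-ACGT",
    "asv_b": "ACGTTACGT",
    "asv_c": "ACCT-ACGT",
}

sites = variable_sites(aligned)
for position, column in sorted(sites.items()):
    print(position, column)
```

Each reported column could then be cross-tabulated against a metadata factor such as habitat type. Whether a gap should count as a variant is a judgment call; here it does.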

I think it is more common for people to obtain a phylogenetic tree and do a tree-based analysis, rather than a SNP-based analysis, for 16S amplicons, though that isn't to say that a SNP-based analysis wouldn't be useful. With a tree in hand, you could then use software such as Phylofactor to look for clades that change between environments.

But you might first try looking at genera or ASVs w/in your Family of interest and seeing if anything pops out. E.g., make a heatmap w/ plot_heatmap() with the samples grouped by environmental category, or use DESeq, or whatever method you like, and look at the results subset to the family you are interested in. There aren't really any rules about how to do this. For example, you could try normalizing to proportions first, then subset to your family, then call plot_heatmap(); or subset to the family first, then normalize to proportions, then call plot_heatmap(). These show you different things: the first gives each ASV's share of the whole community, the second its share within the family. The second might tell you that an ASV is becoming the dominant member in the family in environment 2, but not necessarily dominant overall (b/c the family might be less abundant in environment 2).
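To make the difference between the two orders concrete, here is a small worked example in plain Python (not phyloseq code). The counts, ASV names, and family labels are invented, and `to_proportions` and `subset_family` are hypothetical helpers standing in for the normalize and subset steps.

```python
def to_proportions(table):
    """Divide every count by its sample (column) total.

    Assumes every sample has a nonzero total.
    """
    n_samples = len(next(iter(table.values())))
    totals = [sum(row[j] for row in table.values()) for j in range(n_samples)]
    return {
        asv: [row[j] / totals[j] for j in range(n_samples)]
        for asv, row in table.items()
    }


def subset_family(table, families, family):
    """Keep only the ASVs assigned to the given family."""
    return {asv: row for asv, row in table.items() if families[asv] == family}


# Invented toy data: one count per sample, ordered [environment 1, environment 2].
counts = {"asv1": [60, 5], "asv2": [30, 15], "asv3": [10, 80]}
families = {"asv1": "FamA", "asv2": "FamA", "asv3": "FamB"}

# Order 1: normalize to proportions, then subset to the family.
overall = subset_family(to_proportions(counts), families, "FamA")

# Order 2: subset to the family, then normalize to proportions.
within_family = to_proportions(subset_family(counts, families, "FamA"))

print("share of whole community:", overall)
print("share within FamA:       ", within_family)
```

In this toy table asv2 makes up 75% of FamA in environment 2 but only 15% of the whole community there, because FamA itself has dropped from 90% to 20% of the reads. In phyloseq the two steps would typically be subset_taxa() and transform_sample_counts(), applied in whichever order matches the question you are asking.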


multiple sequence alignment #1121