Hierarchical Clustering

darogan commented 7 years ago

This looks great Simon. I have been using Bioconductor:seqTools to achieve something similar for fq files. However, as that takes an age to run, I was wondering if it would be possible to add some hierarchical clustering plus a dendrogram to help identify samples that stand out as having different sequence compositions (e.g. page 15 of http://www.bioconductor.org/packages/release/bioc/vignettes/seqTools/inst/doc/seqTools.pdf). The pheatmap then shows why a particular sample would stand out.

s-andrews commented 7 years ago

Hi Russell - are you wanting to actually get lists of sample ids by splitting the cluster tree which is already at the top of the plot we produce? I'm not sure I'm clear on what you want which isn't covered by the column clustering we do already.

darogan commented 7 years ago

I think what I'm asking for is a simplified version of the heatmap as a dendrogram. Your example.png with two inputs is very clear, but if I'm comparing, say, 20 input files, wouldn't the heatmap get harder to interpret? I'll run compter on one of my data sets; the seqTools version is pasted below to show what I had in mind.

s-andrews commented 7 years ago

So that dendrogram is the same as the one we put at the top of the plot, but turned 90 degrees and spread out more. I guess I'm still working out the use cases for this, but mostly I've just been looking to see if the different groups end up together or not. I've not tried digging down to the level of individual sequences.

The view you showed above wouldn't actually work for our test data, where we have many hundreds of input sequences, so you'd run out of vertical space very quickly and wouldn't be able to see the names. In your example, would you just be looking to see that sequences 9, 5 and 20 formed their own little sub-group?

s-andrews commented 7 years ago

Going back and reading your original comment properly I think we might be talking at cross purposes! What you're doing and what compter does are quite different.

In your case you're analysing the composition of a whole set of sequences (i.e. a fastq file) and then you plot out the sequence sets. You're only generating a single set of frequencies for each file.

Compter analyses each separate sequence in the file and calculates a composition for that - it then plots the clustered view of all of the sequences (not sequence sets) but then highlights the original set information so you can see:

Whether all of your sequences in a set behave similarly
Whether you could have predicted your sets by knowing your composition
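To make the per-sequence idea concrete, here is a minimal R sketch of calculating a composition for each separate sequence. It is not compter's code: the helper name `kmer_composition`, the choice of k = 2 and the toy sequences are all assumptions for illustration, and it assumes the sequences contain only A, C, G and T.

```r
# Hypothetical helper (not compter's implementation): the fraction of each
# k-mer within a single sequence. Assumes only A/C/G/T are present.
kmer_composition <- function(dna, k = 2) {
  stopifnot(nchar(dna) >= k)
  bases <- c("A", "C", "G", "T")
  # Every possible k-mer, in a fixed order so all sequences share the same columns
  all_kmers <- apply(expand.grid(rep(list(bases), k)), 1, paste, collapse = "")
  # Slide a window of width k along the sequence
  starts <- seq_len(nchar(dna) - k + 1)
  observed <- substring(dna, starts, starts + k - 1)
  counts <- table(factor(observed, levels = all_kmers))
  setNames(as.numeric(counts) / length(observed), all_kmers)
}

# Made-up sequences standing in for two input sets
seqs <- c(set1_a = "ACGTACGT", set2_a = "GGGGCCCC")

# One row per sequence, one column per k-mer
comp <- t(sapply(seqs, kmer_composition))
```

A matrix like `comp`, with one row per sequence rather than one row per file, is the kind of input a clustered heatmap would then work from.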

We've tended to use this more on candidate regions rather than raw fastq data. It's been really useful where we think that results may be coming from a technical bias or we think that composition might be biologically involved in an effect. We've also used it as a pre-analysis before doing any kind of motif detection since if the composition is skewed then motif detection will also likely go horribly wrong.

darogan commented 7 years ago

Yes, sorry, I was explaining my idea very badly. seqTools clusters at the file level but doesn't visualise which kmers are over- or under-represented, whereas your heatmap shows this nicely at the per-sequence level.

What I was suggesting is a file-level summary of the per-sequence kmer analysis, although from your use case above it's probably outside the scope of compter.

Something very simple like the pseudocode below would probably achieve what I had in mind. And the goals would be to:

Identify if there are any differences in overall kmer composition across a set of files
If there are unexpected differences, use the heatmap to identify the bias (e.g. pick a representative file from each cluster).

fileSummary
File1: Kmer1Median, Kmer2Median, KmerNMedian
File2: Kmer1Median, Kmer2Median, KmerNMedian
FileN: Kmer1Median, Kmer2Median, KmerNMedian

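# fileSummary: one row per file, one column per kmer median
d <- dist(as.matrix(fileSummary))  # Euclidean distance between files (dist default)
hc <- hclust(d)                    # complete-linkage clustering (hclust default)
plot(hc)                           # draw the dendrogram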
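A worked version of the pseudocode above, as a rough sketch with made-up numbers. The four files, the three kmer medians and the choice of two clusters are all assumptions for illustration. It uses only base R (`dist`, `hclust`, `cutree`) and shows how one representative file per cluster could be picked for a closer look in the heatmap.

```r
# Toy data (made up): median frequency of three kmers in four files.
fileSummary <- rbind(
  File1 = c(0.25, 0.25, 0.50),
  File2 = c(0.26, 0.24, 0.50),
  File3 = c(0.24, 0.26, 0.50),
  File4 = c(0.60, 0.10, 0.30)   # deliberately different composition
)
colnames(fileSummary) <- c("Kmer1Median", "Kmer2Median", "Kmer3Median")

d  <- dist(fileSummary)       # Euclidean distance between files
hc <- hclust(d)               # complete linkage by default
groups <- cutree(hc, k = 2)   # split the tree into two clusters (k is an assumption)

# One representative per cluster: simply the first file in each group
representatives <- tapply(names(groups), groups, function(x) x[1])

# Only draw the dendrogram when run interactively
if (interactive()) plot(hc)
```

With this toy input, File4 ends up in a cluster of its own, so it would be the obvious file to compare against one of the others.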

s-andrews commented 7 years ago

So actually a really easy way to do this would be to add a parameter (--perfile for example) which, instead of outputting a line for each sequence, creates a mean (or median or whatever) answer per file. That way the heatmap would cluster files and you could then find pairs of files which you could look at in more detail.
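Until something like that exists, the same collapse can be approximated in R after the fact. This is only a sketch: the `perSequence` table, its `file` column and the kmer columns are hypothetical stand-ins for per-sequence output, not compter's actual output format, and --perfile itself is just the suggestion above.

```r
# Hypothetical per-sequence table: one row per sequence, tagged with its source file.
perSequence <- data.frame(
  file = c("File1", "File1", "File2", "File2", "File2"),
  AA   = c(0.10, 0.20, 0.40, 0.50, 0.60),
  AC   = c(0.30, 0.10, 0.05, 0.10, 0.15)
)

# Collapse to one row per file; swap median for mean (or whatever) as needed.
perFile <- aggregate(. ~ file, data = perSequence, FUN = median)

# Numeric matrix with files as row names, ready for dist()/hclust() or a heatmap
fileSummary <- as.matrix(perFile[, -1])
rownames(fileSummary) <- as.character(perFile$file)
```

The resulting `fileSummary` has the same shape as the per-file summary discussed earlier in the thread, so files rather than individual sequences get clustered.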

darogan commented 7 years ago

That sounds like it would work!

s-andrews / compter

Hierarchical Clustering #1