samschaf / team_Methylhomies


Progress report of Methylhomies #4

Open Hbrewis opened 7 years ago

Hbrewis commented 7 years ago

@ppavlidis @santina

Here is our progress report. Our final commit is b4a363b. Any related files in GitHub are linked within the progress report .md file.

All the best,

singha53-zz commented 7 years ago

@STAT540-UBC/team-methylhomies

Question for you: "PCA_full<-princomp(uncor.dat[complete.cases(uncor.dat),]) # scaling is not necessary for normalized dataset" --> what do you mean by scaling is not necessary for normalized data?

samschaf commented 7 years ago

@singha53 Scaling is an optional step in R's PCA functions (cor=TRUE in princomp(), which runs the PCA on the correlation matrix; scale.=TRUE in prcomp()); it is used when the input variables are not measured on the same scale. Because our data was normalized for the two probe types before PCA, it is already on a common scale, so we left the default (no scaling) in the princomp() call. Does that clarify your question?
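For reference, the two flavors of the call can be sketched like this (simulated stand-in data, not our dataset):

```r
# Sketch: PCA on the covariance matrix (the princomp() default) vs. on the
# correlation matrix (cor = TRUE, equivalent to scaling each variable to
# unit variance). 'dat' is a hypothetical samples x probes matrix.
set.seed(1)
dat <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)

pca_cov <- princomp(dat)              # no scaling: covariance-matrix PCA
pca_cor <- princomp(dat, cor = TRUE)  # scaled: correlation-matrix PCA

# If every column is already on the same scale, the two give similar results.
```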

ppavlidis commented 7 years ago

Looks like good progress.

I was struck by the huge first PC in these data - this is usually a sign of either a very big batch or tissue effect, or a failure to correctly scale the data.

Generally I scale both the rows and columns of my data matrix before PCA/SVD. (I'm used to using svd() for this, not princomp(), at least in recent times). I discussed this in lecture.

The p-values you plot for associations of factors with PCs aren't what I'd use: I'd want to know the correlation, since the p-value is affected by sample size, and "significance" isn't what we're after in PCA exploration. The magnitude of the correlation is more interesting. I also usually find that the sample-sample correlation heatmap helps me understand the data, at least as an adjunct to PCA.

It's unexpected that tissue does not appear associated with any of the first PCs, since tissue is known to have big effects on methylation (sort of like gene expression does). I suggest looking into this more. I'd similarly expect you might see cell type proportion estimates correlated with a PC (maybe).
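The scaling step described above could be sketched like this (an assumed workflow on simulated data, not the team's actual code):

```r
# Sketch: standardize the rows, then the columns, of a tall-skinny
# probes x samples matrix before taking the SVD. 'mat' is simulated data.
set.seed(2)
mat <- matrix(rnorm(200 * 6), nrow = 200, ncol = 6)  # probes x samples

scaled <- scale(t(scale(t(mat))))  # scale rows first, then columns
scaled[is.na(scaled)] <- 0         # guard against zero-variance rows/columns

s <- svd(scaled)
# Right singular vectors (columns of s$v) are the per-sample components;
# the proportion of variance captured by each component:
var_explained <- s$d^2 / sum(s$d^2)
```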

samschaf commented 7 years ago

@ppavlidis Thanks for your feedback! Definitely agree that tissue/cell type not showing up in the PCA is concerning; it's something we are currently looking into, as it's the opposite of what we expected. We are actually planning to redo some of the normalization using another method we just discovered, which should be more stringent in this case, so maybe that will help.

As for the p-value associations, we were using those because they are part of the standard methylation analysis workflow employed by the Kobor lab. Good point that relying too heavily on p-values can be an issue. We can certainly add a heatmap as well...

samschaf commented 7 years ago

Also re: scaling for PCA, others in our lab have compared normalized data without scaling vs. non-normalized data with scaling, and found that the two produced the same output.

ppavlidis commented 7 years ago

Are you saying they tried scaling of the rows or columns? I don't see that you scaled the rows (CpG-wise) and I'd be surprised (but interested) if it doesn't matter. It definitely matters for gene expression data; and it seems to me it should matter for DNAme (though I don't recall trying this myself).

Longer explanation:

I find it easier to think about the SVD on our tall skinny data matrix. The left singular vectors are the principal components with respect to the columns (and they form a basis for the samples; each sample is a linear combination of the left SVs). If the rows have different means, then that information has to be captured, and in my experience it's in the first PC. If you examine the first left SV it will be basically filled with the mean of each row (or a value proportional to the mean).

The eigenvalue associated with that vector will be relatively large to the extent that the variance is higher across rows than within rows. For gene expression it's much higher between rows (lowly vs highly expressed genes), for methylation maybe the effect could be smaller but generally if a site has a beta near 1 it will stay there in all samples, vs. one that is near 0 will stay near 0 (that is, the variance of beta per-site is <<1; for m values same idea).

In fact, with expression data if you forget to standardize the rows, it's not surprising if the first PC captures over 90% of the variance. For methylation I wouldn't be surprised if it's a bit less - and that comports with your result.
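The effect described above is easy to demonstrate on simulated data (a sketch, not the course dataset): give the rows very different means, and the first component absorbs almost all the variance until the rows are standardized.

```r
# Rows (probes) with very different baselines inflate the first component;
# row standardization removes that structure. Simulated stand-in data.
set.seed(3)
n_probes <- 500; n_samples <- 8
row_means <- runif(n_probes, -3, 3)            # probe-specific baselines
dat <- matrix(rnorm(n_probes * n_samples, mean = row_means, sd = 0.2),
              nrow = n_probes)

pc_raw    <- svd(dat)$d^2 / sum(svd(dat)$d^2)
dat_std   <- t(scale(t(dat)))                  # standardize each row
pc_scaled <- svd(dat_std)$d^2 / sum(svd(dat_std)$d^2)

pc_raw[1]     # dominated by the row means
pc_scaled[1]  # much smaller after row standardization
```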

If you don't understand all that, that's okay - just try standardizing the rows before you do PCA :)

If you already did, sorry to bother you! But it's good to verify because your result looks exactly like what I'd expect if you forgot to scale the rows.

samschaf commented 7 years ago

I'm not actually sure whether the rows or columns were scaled during that test -- I will have to look into the specifics of how scaling works in the princomp() call! And yes, I did notice the first PC looking too large as well... I echo your sentiment that these PCA results were slightly fishy overall.

ppavlidis commented 7 years ago

I was just checking to be sure, so I believe the following is accurate, but buyer beware:

princomp and svd do not do the scaling for you, you have to do it with stuff like t(scale(t(data))) first.

prcomp can do it if you pass in the data transposed from the usual tall-skinny, like prcomp(t(unscaled.dat), center=T, scale=T)

If you don't transpose first, prcomp doesn't do what we want - because it's scaling the columns.
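The two options above, sketched on a hypothetical tall-skinny matrix (simulated stand-in for unscaled.dat, probes in rows and samples in columns):

```r
# Simulated probes x samples matrix with probe-specific baselines.
set.seed(4)
unscaled.dat <- matrix(rnorm(100 * 5, mean = rep(runif(100, 0, 4), 5)),
                       nrow = 100)

# 1. princomp()/svd(): scale the rows yourself first.
row.scaled <- t(scale(t(unscaled.dat)))
s <- svd(row.scaled)

# 2. prcomp(): transpose so samples are rows; its "columns" are then the
#    probes (the original rows), so center/scale does the right thing.
p <- prcomp(t(unscaled.dat), center = TRUE, scale. = TRUE)

# Without the transpose, prcomp() would center/scale the samples (columns),
# which is not what we want here.
```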

Trivia:

The reason functions like prcomp seem to want to do stuff to columns rather than rows is because in "regular" statistics it's more natural. Usually our genomics data matrices are presented transposed from what we learn in statistics - the rows of our genomics data are the "variables" but in statistics we're usually shown tables with variables in columns. It's R's "mindset" that variables are columns.

samschaf commented 7 years ago

Okay that makes sense, thank you for looking into it -- will try with transposing and then scaling when we re-run this

santina commented 7 years ago

I don't have many comments on the progress report, except what Paul already pointed out with the PC plots.

Other comments:

samschaf commented 7 years ago

@santina The density plot shows the distribution of neuronal proportions in the data after it is broken down to predicted neuron/glia ratios. In this case neuronal proportion is a value between 0 and 1 - i.e. a proportion of 0.3 means 30% neurons and 70% glia in that sample. So the density plot shows us on average, what % neurons the data contains.
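A minimal sketch of such a plot, using a simulated stand-in (neuron_prop is hypothetical, not the real predicted proportions):

```r
# Density of per-sample neuronal proportions, each a value in [0, 1].
set.seed(6)
neuron_prop <- rbeta(100, 3, 5)   # simulated proportions
plot(density(neuron_prop), main = "Neuronal proportion across samples",
     xlab = "Proportion of neurons")
```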

samschaf commented 7 years ago

@ppavlidis tried princomp() with transposed data and scaling and it still looked the same as without scaling!

samschaf commented 7 years ago

However, we went back and retrieved a very similar dataset normalized by another method more appropriate for our data (QN + BMIQ instead of dasen), and this makes the PCA look quite different -- big tissue and cell type effects in the uncorrected data, and a smaller AD effect. This is reassuring, as it is more along the lines of what I expected.

ppavlidis commented 7 years ago

Interesting, I guess I was wrong! I'm surprised normalization choice has such a huge effect on the data, since all of those methods are supposedly "OK". I'm not used to it being a crucial choice.

I'm curious enough to want to look for myself to understand what is going on. Can you give me the two data matrices? One way would be for me to clone your repo but if it isn't going to be easy to get working at this state (for me), maybe you can use saveRDS to generate files I can get? If it's a hassle nevermind, I'm just curious.

Glad you figured it out.

samschaf commented 7 years ago

@ppavlidis the files are quite large (>1GB) and we have been passing them between each other via USB so not sure if it would be easy to pass them to you? If you still want to check it out though I could USB transfer them to you after class tomorrow

ppavlidis commented 7 years ago

If you have the files on a stick tomorrow I'll have my laptop so I can grab them, but it's not a big deal - just satisfying my curiosity, purely selfish reasons - no need to bother if it is at all inconvenient.

samschaf commented 7 years ago

I returned to my PCA scripts to pull out the correlations rather than p-values as you suggested, and edited how I was scaling; this time it did make a difference (PC sizes were more evenly distributed regardless of correction method), so perhaps I was doing it incorrectly at first. However, now I don't understand the covariate associations popping up. In the uncorrected data, variance is most highly associated with row and Tissue. Batch correction made the correlation coefficients much larger for everything except Neuron and age.brain, and cell type correction didn't change this much. My source code is here - a rough non-knitted version.

Am I interpreting this correctly that a larger correlation coefficient means this covariate is more closely associated with this PC? And if so, why would batch correction show this effect? When I had it done by p-value the pattern was a lot closer to what I expected, i.e. big batch effects existed at first, and it appeared I had successfully removed these and removed cell type effects as well.
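For a continuous covariate, the association being discussed can be sketched like this (an assumed approach on simulated data, not the heat-scree code itself):

```r
# Correlate each PC's per-sample scores with a continuous covariate;
# the absolute value of the coefficient measures how strongly the
# covariate tracks that PC. All data here is simulated.
set.seed(5)
dat <- matrix(rnorm(300 * 12), nrow = 300)   # probes x samples
age <- rnorm(12, mean = 70, sd = 8)          # hypothetical covariate

s   <- svd(t(scale(t(dat))))                 # row-standardize, then SVD
pcs <- s$v                                   # columns = per-sample PCs

pc_age_cor <- apply(pcs, 2, function(pc) cor(pc, age))
abs(pc_age_cor)   # larger |r| = covariate more associated with that PC
```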

ppavlidis commented 7 years ago

I can't really tell from your code (at a glance) but yes, if you do batch correction, you should (pretty much) lose any PCs that are correlated with batch. What you are reporting does indeed suggest a problem but I can't tell what it is without more work.

samschaf commented 7 years ago

I found the error in my code - I had changed it to correlation for the continuous variables but not the categorical ones, so those plots were actually showing a mix of p-values and correlations, which is why they looked so strange. Also, Rachel says hi :)

ppavlidis commented 7 years ago

Hi back to Rachel!

When I do PCA on the non-scaled data, the first PC is indeed accounting for well over 90% of the variance. After scaling this goes away. That settles my original source of curiosity, way up-thread.

About the rest, in GSE43414_batch_cor I don't see any obvious association of the first PC with batch as expected, and the first two PCs are associated with Tissue, with cerebellum standing out, as I'd expect. (I assumed you used "chip" as batch since there was nothing else obvious)

That's as far as I got. I used SVD so I'm looking at the columns of the 'V' matrix and the values in d are proportional to the square roots of the eigenvalues of the data covariance matrix.

Anyway it sounds like you figured it out yourself.

By the way, the way you had the data set up made it a little harder than it needed to be:

samschaf commented 7 years ago

Okay glad we are on the same page now. I used both chip and row for batch, correcting for row first followed by chip. Thanks for the tips re: organization, we've found ways to restructure/work around as needed but will keep this in mind for future!

samschaf commented 7 years ago

@ppavlidis I am trying to adapt the existing PCA heat scree code to show correlations for all variables, rather than ANOVA p-values for the categorical ones, but am having some trouble. I looked into the polycor package and its polychor() function. Running this took longer than overnight, probably because I didn't give it a contingency table as input. And I don't even know whether this is the correct approach in this case.

Anyhow, what I'm wondering is whether in your opinion, it would be worth it to continue troubleshooting this when it would likely show the same pattern as p-value and when we are trying to wrap up everything else for our poster?

ppavlidis commented 7 years ago

Probably not worth it.