hansenlab / minfi

Devel repository for minfi
A/B compartment agreement across array platforms (450k and EPIC) and multiple datasets #215

Open santanalele opened 3 years ago

santanalele commented 3 years ago

Hi!

I am a PhD student and have been using the compartments function in minfi (as described in your 2015 paper, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-015-0741-y), and I have some questions that came up while working through my analyses.

Here's some background information:

I've investigated the reproducibility of compartment prediction from Illumina DNA methylation array data across different datasets, including blood, fibroblasts, EBV-transformed lymphocytes and skeletal muscle. Some of these were the datasets used in your paper. The compartments function was used to generate compartments from both 450k and EPIC array data.

Firstly, I tested whether my results reproduced those in the paper. By visual inspection, the fibroblast (GSE52025, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52025) and EBV (GSE36369, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE36369) compartments agreed in their open/closed assignments, which suggested that I was executing the code correctly. However, for blood (GSE54882, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE54882) the compartment prediction appeared to be inverted: the sign was flipped, so the open and closed compartments were swapped. The figures below compare my results (top compartments) with the paper's results (bottom compartments). For the blood data there are three plots: my results on top, your results in the middle, and at the bottom the compartments after a "manual" inversion.

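[image: reproducibility_ebv_compartments] https://user-images.githubusercontent.com/17224514/102734784-8a118180-43a5-11eb-8818-301d583f4698.png

[image: reproducibility_fibroblasts_compartments] https://user-images.githubusercontent.com/17224514/102734822-a7465000-43a5-11eb-8fb2-d83cb0506e3c.png

[image: reproducibility_blood_compartments] https://user-images.githubusercontent.com/17224514/102734839-b3321200-43a5-11eb-9c0b-4b37e4434413.png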

Then I generated compartments with other datasets and compared the results. I became particularly intrigued by a dataset (GSE86833, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE86833) that assayed methylation in blood (n = 5) and fibroblasts (n = 3) on both 450k and EPIC arrays, from the same bisulfite conversion ("matched arrays"). When comparing compartments from these data with compartments generated from other data for the same tissue, some appear to be "inverted" (highlighted in the following figure).

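[image: plot_chr14_compartments_epic_450k_inversion_highlight] https://user-images.githubusercontent.com/17224514/102735383-330cac00-43a7-11eb-9661-9d6dbf5bd9f7.png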

Would you be able to comment on how the comparison within a tissue (i.e. blood vs blood, or fibroblast vs fibroblast) should work for 450k and EPIC? For fibroblast vs fibroblast (same platform, different datasets), are the observed differences expected?

Considering that EPIC arrays have almost twice as many probes as 450k arrays, I tested whether restricting the analysis to probes shared between the two arrays would improve agreement between the matched arrays (GSE86833). Indeed, after filtering to shared probes only, the matched arrays showed higher agreement in compartment assignment. So I applied the same probe filtering to all other datasets, re-generated the compartments, and noticed that there are still differences within the same array platform and same tissue for the fibroblast samples (as shown below).

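[image: shared_compartments_chr14_fibroblasts_only_highlights] https://user-images.githubusercontent.com/17224514/102735967-9f3bdf80-43a8-11eb-80d4-5b7090978390.png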
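The shared-probe filtering described above amounts to intersecting the probe IDs of the two platforms before any further analysis. A minimal sketch in plain Python, assuming each dataset is a mapping from probe ID to per-sample beta values; the probe IDs, values, and the helper name `restrict_to_shared` are made up for illustration and minfi itself is not used here:

```python
# Hypothetical probe IDs and beta values (one list entry per sample).
betas_450k = {
    "cg0001": [0.10, 0.15],
    "cg0002": [0.80, 0.75],
    "cg0003": [0.40, 0.45],
}
betas_epic = {
    "cg0001": [0.12, 0.14],
    "cg0003": [0.42, 0.43],
    "cg0004": [0.90, 0.88],
}


def restrict_to_shared(a, b):
    """Keep only the probes present in both datasets, in a common order."""
    shared = sorted(a.keys() & b.keys())
    return {p: a[p] for p in shared}, {p: b[p] for p in shared}


shared_450k, shared_epic = restrict_to_shared(betas_450k, betas_epic)
print(list(shared_450k))
```

Sorting the shared IDs gives both datasets the same probe order, so any downstream per-probe comparison lines up.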

Finally, I tried to understand the sign inversion step (https://github.com/hansenlab/minfi/blob/master/R/compartments.R#L288) by assessing whether these inversions could result from a small correlation value (negative or positive, but close to zero) dictating an "incorrect" inversion. However, I couldn't find an explanation or a pattern.

I would be really grateful if you could provide some help/guidance on whether I am doing something wrong.

Appreciate everyone's time taken to read/help with this query!

Kind regards,

Alessandra

kasperdanielhansen commented 3 years ago

This is a quick reply to something that is really thoughtful, but I thought it was better to reply quickly than to think too much and then never reply.

  1. Sign (chromosome-wide): the sign of an eigenvector is not unique, i.e. if e is an eigenvector, so is -e. For this reason, you can change the sign to whatever you want, as long as you change it for the entire chromosome. People (including us) have various ad hoc solutions for setting the sign. For example, if you have a compartment vector from another experiment, you can compute cor(compartment, e) and cor(compartment, -e) and pick whichever sign gives you a positive correlation.

  2. We did not have good results in our paper with a blood dataset, and we don't know why. Is blood special? Is it more affected by cell type composition between samples? We really don't know, but the blood vector you show changes compartments pretty rapidly. Is that true, or is it because of some issue? I'm just asking rhetorical questions here, but reminding you that we did not get it to work on the blood dataset we looked at.

  3. The method is based on correlation. I would be very skeptical of using such a method with a very small sample size (say n = 3 or 5). We don't know how small n can be, but I just think 3 or 5 is too small.

  4. Your last example - of a sign inversion happening locally - is new to me. I have no good explanation.

  5. We have a method in minfi for dropping the non-450k probes on the EPIC array.

  6. I am a bit surprised that you need to drop EPIC probes to get the same result. My first thought is sample size, but granted, that's just guesswork. Would be nice with more samples though.
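The sign-fixing idea in point 1 can be sketched as follows. This is a minimal illustration in plain Python, not minfi's code: `pearson`, `align_sign`, and the example vectors are hypothetical names and made-up values, and `reference` stands in for a compartment vector from another experiment.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def align_sign(e, reference):
    """Return e or -e, whichever correlates positively with reference.

    The flip is applied to the whole chromosome-wide vector at once.
    """
    if pearson(reference, e) < 0:
        return [-v for v in e]
    return list(e)


# Made-up per-bin values for one chromosome.
reference = [0.8, 0.5, -0.3, -0.9, -0.4, 0.6]
e = [-0.7, -0.6, 0.2, 1.0, 0.5, -0.4]  # similar pattern, opposite sign

aligned = align_sign(e, reference)
print(pearson(reference, e) < 0, pearson(reference, aligned) > 0)
```

Because the flip is applied to the entire vector, it cannot by itself produce or repair an inversion that affects only part of a chromosome, which is the situation in point 4.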

On Mon, Dec 21, 2020 at 4:51 AM Alessandra Santana notifications@github.com wrote:

-- Best, Kasper

PhilJur commented 3 years ago

Dear Alessandra,

some time has passed since your initial post and you might have found a solution in the meantime, but I would like to bring up this topic again. Unfortunately, I cannot really help clarify this issue. I have recently started to implement the compartments function in my workflow and stumbled upon your thread, as I plan to investigate combined 450k and EPIC datasets as well. I tried to recreate your problem using a dataset of 450k (n = 40) and EPIC (n = 12) olfactory neuroblastomas, and the results for the merged, 450k-only and EPIC-only datasets are almost identical. Chr22 of my analysis is depicted below. I highlighted the few and really subtle differences; otherwise they are impossible to spot.

[image: ONB]

Furthermore, I did not have to drop EPIC specific probes. I think this might support the theory that this is caused by small sample size or that it is a blood-specific problem?
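To illustrate why a very small n is a concern for any correlation-based method, the sketch below simulates how often a sample correlation comes out with the wrong sign when the true correlation is positive. It is a generic illustration of sampling variability under assumed settings (bivariate normal data, a true correlation of 0.5, helper names chosen here), not a simulation of the compartments function or of any dataset in this thread.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def simulate_correlations(n, rho, reps, rng):
    """Sample correlations from `reps` draws of size n with true correlation rho."""
    out = []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        # y shares a fraction rho of x's signal plus independent noise.
        y = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x]
        out.append(pearson(x, y))
    return out


rng = random.Random(0)  # fixed seed so the run is repeatable
fractions = {}
for n in (3, 5, 40):
    r = simulate_correlations(n, rho=0.5, reps=2000, rng=rng)
    fractions[n] = sum(v < 0 for v in r) / len(r)
    print(f"n={n:2d}  fraction of negative estimates: {fractions[n]:.3f}")
```

Under these assumptions the estimate has the wrong sign far more often at n = 3 or 5 than at n = 40. The sketch says nothing about where the practical lower limit for n lies.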

I don't want to hijack the thread, but I observed something odd when I tested this out. I see a perfect 50:50 distribution of predicted closed and open chromatin sites on every chromosome and in every dataset that I have tested so far. I have checked the example dataset by Jean-Philippe Fortin (https://github.com/Jfortin1/TCGA_AB_Compartments/tree/master/data) and this does not seem to be the case for the TCGA datasets he has processed (e.g. PRAD 8883 closed vs. 12601 open; BRCA 10116 closed vs. 12076 open). As I mentioned above, I have only recently started using this function, so I might be completely mistaken here, but I don't think this is expected, is it? I observed this in the 450k-only, EPIC-only and merged datasets. I also tried different preprocessing techniques and filtering steps, but the results are pretty much the same. Could this be a sample size issue as well? My total sample size was usually between 20 and 60.

Thanks for everyone's support!

Best Philipp