ATAC-seq normalization question

Rafaelsoler13 commented 3 years ago

Hi Jake!

I have read your paper comparing 8 different ATAC-seq normalization methods, and I found it very interesting. I love it! However, I was left with an important question after finishing it.

As you mention in the paper, it all depends on the experiment you are doing and the results you expect. If you are comparing two conditions and you expect chromatin to be more accessible in one than in the other, then applying a stricter normalization method such as loess, which "assumes a symmetric global distribution in which there are no true biological global differences in ATAC reaction efficiency or distribution", could mask that difference. I am therefore unsure which normalization method to use in the end. Which option do you think would be most appropriate? (If it is not among these, I am also open to other options.)

1. Always normalize using stricter methods such as loess, accepting some type II errors (less problematic than type I errors) in exchange for a robust result.
2. If we have RNA-seq data for validation, choose the normalization whose ATAC-seq results correlate best with the RNA-seq data, regardless of what we expect biologically.
3. Run all the normalization methods and keep the distribution that makes the most biological sense for our experiment.

Thank you very much for reading, and let's keep in touch! Regards and thanks for everything 😁

reskejak commented 3 years ago

Hi Rafael - thanks for reaching out, and glad you appreciate the resource.

You have brought up some important questions without a clear answer. We had similar questions initially, and we felt they were important enough that we should write this manuscript!

First, we specifically did not want to select a "best" method, because the right choice probably depends on experimental design, as you stated. That said, for the typical experimental designs that my colleagues and I encounter, normalizing with loess or quantile makes sense to me given the biological assumptions. I cannot think of many biological scenarios in which one would truly expect globally increasing or decreasing chromatin accessibility at virtually every regulatory site in the genome; the only one that comes to mind is zygotic genome activation (fully inactive genome --> early activation). Or, in the case of ChIP-seq, if you had a drug that inhibited polycomb H3K27me3 deposition and wanted to measure genome-wide H3K27me3 levels at a high drug dose vs. control, you might expect to see global loss. Even so, we saw nice concordance with gene expression and other features using loess normalization in a recent ChIP-seq experiment where we expected slight global loss of the antibody target (PMID: 33176148, Figure 5B, far-right MA plot). To me that really showed that, while quantile and loess appear to be highly conservative normalization methods, they can still be sensitive to mostly unidirectional changes, provided that most of the tested regions are expected to be unchanged. This assumption is actually similar to that of TMM, though I have found that TMM does not capture or correct trended biases that differ between high- and low-signal regions.
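To make the last point concrete, here is a minimal Python sketch (not code from the paper or from csaw) of why a single scaling factor cannot remove an intensity-dependent trend. Everything in it is an assumption for illustration: the function names are hypothetical, the counts are synthetic, median-centering stands in for a single-factor correction (it is not TMM itself), and per-bin medians stand in for a fitted loess curve.

```python
import math
from statistics import median

def ma_values(x, y, prior=0.5):
    """Return (A, M) for two count vectors: A = mean log2 signal, M = log2 fold change."""
    a_vals, m_vals = [], []
    for xi, yi in zip(x, y):
        lx, ly = math.log2(xi + prior), math.log2(yi + prior)
        a_vals.append((lx + ly) / 2)
        m_vals.append(ly - lx)
    return a_vals, m_vals

def bin_index(a_vals, n_bins=4):
    """Assign each region to one of n_bins equal-sized bins ordered by A."""
    order = sorted(range(len(a_vals)), key=lambda i: a_vals[i])
    size = math.ceil(len(order) / n_bins)
    bins = [0] * len(a_vals)
    for rank, i in enumerate(order):
        bins[i] = rank // size
    return bins

def bin_medians(m_vals, bins):
    """Median M per bin: a crude stand-in for a smooth trend through the MA plot."""
    return [median(m for m, b in zip(m_vals, bins) if b == k)
            for k in sorted(set(bins))]

# Synthetic counts with a purely technical bias that grows with signal.
x = [10 * 2 ** (i / 10) for i in range(80)]
y = [xi ** 1.1 for xi in x]

a_vals, m_vals = ma_values(x, y)
bins = bin_index(a_vals)

# Single-factor correction: shift every region by the same amount.
global_shift = median(m_vals)
scalar_corrected = [m - global_shift for m in m_vals]

# Trend-aware correction: shift each region by the median of its own A bin.
trend = bin_medians(m_vals, bins)
trend_corrected = [m - trend[b] for m, b in zip(m_vals, bins)]

scalar_medians = bin_medians(scalar_corrected, bins)
trend_medians = bin_medians(trend_corrected, bins)
scalar_spread = max(scalar_medians) - min(scalar_medians)
trend_spread = max(trend_medians) - min(trend_medians)
print(f"spread of bin medians, single factor: {scalar_spread:.3f}")
print(f"spread of bin medians, trend-aware:   {trend_spread:.3f}")
```

In this toy case the single-factor correction leaves the low- and high-signal bins offset from each other, while the trend-aware correction centers every bin. A real analysis would use a proper loess fit rather than four bins.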

Honestly, I think all 3 strategies you have listed are appropriate. In my personal experience with various ATAC-seq and related data sets, I have found loess to be an appropriate "default" (for a two-condition differential comparison) based on my assumptions and technical considerations, but you may want to explore other methods your first time around. If I were to analyze a large cohort of, e.g., 100+ samples with various conditions and underlying biological/technical confounders, I would probably choose to interpret the quantile-normalized measurements. As stated, RNA-seq (and other orthogonal assays) are really useful tools for interpreting possible ATAC changes. As we saw in the paper, RNA and ATAC changes are not 1:1, but we expect them to be associated to some degree. Your suggestion number 2 may not give you a clear "best" answer, because we also saw that the FDR threshold selected can dramatically affect the RNA overlap. But you should expect that the differentially accessible peaks from a robust ATAC analysis overlap with gene expression changes better than those from a poorer normalization method. I am a bit hesitant to suggest simply looking at the MA plot distribution (as you suggest in 3) and making a selection based on that, but I do think this is incredibly useful for diagnosing potential errors in the data. If you have an experiment, say cell line protein X knockdown vs. control, and you see a clear upward global accessibility trend in the MA plot following a TMM normalization or CPM transformation etc., then I would absolutely be cautious about interpreting that and would suggest using loess or quantile normalization instead. But in the case where your data look nice and evenly distributed with a simple CPM transformation, you may not have to worry much about fancier normalization approaches. Food for thought.

For what it's worth, I would naively choose option 1 after seeing enough enrichment data sets myself (ATAC, ChIP, CUT&RUN, CUT&Tag, etc.), but 2 and 3 are also useful. For your first data set, I would suggest exploring a bit: even just run a loess and a TMM normalization in csaw, for example, and compare the RNA-seq overlap and the MA plot distributions. That should help diagnose whether you have substantial bias.
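One way to set up the RNA-seq overlap comparison is sketched below in Python. This is only an illustration under stated assumptions: the function name, the peak-to-gene assignment, and all FDR values are invented toy inputs with no biological meaning, and in practice the per-peak FDRs would come from your differential accessibility results under each normalization.

```python
def rna_overlap_fraction(peak_fdr, peak_to_gene, de_genes, fdr_cutoff):
    """Fraction of differentially accessible (DA) peaks whose assigned gene is DE.

    Returns None when no peak passes the cutoff.
    """
    da_peaks = [p for p, fdr in peak_fdr.items() if fdr <= fdr_cutoff]
    if not da_peaks:
        return None
    hits = sum(1 for p in da_peaks if peak_to_gene.get(p) in de_genes)
    return hits / len(da_peaks)

# Hypothetical peak-to-gene assignment (e.g. nearest gene) and DE gene set.
peak_to_gene = {"peak1": "geneA", "peak2": "geneB", "peak3": "geneC", "peak4": "geneD"}
de_genes = {"geneA", "geneB"}

# Hypothetical per-peak FDRs from the same data under two normalizations.
fdr_by_method = {
    "loess": {"peak1": 0.01, "peak2": 0.03, "peak3": 0.20, "peak4": 0.60},
    "tmm": {"peak1": 0.04, "peak2": 0.30, "peak3": 0.02, "peak4": 0.08},
}

# Report the overlap at more than one cutoff, since the threshold matters.
overlap = {
    method: {cutoff: rna_overlap_fraction(fdrs, peak_to_gene, de_genes, cutoff)
             for cutoff in (0.05, 0.10)}
    for method, fdrs in fdr_by_method.items()
}
for method, by_cutoff in overlap.items():
    print(method, by_cutoff)
```

Tabulating the overlap across several FDR cutoffs, rather than at a single one, reflects the caveat above that the chosen threshold can change the RNA overlap substantially.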

Hope this helps!

Best, Jake

Rafaelsoler13 commented 3 years ago

Thank you!! I need some help with the ATAC workflow. May I email you with a few questions?

Best regards,

Rafa
