snystrom / cutNrun-pipeline

Analysis pipeline for CUT&RUN Data. Currently configured for use in the McKay Lab at UNC, but extensible elsewhere by changing configuration files.

Stop calling MACS2 peaks from ControlDNAPath. Use sample-specific inputs, or MACS default background adjustments. #29

Closed. snystrom closed this issue 4 years ago.

mniederhuber commented 2 years ago

@snystrom just curious, why is a standard control DNA file not a good fit for CnR peak calling?

snystrom commented 2 years ago

A couple of reasons. First, the control DNA file is from sonicated gDNA, so it isn't an accurate representation of CnR signal, or even of what background CnR coverage would look like.

Second, that file is really old: it's 25 bp single-end sequencing from an Illumina Genome Analyzer II, from around 2012. Chris generated a bunch of IgG inputs in wing discs before he left, using Kami's Ampure-based CnR protocol, that you could use. I also did some using PCI, if you can find those data (sequenced around March 2021, I think).

snystrom commented 2 years ago

I also did some testing comparing MACS calls using the gDNA background, IgG, and no control, and thought the peaks called against the gDNA background underperformed, but that was just by eye. I think Chris did a more formal analysis of this, too. Compared to SEACR, it also seemed to give better calls.

BenjaMcM commented 2 years ago

@snystrom @mniederhuber Hey guys, if gDNA isn't used to control for background, is no control acceptable? Or should I use IgG data instead? If so, would you mind sharing the IgG data? I have been doing my first CnR analysis (using this Snakemake) and have run the analysis both with the gDNA background and without. I'm not sure which results to use and want to ensure I'm following best practices. Thanks!

snystrom commented 2 years ago

gDNA isn't a great control for CUT&RUN; it's a bit of a holdover from our FAIRE pipeline. IgG is the "best practice" approach, but if you don't have it, it's fine to use no background, with the caveat that some of your peaks might not be real.

BenjaMcM commented 2 years ago

@snystrom Ok cool, thanks! I'm a member of the Matera lab doing quite a bit of computational work, so I may have future questions as well!

snystrom commented 2 years ago

Oh also, this is kinda buried in the README, but:

"Note: calling peaks with MACS2 against a controlDNA file is probably not the best control for CUT&RUN data. This is a holdover from other FAIRE and ATAC pipelines we kept to account for copynumber variation, etc. The controlDNAPath setting is therefore not required per se. If you wanted to leave this out for now, simply set the value to "" then set the -k flag in slurmSubmission.sh and accept that the peak calling step will fail for all samples. Alternatively, edit the peak calling step. (See https://github.com/snystrom/cutNrun-pipeline/issues/29)"

snystrom commented 2 years ago

Oh huh, I don't know why this issue is closed; that commit didn't close it, so I must've mistyped the issue number in the commit message. Check the most recent fork of this pipeline (see the 'forks' section of this repo for the McKay Lab version derived from this one). I might have still fixed this there, but I'm on mobile at the moment and can't dig into the code to confirm.

BenjaMcM commented 2 years ago

Awesome, will do. Thanks again!

mniederhuber commented 2 years ago

@jemcpher recently tested different control files with macs2, and from that test it looked like using the current genomic DNA file ("sheared DNA" in that fig) was noticeably less stringent than no control.

[Figure: genome browser comparison of MACS2 peak calls using sheared DNA, IgG, and no control as background, with total peak counts for each condition]

mniederhuber commented 2 years ago

@BenjaMcM you might also want to use / follow the pipeline here: https://github.com/mckaylabunc/cutNrun-pipeline This is now a few commits ahead of @snystrom's and is where any new features will be added.

BenjaMcM commented 2 years ago

@mniederhuber @jemcpher Oh wow, that's interesting that the sheared DNA was even less stringent. Good to know! In this browser shot the IgG looks almost identical to the no-control samples. However, based on the total peak counts, the IgG is also less stringent than no control, correct? Or am I misinterpreting something?

I double-checked which version of the pipeline I cloned, and it was the one above from mckaylabunc. I was wondering how I ended up on @snystrom's page, but the README linked to this issue in the genome_config section.

Thanks again everyone!

snystrom commented 2 years ago

The intuition for why the sheared DNA is less stringent is that it has essentially 1x coverage at all sites in the genome, except at high copy number loci. Using even coverage as the background set is essentially as liberal as you can be (the exception being threshold peak calling, which is also implemented here in callThresholdPeaks.R).
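
(For reference, threshold peak calling in its simplest form just keeps every run of positions whose coverage exceeds a cutoff. A toy sketch of the idea, not the pipeline's actual callThresholdPeaks.R implementation:)

```r
# Toy threshold peak caller: report every run of positions whose
# coverage exceeds a fixed cutoff. Illustration only.
call_threshold_peaks <- function(coverage, cutoff) {
  runs   <- rle(coverage > cutoff)
  ends   <- cumsum(runs$lengths)
  starts <- ends - runs$lengths + 1
  data.frame(start = starts[runs$values], end = ends[runs$values])
}

# Example: a made-up coverage track with two regions above a cutoff of 5
cov <- c(1, 2, 8, 9, 7, 2, 1, 6, 6, 3)
call_threshold_peaks(cov, cutoff = 5)
#>   start end
#> 1     3   5
#> 2     8   9
```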

No control uses MACS's local background estimation (I don't know the specifics, obviously, since I am long dead to the McKay lab), which, depending on how it is configured, could wind up being more "stringent" than the IgG. At least in my experience, I had relatively low signal from IgG, except it would have these "phantom peaks" that get masked out when it is used as background.

It's hard to say from the raw numbers whether the stringency is too high; looking at peaks and signal in peaks in the browser, like JM did, is the best way to determine whether the ones you're getting are high quality. For instance, if you compare the peak calls for the IgG and sheared DNA controls, they actually look quite similar, and in fact they look the "best" to me. But I bet if you looked at the difference in peak calls between the IgG background and the sheared DNA background, you would see a bunch of peaks called as "real" against the gDNA background that are actually the "phantom IgG" peaks.

BenjaMcM commented 2 years ago

@snystrom Got it, makes complete sense! Thanks for the explanation, that helps quite a bit.

mniederhuber commented 2 years ago

I'm playing around with processing some CnR data at the moment, and in addition to fiddling with duplicate filtering, I'm going to test different controls for peak calling. I'll add any relevant observations here. @snystrom you're not dead to us, we have a small gum effigy that we keep in a cabinet and make offerings to.

BenjaMcM commented 2 years ago

@snystrom @mniederhuber Kind of off-topic for this issue, but still regarding normalization: why are bigwig files normalized either to a spike-in or by RPGC, but not by both?

mniederhuber commented 2 years ago

@BenjaMcM my understanding is that the two approaches attempt to normalize for the same things (library prep, read depth) in different ways. RPGC and RPGC+zNorm use the number of reads and the genomic background, while spike-in normalization uses reads aligned to a standard amount of heterologous DNA in each sample. The idea behind the spike-in is that, since CnR has such low background, there may not be enough "background" information for standard normalization to be accurate.

Spencer can maybe confirm or deny, but from what I've seen in our lab, the spike-in doesn't seem to be appreciably better or worse than standard methods.

snystrom commented 2 years ago

Apparently there is also an error in the spike-in normalization that Alex found a fix for, so I don't know whether that makes it more useful. But I agree: even when I've done it in a way I'm sure was right, I couldn't really convince myself one way or the other that the spike-in was working correctly. We also never did controls to check spike-in performance, like making artificial gDNA libraries with different DNA ratios, so all of that work is confounded by not having a situation where we know the ground truth.

snystrom commented 2 years ago

To answer the original question: there is no advantage to performing RPGC normalization on spike-in normalized data. Both methods scale the read depth by a global scaling factor. For RPGC, that is dividing the read depth at each position by the genome size; for spike-in normalization, it is dividing by the spike-in read count (we implement it in the pipeline as multiplication by a scaling factor, but that's the same thing).

You can RPGC normalize spike-in normalized values, but the linearity between samples is unaffected. You can prove this to yourself rather easily:

RPGC = read-depth / genome size
spikeNorm = read-depth / n_spikein_reads

Therefore:

spikeNorm-RPGC = (read-depth / n_spikein_reads) / genome size

Because genome size is invariant between samples, the relationship between two samples with different spike-in read counts scales identically with or without the RPGC normalization.

i.e.

(spikeNorm_sampleA / spikeNorm_sampleB) == (spikeNorm-RPGC_sampleA / spikeNorm-RPGC_sampleB)
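
A quick numeric check of that identity (made-up numbers, using the simplified definitions above):

```r
# Made-up depths at one position in two samples, plus spike-in read counts
depth_A <- 120; depth_B <- 300
spike_A <- 1e4; spike_B <- 2e4
genome_size <- 1.2e8  # constant across samples

spikeNorm_A <- depth_A / spike_A
spikeNorm_B <- depth_B / spike_B

# RPGC (as defined above) divides both samples by the same constant...
spikeNorm_RPGC_A <- spikeNorm_A / genome_size
spikeNorm_RPGC_B <- spikeNorm_B / genome_size

# ...so the between-sample ratio is unchanged:
spikeNorm_A / spikeNorm_B                        # 0.8
all.equal(spikeNorm_A / spikeNorm_B,
          spikeNorm_RPGC_A / spikeNorm_RPGC_B)   # TRUE
```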

BenjaMcM commented 2 years ago

@mniederhuber @snystrom Great explanations! I keep underestimating how much of an impact the lack of background noise in CnR data has on normalization methods. I also didn't understand how the normalization to a spike-in was actually being performed (I assumed by a scaling factor, but I wasn't sure how the scaling factor was calculated), but this clears it up. Thanks as always!

snystrom commented 2 years ago

To clarify, lack of background signal is not the reason underlying the differences between spike-in and RPGC. I should also mention that there are two "spike-in" types commonly used in CUT&RUN.

The first is adding spike-in DNA following the MNase digestion. In the original protocol, this was done by adding a bit of yeast gDNA to the reaction stop buffer, and in more recent literature the Henikoff lab claims that residual E. coli DNA left over from the original MNase prep can also be used for this purpose. All this type of spike-in can control for is differences in library prep efficiency and, potentially, DNA release from the sample post-digestion. It cannot control for differences in antibody performance between samples.

That brings us to the second type of spike-in: adding cells or tissue from another organism in a defined ratio relative to the cells of interest at the start of the assay. This method requires that the epitope of interest also be expressed in the spiked-in cell type. It does control for differences in antibody performance between samples.

The normalization formula for either approach is the same, but the transformation has different implications for how samples can be compared. In my hands, the post-digestion spike-ins tend only to amplify differences in read depth. JM is a good person to talk to about the orthogonal-tissue spike-in work. We had some hints it was working by the time I left, but I don't know how it ended up.

BenjaMcM commented 2 years ago

@snystrom That actually brings me to another question/concern. Our lab prepared samples by adding yeast gDNA and D. yakuba wing discs, as you describe (our samples of interest are wing discs, hence the yakuba wing-disc tissue). In the pipeline, I set multiple spike-ins for normalization (yeast, E. coli, and yakuba), creating a combined genome index for bowtie. I had hoped to compare the normalized outputs to determine which method appeared to work "best" on our dataset. However, when viewing a sample's normalized bigwigs, they are identical regardless of the organism used for normalization. The only bigwigs that differ are the RPGC-normalized files.

Originally I thought this was due to an error in splitting the combined genome into each respective organism (i.e., the files were being normalized to the total spike-in read count across all spike-ins); however, the unnormalized bigwig (the _allFrags.bw file) was also identical to the normalized files.
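
For what it's worth, one way to sanity-check the per-organism split (a minimal sketch; Rsamtools, the BAM path, and the contig-name prefixes here are assumptions, not part of the pipeline):

```r
library(Rsamtools)

# Count mapped reads per contig from an indexed BAM aligned to the
# combined (experimental + spike-in) genome. The file name and the
# "yeast_" / "ecoli_" / "yak_" contig prefixes are hypothetical.
stats <- idxstatsBam("sample_combinedGenome.bam")

organism <- ifelse(grepl("^yeast_", stats$seqnames), "yeast",
            ifelse(grepl("^ecoli_", stats$seqnames), "ecoli",
            ifelse(grepl("^yak_",   stats$seqnames), "yakuba", "dmel")))

# If these per-organism totals differ but the "normalized" bigwigs are
# identical, the scaling factor is probably never being applied.
tapply(stats$mapped, organism, sum)
```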

Do you have any thoughts as to what could be occurring? I know this is more of a technical concern, and you haven't worked on the pipeline in a while, so this may be better suited to @mniederhuber.

snystrom commented 2 years ago

I think this is the bug Alex discovered. Strangely, if you run the command found in the Snakefile interactively, it works as expected; or at least that's what Alex found (see #44 for more details, although it's not very descriptive).

BenjaMcM commented 2 years ago

@snystrom I was wondering if it was the same issue Alex had. I wanted to make sure it wasn't user error before I went around calling it a bug 😂. I'll ask Alex if she has a patch for it!

snystrom commented 2 years ago

If you get the patch, please pass it on; I'm very curious why it fails.

BenjaMcM commented 2 years ago

Absolutely, will do!

mniederhuber commented 2 years ago

Sorry for the delay. I asked Alex about it. This is still an unresolved bug. Alex has moved away from using spike-in normalization, so this issue got lost in the shuffle. I'm trying to find out whether we have any more details about the cause.

snystrom commented 2 years ago

If you have Alex's copy of her Snakefile, you could try diffing it and see whether the spike-in step is changed.

BenjaMcM commented 2 years ago

After discussing here, as well as with Alex, we have decided against using any spike-in normalization and plan to stick with RPGC-normalized files (at least for now). However, I am also performing a differential peak analysis with featureCounts and DESeq2, using MACS2 peak files to create the featureCounts annotation for unnormalized bams. We plan to use genome browser shots of RPGC-normalized bigwigs, and I want to ensure that the browser shots and the differential peak analysis use similarly normalized data, to keep everything consistent.

Since we provide MACS2 with the -g flag, is MACS2 performing RPGC normalization? If not, would it make sense to RPGC-normalize the bed (or bam) files and then run MACS2? The end goal is simply to create "normalized" peak files that could be used to generate the featureCounts annotation, matching the bigwig browser shots. Since DESeq2 prefers unnormalized data, I would still use unnormalized bams as the featureCounts input, but would have annotations for normalized peak coordinates.

Hopefully these questions make sense, but I'm happy to clarify if not!

snystrom commented 2 years ago

There's no advantage to providing additional normalization beyond what the MACS algorithm is already doing. Since you are doing the MACS -> DESeq2 thing, you can think of MACS more as a filter for regions of interest, and DESeq2 as the determinant of significance. In practice, you may also find that prefiltering your peak calls is useful, to limit the results further to peaks that look real. Importantly, you want to use the same number of peaks from each condition to generate the union peak set. You can see an example of this approach implemented here, where I used the MACS q-value as a filter. Tbh I wouldn't suggest that as a good default approach in the future, but it did what I wanted for those data.

Finally, I think using peaks called from pooled reads across all replicates is what you really want here, since you're trying to cast a wide net; this will be the most sensitive approach to peak detection for each condition. You will have to assess for yourself whether the read depths between conditions are highly imbalanced, in which case I might suggest deeper sequencing. For instance, if one of two conditions has half the peak calls but also half the total number of reads, that's a sign you need more depth. A sketch of the whole flow is below.
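
Putting those pieces together, a sketch of the union-peak -> featureCounts -> DESeq2 flow (the file names, two-condition design, and read_narrowPeak helper are hypothetical, and I've omitted the per-condition peak-number matching mentioned above):

```r
library(GenomicRanges)
library(Rsubread)
library(DESeq2)

# Read MACS2 narrowPeak calls (0-based, half-open) into GRanges.
# Hypothetical helper; the file names below are made up.
read_narrowPeak <- function(path) {
  df <- read.table(path)
  GRanges(df$V1, IRanges(df$V2 + 1, df$V3))
}

# Union peak set across conditions (peaks called from pooled replicates)
peaks <- reduce(c(read_narrowPeak("ctrl_pooled_peaks.narrowPeak"),
                  read_narrowPeak("mut_pooled_peaks.narrowPeak")))

# SAF-format annotation for featureCounts
saf <- data.frame(GeneID = paste0("peak_", seq_along(peaks)),
                  Chr    = as.character(seqnames(peaks)),
                  Start  = start(peaks),
                  End    = end(peaks),
                  Strand = "*")

# Count *unnormalized* reads per peak; DESeq2 then performs its own
# library-size normalization internally.
fc <- featureCounts(files = c("ctrl_rep1.bam", "ctrl_rep2.bam",
                              "mut_rep1.bam",  "mut_rep2.bam"),
                    annot.ext = saf, isPairedEnd = TRUE)

coldata <- data.frame(condition = factor(c("ctrl", "ctrl", "mut", "mut")))
dds <- DESeqDataSetFromMatrix(fc$counts, coldata, design = ~ condition)
res <- results(DESeq(dds))
```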

Hope that helps.

BenjaMcM commented 2 years ago

Very helpful!

I have been using peaks called from pooled reads, so at least that's already taken care of 😂. Our main concern is one sample that has a similar peak profile (and peak count) to the other reps, but much lower signal. In genome browser shots this is somewhat accounted for by RPGC normalization, but I wanted to be sure we were also properly normalizing any inputs to featureCounts/DESeq2. In theory, since MACS2 calls peaks using relative signal, I'm not even sure that normalizing before running MACS2 would make much of a difference anyway.

Regardless, we just wanted to be sure we weren't going to raise any eyebrows by calling differential peaks on unnormalized files and then showing RPGC-normalized browser shots, which it seems is not a problem.

Thank you again!

mniederhuber commented 2 years ago

Sorry for the slow response. To echo Spencer, MACS isn't doing any kind of normalization of the data. In our pipeline, the effective genome size is used to model the likelihood of local fragment occurrence (lambda-bg and lambda-local).
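
For intuition, lambda-bg is roughly the fragment pileup you'd expect at a random position if coverage were uniform (a back-of-envelope sketch; all the numbers here are made up):

```r
# Back-of-envelope lambda-bg: the genome-wide Poisson rate MACS2 models,
# i.e. the expected fragment pileup at any one base under uniform coverage.
total_fragments <- 2e7    # mapped fragments in the treatment (made up)
fragment_length <- 200    # MACS2's estimated fragment size ("d")
genome_size     <- 1.2e8  # effective genome size for -g dm

lambda_bg <- total_fragments * fragment_length / genome_size  # ~33.3

# lambda-local is then (roughly) the max of lambda-bg and the rates
# estimated in windows around each candidate peak (e.g. 1 kb / 10 kb),
# so regions sitting in locally noisy background need more signal to
# reach significance.
lambda_bg
```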

MACS does linearly scale treatment to control (or to whichever dataset is "smaller"), but otherwise I don't think you would want to use any kind of normalized data when calling peaks.

Same goes for DESeq2: I would rely on the software to determine significant enrichment between conditions and not pre-normalize any data. But I'm less versed in the inner workings of DESeq2.

I think normalization is most relevant when visualizing differences between conditions in a browser, or perhaps when plotting relative signal with heatmaps (though some software, like deeptools, will take unnormalized bam files for that kind of analysis).

BenjaMcM commented 2 years ago

Makes sense, thank you!

I've been reading more and more on MACS, and no matter how much I read, I still feel like it's a "black box". This is really helping to clarify its inner workings.