macs3-project / MACS

MACS -- Model-based Analysis of ChIP-Seq
https://macs3-project.github.io/MACS/
BSD 3-Clause "New" or "Revised" License
698 stars 268 forks source link

Q: How to incorporate ChIP-seq spike in normalisation into peak calling? #356

Open tudumanne opened 4 years ago

tudumanne commented 4 years ago

Hello, I have been trying to figure out how to incorporate external spike-in normalisation factors to both single-end and paired-end ChIP-seq data for peak calling with MACS2.

I tried peak calling through sub commands - by scaling ChIP and control samples using 'bdgopt' subcommand based on calculated spike-in normalisation factor. However, at the moment "Advanced: Call peaks using MACS2 subcommands" tutorial provides instruction only for single-end data.

What would be the best way to perform this type of analysis for paired end data?

Thank you very much in advance.

taoliu commented 4 years ago

@tudumanne Thanks for asking the question! When I wrote the step-by-step tutorial to call peaks using subcommands in 2017, MACS2 can't do pileup subcommand on paired-end data. From v2.1.2, MACS2 fully supports paired-end data in subcommand. I will update the tutorial when I get time. In brief, you just need to skip the step2, and in step3, do macs2 pileup -f BAMPE ... or macs2 pileup -f BEDPE ... to get the pileup signal tracks. Then the following steps would be the same.

tudumanne commented 4 years ago

@taoliu Thanks a lot for your reply.

taoliu commented 4 years ago

@tudumanne You are welcome!

olechnwin commented 4 years ago

I feel like my question belong here. Sorry if it isn't.

I'm trying to incorporate spike-in normalization for paired-end as OP. I've done step 1-3, skipping step 2 from the sub commands tutorial.

However, in step 4, after a background noise track of slocal local window was created, the next step is to normalize the 1kb noise by multiplying the values by d/slocal. Since I am skipping step 2, I don't have d. So, my question, what do I put in for d/slocal?

I'm copying the steps I've done so far:

Filter duplicates macs2 filterdup -f BAMPE -i ${prefix}.sorted.bam --keep-dup=1 -o ${prefix}_filterdup.bed macs2 filterdup -f BAMPE -i ${prefix_ctrl}.sorted.bam --keep-dup=1 -o ${prefix_ctrl}_filterdup.bed

Generate pileup tracks for ChIP sample: macs2 pileup -f BEDPE -i ${prefix}_filterdup.bed -o ${prefix}_filterdup.pileup.bdg

Create background noise track:

d background macs2 pileup -f BEDPE -B -i ${prefix_ctrl}_filterdup.bed -o ${prefix_ctrl}_filterdup_d_bg.pileup.bdg

slocal background macs2 pileup -f BEDPE -B --extsize 500 -i ${prefix_ctrl}_filterdup.bed -o ${prefix_ctrl}_filterdup_1k_bg.pileup.bdg

Next come the step to normalize 1kb noise which I'm not sure what to put forp. This is the command from the tutorial: macs2 bdgopt -i 1k_bg.bdg -m multiply -p 0.254 -o 1k_bg_norm.bdg

Thank you in advance for your help!

olechnwin commented 4 years ago

Hi @taoliu,

I have two questions:

  1. Can I get the fragment size (d) from the output of callpeak on the ChIP file? So run the following command, where ${prefix}.bam is the bam file of ChIP sample and ${prefix_ctrl} is the bam file of my control sample and get d from the output? macs2 callpeak -t ${prefix}.bam -c ${prefix_ctrl}.bam -f BAMPE -g hs -B --keep-dup all

  2. In the advance tutorial, step 5 scaling ChIP and control to the same sequencing depth , it was recommended to scale down. So if the number of unique reads for ChIP and control are 199867 and 199583, so the ChIP bias is the one that have to be scaled down, by multiplying with the ratio between control and ChIP which is 199583/199867=.99858. Is this the correct way to scale down the ChIP bias? macs2 bdgopt -i ${prefix}_filterdup.pileup.bdg -m multiply -p .99858 -o ${prefix}_filterdup.scale.pileup.bdg

Thank you in advance for your help!

taoliu commented 4 years ago

@olechnwin

  1. Yes. You can get the d by running macs2 callpeak. Or follow the step 1 and step 2 of the advanced tutorial.
  2. Right.
olechnwin commented 4 years ago

@taoliu

Just want to make sure. The d I was referring above is for paired-end reads. So, I can run macs2 callpeak to get the d for paired-end reads? Thanks!

taoliu commented 4 years ago

Yes. macs2 callpeak will give you the average insertion length of unique pairs in your PE library.

olechnwin commented 4 years ago

Thank you so much for your clarification. Appreciate it!

olechnwin commented 4 years ago

@taoliu ,

There is something I don't understand. I tried running the subcommands above for the exact same file for treatment and control i.e. sample1.bam. At the end, none of the peaks passed pvalue threshold of 0.01 (-log10pvalue=2) which I guess is expected.

What I don't get is when I use the same file sample1.bam as input to callpeak: macs2 callpeak -t sample1.bam -c sample1.bam -f BAMPE -g hs -n test_same -B --keep-dup 1 -p 1e-2

This time, it returned ~12k peaks passing threshold. I would think the result should be the same whether or not you run the subcommands or callpeak. Granted the subcommands run have spike-in normalization but since it's the same exact file were set as treatment and control, I would think the spike-in normalization does not matter.

Can you please give some insights on why this is happening? Thank you in advance!

taoliu commented 4 years ago

@olechnwin Could you provide the entire pipeline on how you ran the subcommands? But based on your reply (https://github.com/taoliu/MACS/issues/356#issuecomment-603890497) I noticed that there are some differences between your subcommands and callpeak process. Let me explain... (it's a bit long...)

1) In callpeak pipeline, the background noise track won't be constructed by piling up the 'fragment' (i.e. in BEDPE or BAMPE mode) of input. The idea is that in the treatment data, the whole fragments provide the information on where the protein binds, however in the control data, the insertion sites or cutting sites provide the information on where the background noise comes from. So you'd better follow the instruction in the wiki, although it's for single-end data, to construct the background noise signals. Also, in callpeak pipeline, both ends of the pair in control data will be counted, in other words, the control PE data will be treated as SE and if you have N pairs in control, you are piling up the control as 2*N reads of a SE data.

2) For each background noise you generated, for the slocal, llocal, and genome background, you have to normalize them to the same coverage as the d background noise. For example: If you have N pairs in control, then you, in fact, have 2N read ends. When generating the slocal background noise, each read end will be extended to 1kbp long fragment and be piled up. Then the total coverage is 2*N*1000. You have to scale the coverage to the same as the d background noise, as doing bdgopt -m multiply -p {d/1000} on the slocal background noise. Then follow "Combine and generate the maximum background noise" and "Step 5: Scale the ChIP and control to the same sequencing depth" to get the final background noise track. Finally, remember that we are treating control as SE data, the coverage of final background noise track is assumed to be 2*N*d. So if your treatment PE data has M pairs, the coverage of treatment is assumed to be M*d, then normalization factor on control signals is M/2*N.

3) When I look back at the source codes, I think there might be a caveat in PE mode when 2N > M > N. In this case, my expected behavior should be scaling control (down) to the same coverage as treatment, but the reality may be that treatment is scaling (up) to the same coverage as control (2N). I need to do some experiments to confirm that.

olechnwin commented 4 years ago

`@ taoliu, Thank you so much for looking into this. Sorry for the delay. I got distracted... See below for my pipeline. If I understand you correctly, you were suggesting to do the subcommands instead of callpeak? Is there a way to get pileup and fold enrichment when doing the subcommand?

My pipeline, some commands were broken up into multiple lines for easy viewing:


# specify read length
read_len=150

# specify peaks cut off
log_qval_cutoff=1.301
log_pval_cutoff=2

# convert sam to bam for ChIP sample
samtools view -S -b $file | samtools sort - -o ${prefix}.sorted.bam
# remove duplicates
macs2 filterdup -f BAMPE -i ${prefix}.sorted.bam --keep-dup=1 -o ${prefix}_filterdup.bed

# convert sam to bam for control sample
samtools view -S -b $ctrl_file | samtools sort - -o ${ctrl_prefix}.sorted.bam
# remove duplicates
macs2 filterdup -f BAMPE -i ${ctrl_prefix}.sorted.bam --keep-dup=1 
    -o ${ctrl_prefix}_filterdup.bed

## generate ChIP coverage tracks
# get d (fragment size) from each peak file then calculate p normalization factor
# for background
d=$(cat $peaks_dir/${prefix}_peaks.xls | awk '/# d/ {print $4}')
d2=$(echo $d/2 | bc)
macs2 pileup -f BEDPE -B -i ${prefix}_filterdup.bed 
     -o ${prefix}_filterdup.pileup.bdg --extsize $d

## build local background track from control
# d background
macs2 pileup -f BEDPE -B -i ${ctrl_prefix}_filterdup.bed 
    -o ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg --extsize $d2
# slocal background
macs2 pileup -f BEDPE -B --extsize 500 -i ${ctrl_prefix}_filterdup.bed 
     -o ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg
# normalize the small local (slocal) background
p=$(echo "scale=4;"$d/1000 | bc)
macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg -m multiply -p $p 
    -o ${ctrl_prefix}_${prefix}_filterdup_1k_bg_norm.bdg
# large local background default 10k (--extsize 5000)
macs2 pileup -f BEDPE -i ${ctrl_prefix}_filterdup.bed -B --extsize 5000 
    -o ${ctrl_prefix}_${prefix}_filterdup_10k_bg.bdg
# normalize the large local (llocal) background
p=$(echo "scale=4;"$d/10000 | bc)
macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_10k_bg.bdg -m multiply -p $p 
    -o ${ctrl_prefix}_${prefix}_filterdup_10k_bg_norm.bdg
# combine and generate maximum background noise
macs2 bdgcmp -m max -t ${ctrl_prefix}_${prefix}_filterdup_1k_bg_norm.bdg 
    -c ${ctrl_prefix}_${prefix}_filterdup_10k_bg_norm.bdg 
    -o ${ctrl_prefix}_${prefix}_filterdup_1k_10k_bg_norm.bdg
# Then, take the maximum then by comparing with d background:
macs2 bdgcmp -m max -t ${ctrl_prefix}_${prefix}_filterdup_1k_10k_bg_norm.bdg 
    -c ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg 
    -o ${ctrl_prefix}_${prefix}_filterdup_d_1k_10k_bg_norm.bdg

## finally combine with genome wide background
#p=the_number_of_control_reads*fragment_length/genome_size
ctrl_num_read=$(wc -l ${ctrl_prefix}_filterdup.bed | awk '{print $1}') 
genome_size=2700000000
p=$(echo "scale=4;"$ctrl_num_read*$d/$genome_size | bc)
echo $ctrl_num_read $genome_size $p
macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_d_1k_10k_bg_norm.bdg -m max -p $p
    -o ${ctrl_prefix}_${prefix}_filterdup_local_bias_raw.bdg        

# scaling chip and control based on spike in
treat_num_read=$(wc -l ${prefix}_filterdup.bed | awk '{print $1}')
treat_spike_num_read=$(wc -l ${prefix}.ecoli_filterdup.bed | awk '{print $1}')
ctrl_spike_num_read=$(wc -l ${ctrl_prefix}.ecoli_filterdup.bed | awk '{print $1}')
min_spike_num_read=$(wc -l *ecoli_filterdup.bed | sort -nk 1 | head -n 1 | awk '{print $1}')
scale_factor_spike_treat=$(echo "scale=4;"$min_spike_num_read/$treat_spike_num_read | bc)
scale_factor_spike_ctrl=$(echo "scale=4;"$min_spike_num_read/$ctrl_spike_num_read | bc)
p=$(echo "scale=4;"$scale_factor_spike_treat/$scale_factor_spike_ctrl | bc)
echo spike in scale factor=$p
if (( $(echo $p '>' 1 |bc -l) )); then
    p=$(echo "scale=10;"1/$p | bc)
    echo scale down control...
    # scale down control
    $run macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_local_bias_raw.bdg -m multiply -p $p \
            -o ${ctrl_prefix}_${prefix}_filterdup_local_lambda.bdg
    $run macs2 bdgopt -i ${prefix}_filterdup.pileup.bdg -m multiply -p 1 -o ${prefix}_filterdup_scale.pileup.bdg
    echo scaled spike-in control= $(echo $p*$ctrl_spike_num_read | bc)
    echo scaled spike-in treat= $(echo $treat_spike_num_read | bc)
else
    echo scale down treatment
    # scale down treatment
    $run macs2 bdgopt -i ${prefix}_filterdup.pileup.bdg -m multiply -p $p -o ${prefix}_filterdup_scale.pileup.bdg
    $run macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_local_bias_raw.bdg -m multiply -p 1 \
            -o ${ctrl_prefix}_${prefix}_filterdup_local_lambda.bdg
    echo scaled spike-in control= $(echo $p*$treat_spike_num_read | bc)
    echo scaled spike-in treat= $(echo $ctrl_spike_num_read | bc)
fi

# compare chip and control and calculate q and pvalue
macs2 bdgcmp -t ${prefix}_filterdup_scale.pileup.bdg 
       -c  ${ctrl_prefix}_${prefix}_filterdup_local_lambda.bdg -m qpois 
        -o ${prefix}_qvalue.bdg
macs2 bdgcmp -t ${prefix}_filterdup_scale.pileup.bdg 
       -c  ${ctrl_prefix}_${prefix}_filterdup_local_lambda.bdg -m ppois 
       -o ${prefix}_pvalue.bdg
#Call peaks on score track using a cutoff
macs2 bdgpeakcall -i ${prefix}_qvalue.bdg -c $log_qval_cutoff -l $d -g $read_len 
        -o ${prefix}_qval_peaks.bed
macs2 bdgpeakcall -i ${prefix}_pvalue.bdg -c $log_pval_cutoff -l $d -g $read_len 
        -o ${prefix}_pval_peaks.bed
`

edited: to fix mixed up scaling.
p.s: with `-f BEDPE`, you don't really need to specify `--extsize` anymore.
JohnMCMa commented 4 years ago

I noticed a further problem with this pipeline, The pileup subcommand will ignore the --extsize and -B arguments is f is either BAMPE or BEDPE. How is it even possible, then, to get the slocal and llocal?

olechnwin commented 4 years ago

@JohnMCMa Thanks for pointing this out. I didn't even notice that pileup will ignore the --extsize and -B. I compared the .bdg out with and without these options and confirmed that they are exactly the same. @taoliu can speak more on this. But briefly looking in IGV, it seemed the filterdup step correctly generated a bed that covers the entire pair-reads.

JohnMCMa commented 4 years ago

@olechnwin I looked at the code again, and I think what is done in callpeak --which at least make sense if we're dealing with BAMPE--is by calculating lambda as if the control BAM was single-ended, and then divide the lambda by 2.

In that case, I think an approximation that in the advanced pipeline would be:

  1. Make a single-ended BED file from the control BAM, for example by bedtools bamtobed.
  2. The BED file in (1) is used in all occasions of macs2 pileup during the lambda determination step.
  3. Before scaling, use macs2 bdgopt to halve all lambda values.

I'm not completely confident, however, if this is what @taoliu intended.

taoliu commented 4 years ago

@JohnMCMa Thanks for your input! You are totally right :)

In our regular callpeak command, if PE data is used, signals for the treatment/ChIP track will be computed by piling up 'fragments' of read pairs between mapped ends; whereas the noises for the control/input track will be computed by extending each end with 1/2 of slocal or 1/2 of llocal value to both left and right direction, then pilling them up. The idea is based on the following assumption. The paired ends from treatment/ChIP data are surrounding the binding sites, so the fragments between ends are where the factor binds to. On the other hand, we assume that both ends from control/input are from those biased genomic locations (e.g. highly accessible sites), so the impact of genomic bias should center at the mapping location of each end.

olechnwin commented 4 years ago

So....I am confused. Which part of the pipeline should be changed?

taoliu commented 4 years ago

@olechnwin Sorry for the confusion! I am discussing the part for ## build local background track from control when you pile up the control data.

# d background
macs2 pileup -f BEDPE -B -i ${ctrl_prefix}_filterdup.bed 
    -o ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg --extsize $d2
# slocal background
macs2 pileup -f BEDPE -B --extsize 500 -i ${ctrl_prefix}_filterdup.bed 
     -o ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg
# large local background default 10k (--extsize 5000)
macs2 pileup -f BEDPE -i ${ctrl_prefix}_filterdup.bed -B --extsize 5000 
    -o ${ctrl_prefix}_${prefix}_filterdup_10k_bg.bdg

But let me state clearly. I am not suggesting that you have to replicate the same process in callpeak and this is the beauty of your pipeline since you can control each step. As I mentioned, the default behavior pile up control data as single-end is based on some assumption. And if you have a good reason to pile up control data as paired-end, feel free to do so. My guess is that the difference will be small.

If you want to pile up control data as single-end like callpeak, then the ${ctrl_prefix}_filterdup.bed file has to be broken down into single end BED file like:

awk -v OFS="\t" -v RL=$read_len '{print $1,$2,$2+RL,".",".","+\n"$1,$3-RL,$3,".",".","-"}' \
  ${ctrl_prefix}_filterdup.bed > ${ctrl_prefix}_filterdup.as_single_end.bed

Then:

# d background
macs2 pileup -f BED -B -i ${ctrl_prefix}_filterdup.as_single_end.bed \
  -o ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg --extsize $d2
# slocal background
macs2 pileup -f BED -B -i ${ctrl_prefix}_filterdup.as_single_end.bed \
  -o ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg --extsize 500
# large local background default 10k (--extsize 5000)
macs2 pileup -f BED -B -i ${ctrl_prefix}_filterdup.as_single_end.bed \
  -o ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.bdg --extsize 5000
# divide by 2
awk -v OFS="\t" '{print $1,$2,$3/2}' ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg \
  > ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.halved.bdg
mv ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.halved.bdg  ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg

awk -v OFS="\t" '{print $1,$2,$3/2}' ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg \
  > ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.halved.bdg; 
mv ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.halved.bdg  ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg

awk -v OFS="\t" '{print $1,$2,$3/2}' ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.bdg \
  > ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.halved.bdg; 
mv ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.halved.bdg  ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.bdg

After that, just keep using the same code in your pipeline. (btw, I didn't test my code, so there may be a bug...)

olechnwin commented 4 years ago

@taoliu , Now I get it. Thank you so much for taking the time to clarify. Really appreciate it.

cool-genes commented 4 years ago

@olechnwin

Thank you for sharing your pipeline for the paired end MACS2 subcommands and Chip-Seq spike in normalization – this was immensely helpful.

I have a clarification question re: your spike-in normalization code:

When you end up with a lower number of spike-in reads in ‘treat_spike_num_read’ vs ‘ctrl_spike_num_read’ won’t the calculated p from your script be >1? In that case, shouldn’t you scale down the associated control reads and not the treatment file? It looks like your loop would scale down the treatment based on the $p '>' 1 condition. Perhaps I’ve missed something?

Thanks in advance!

olechnwin commented 4 years ago

@cool-genes,

You are so right! I mixed up the scaling. I didn't fix the pipeline when I realized this. Didn't know other people pay attention to this. I've edited the pipeline. But I do wish you noticed this before I did. That would save a lot of troubles :-) BTW, if you read Tao Liu's comment you'll see that you don't need --extsize with -f BEDPE since it'll be ignored.

olechnwin commented 4 years ago

@taoliu , as I was going to change the commands to remove --extsize since it'll be ignored with -f BEDPE, I'm not sure what should I do with p in the following command where it used to be d/10000?

$run macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_10k_bg.bdg -m multiply -p $p -o ${ctrl_prefix}_${prefix}_filterdup_10k_bg_norm.bdg

Also, does that mean I don't have slocal and llocal anymore?

ClaireXinSun commented 1 year ago

@taoliu I am using deeptools bamcovergae $sfactor to get the normalised bed files from bam files of some cut&run assays. Then I used the normalised bed files as the input in macs. Not sure whether it's right or not. -f BEDPE didn't work, only -f BED --nomodel --extsize 200 --shift -100 -B.

annsophiegironne commented 1 year ago

@olechnwin, thank you so much for sharing your pipeline! It helped me a lot to understand how to do this with the tutorial (Thank you @taoliu )

I have one question about your pipeline however; where does the ecoli_filterdup.bed file come from? I assume this is your spike-in data. I'm doing chip-seq analysis with spike-in normalization for the first time and I have to admit, I am a bit confused as to how it works exactly. What did you do with it? And did you align to a dual genome or to each genome individually? Thanks a lot!

olechnwin commented 1 year ago

@annsophiegironne,

ecoli is my spike-in. I aligned my reads to each genome individually. ecoli_filterdup.bed was the output from running macs2 filterdup.

DavideBrex commented 5 months ago

@annsophiegironne,

ecoli is my spike-in. I aligned my reads to each genome individually. ecoli_filterdup.bed was the output from running macs2 filterdup.

Hi, thank you for providing the pipeline! I was curious to ask you if you ever compared the output of spike-in normalised peak calling with normal peak calling and if you noticed any substantial differences. This would be really interesting! Also, did you use this pipeline in any publication? Thank you!

Davide