tudumanne opened this issue 4 years ago
@tudumanne Thanks for asking the question! When I wrote the step-by-step tutorial on calling peaks using subcommands in 2017, MACS2 couldn't run the pileup subcommand on paired-end data. From v2.1.2, MACS2 fully supports paired-end data in the subcommands. I will update the tutorial when I get time. In brief, you just need to skip step 2, and in step 3, run macs2 pileup -f BAMPE ... or macs2 pileup -f BEDPE ... to get the pileup signal tracks. The following steps are then the same.
@taoliu Thanks a lot for your reply.
@tudumanne You are welcome!
I feel like my question belongs here. Sorry if it doesn't.
I'm trying to incorporate spike-in normalization for paired-end data, as the OP did. I've done steps 1-3 from the subcommands tutorial, skipping step 2. However, in step 4, after the background noise track over the slocal window is created, the next step is to normalize the 1kb noise by multiplying the values by d/slocal. Since I skipped step 2, I don't have d.
So my question is: what do I put in for d/slocal?
I'm copying the steps I've done so far:
Filter duplicates
macs2 filterdup -f BAMPE -i ${prefix}.sorted.bam --keep-dup=1 -o ${prefix}_filterdup.bed
macs2 filterdup -f BAMPE -i ${prefix_ctrl}.sorted.bam --keep-dup=1 -o ${prefix_ctrl}_filterdup.bed
Generate pileup tracks for ChIP sample:
macs2 pileup -f BEDPE -i ${prefix}_filterdup.bed -o ${prefix}_filterdup.pileup.bdg
Create background noise track:
d background
macs2 pileup -f BEDPE -B -i ${prefix_ctrl}_filterdup.bed -o ${prefix_ctrl}_filterdup_d_bg.pileup.bdg
slocal background
macs2 pileup -f BEDPE -B --extsize 500 -i ${prefix_ctrl}_filterdup.bed -o ${prefix_ctrl}_filterdup_1k_bg.pileup.bdg
Next comes the step to normalize the 1kb noise, and I'm not sure what to put in for p.
This is the command from the tutorial:
macs2 bdgopt -i 1k_bg.bdg -m multiply -p 0.254 -o 1k_bg_norm.bdg
Thank you in advance for your help!
Hi @taoliu,
I have two questions:
Can I get the fragment size (d) from the output of callpeak on the ChIP file? That is, run the following command, where ${prefix}.bam is the bam file of the ChIP sample and ${prefix_ctrl}.bam is the bam file of my control sample, and get d from the output?
macs2 callpeak -t ${prefix}.bam -c ${prefix_ctrl}.bam -f BAMPE -g hs -B --keep-dup all
In the advanced tutorial, step 5 (scaling ChIP and control to the same sequencing depth), it is recommended to scale down. So if the numbers of unique reads for ChIP and control are 199867 and 199583, the ChIP bias is the one that has to be scaled down, by multiplying by the ratio of control to ChIP, which is 199583/199867 = 0.99858. Is this the correct way to scale down the ChIP bias?
macs2 bdgopt -i ${prefix}_filterdup.pileup.bdg -m multiply -p .99858 -o ${prefix}_filterdup.scale.pileup.bdg
Thank you in advance for your help!
@olechnwin You can get d by running macs2 callpeak, or follow step 1 and step 2 of the advanced tutorial.
@taoliu,
Just want to make sure: the d I was referring to above is for paired-end reads. So I can run macs2 callpeak to get the d for paired-end reads?
Thanks!
Yes. macs2 callpeak will give you the average insertion length of unique pairs in your PE library.
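For reference, a minimal sketch of pulling d out of the callpeak output; the filename and header line below are stand-ins for a real ${prefix}_peaks.xls, which contains a comment line of the form "# d = ...":

```shell
# Stand-in for a real peaks.xls produced by callpeak
printf '# This file is generated by MACS\n# d = 230\n' > demo_peaks.xls

# Extract the fragment size from the "# d = ..." header line
d=$(awk '/^# d = / {print $4}' demo_peaks.xls)
echo "$d"
```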
Thank you so much for your clarification. Appreciate it!
@taoliu,
There is something I don't understand. I tried running the subcommands above with the exact same file, sample1.bam, as both treatment and control. At the end, none of the peaks passed the p-value threshold of 0.01 (-log10 p-value = 2), which I guess is expected.
What I don't get is that when I use the same file sample1.bam as input to callpeak:
macs2 callpeak -t sample1.bam -c sample1.bam -f BAMPE -g hs -n test_same -B --keep-dup 1 -p 1e-2
this time it returned ~12k peaks passing the threshold. I would think the result should be the same whether you run the subcommands or callpeak.
Granted, the subcommand run has spike-in normalization, but since the exact same file was set as treatment and control, I would think the spike-in normalization does not matter.
Can you please give some insight on why this is happening? Thank you in advance!
@olechnwin Could you provide the entire pipeline for how you ran the subcommands? Based on your reply (https://github.com/taoliu/MACS/issues/356#issuecomment-603890497) I noticed that there are some differences between your subcommands and the callpeak process. Let me explain... (it's a bit long...)
1) In the callpeak pipeline, the background noise track won't be constructed by piling up the 'fragments' (i.e. in BEDPE or BAMPE mode) of the input. The idea is that in the treatment data, the whole fragments provide the information on where the protein binds, whereas in the control data, the insertion sites or cutting sites provide the information on where the background noise comes from. So you'd better follow the instructions in the wiki, although they're for single-end data, to construct the background noise signals. Also, in the callpeak pipeline, both ends of each pair in the control data will be counted; in other words, the control PE data will be treated as SE, and if you have N pairs in control, you are piling up the control as 2*N reads of SE data.
2) Each background noise track you generate, for the slocal, llocal, and genome background, has to be normalized to the same coverage as the d background noise. For example: if you have N pairs in control, then you in fact have 2N read ends. When generating the slocal background noise, each read end will be extended to a 1kbp-long fragment and piled up, so the total coverage is 2*N*1000. You have to scale the coverage to the same as the d background noise, by running bdgopt -m multiply -p {d/1000} on the slocal background noise. Then follow "Combine and generate the maximum background noise" and "Step 5: Scale the ChIP and control to the same sequencing depth" to get the final background noise track. Finally, remember that since we are treating control as SE data, the coverage of the final background noise track is assumed to be 2*N*d. So if your treatment PE data has M pairs, the coverage of treatment is assumed to be M*d, and the normalization factor on the control signals is M/(2*N).
3) Looking back at the source code, I think there might be a caveat in PE mode when 2N > M > N. In this case, my expected behavior would be to scale the control down to the same coverage as the treatment, but the reality may be that the treatment is scaled up to the same coverage as the control (2N). I need to do some experiments to confirm that.
@taoliu, Thank you so much for looking into this. Sorry for the delay; I got distracted... See below for my pipeline. If I understand you correctly, you were suggesting to do the subcommands instead of callpeak? Is there a way to get pileup and fold enrichment when doing the subcommands?
My pipeline, some commands were broken up into multiple lines for easy viewing:
# specify read length
read_len=150
# specify peaks cut off
log_qval_cutoff=1.301
log_pval_cutoff=2
# convert sam to bam for ChIP sample
samtools view -S -b $file | samtools sort - -o ${prefix}.sorted.bam
# remove duplicates
macs2 filterdup -f BAMPE -i ${prefix}.sorted.bam --keep-dup=1 -o ${prefix}_filterdup.bed
# convert sam to bam for control sample
samtools view -S -b $ctrl_file | samtools sort - -o ${ctrl_prefix}.sorted.bam
# remove duplicates
macs2 filterdup -f BAMPE -i ${ctrl_prefix}.sorted.bam --keep-dup=1 \
  -o ${ctrl_prefix}_filterdup.bed
## generate ChIP coverage tracks
# get d (fragment size) from each peak file then calculate p normalization factor
# for background
d=$(awk '/# d/ {print $4}' $peaks_dir/${prefix}_peaks.xls)
d2=$(echo $d/2 | bc)
macs2 pileup -f BEDPE -B -i ${prefix}_filterdup.bed \
  -o ${prefix}_filterdup.pileup.bdg --extsize $d
## build local background track from control
# d background
macs2 pileup -f BEDPE -B -i ${ctrl_prefix}_filterdup.bed \
  -o ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg --extsize $d2
# slocal background
macs2 pileup -f BEDPE -B --extsize 500 -i ${ctrl_prefix}_filterdup.bed \
  -o ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg
# normalize the small local (slocal) background
p=$(echo "scale=4;"$d/1000 | bc)
macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg -m multiply -p $p \
  -o ${ctrl_prefix}_${prefix}_filterdup_1k_bg_norm.bdg
# large local background default 10k (--extsize 5000)
macs2 pileup -f BEDPE -i ${ctrl_prefix}_filterdup.bed -B --extsize 5000 \
  -o ${ctrl_prefix}_${prefix}_filterdup_10k_bg.bdg
# normalize the large local (llocal) background
p=$(echo "scale=4;"$d/10000 | bc)
macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_10k_bg.bdg -m multiply -p $p \
  -o ${ctrl_prefix}_${prefix}_filterdup_10k_bg_norm.bdg
# combine and generate maximum background noise
macs2 bdgcmp -m max -t ${ctrl_prefix}_${prefix}_filterdup_1k_bg_norm.bdg \
  -c ${ctrl_prefix}_${prefix}_filterdup_10k_bg_norm.bdg \
  -o ${ctrl_prefix}_${prefix}_filterdup_1k_10k_bg_norm.bdg
# Then, take the maximum by comparing with the d background:
macs2 bdgcmp -m max -t ${ctrl_prefix}_${prefix}_filterdup_1k_10k_bg_norm.bdg \
  -c ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg \
  -o ${ctrl_prefix}_${prefix}_filterdup_d_1k_10k_bg_norm.bdg
## finally combine with genome wide background
#p=the_number_of_control_reads*fragment_length/genome_size
ctrl_num_read=$(wc -l ${ctrl_prefix}_filterdup.bed | awk '{print $1}')
genome_size=2700000000
p=$(echo "scale=4;"$ctrl_num_read*$d/$genome_size | bc)
echo $ctrl_num_read $genome_size $p
macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_d_1k_10k_bg_norm.bdg -m max -p $p \
  -o ${ctrl_prefix}_${prefix}_filterdup_local_bias_raw.bdg
# scaling chip and control based on spike in
treat_num_read=$(wc -l ${prefix}_filterdup.bed | awk '{print $1}')
treat_spike_num_read=$(wc -l ${prefix}.ecoli_filterdup.bed | awk '{print $1}')
ctrl_spike_num_read=$(wc -l ${ctrl_prefix}.ecoli_filterdup.bed | awk '{print $1}')
min_spike_num_read=$(wc -l *ecoli_filterdup.bed | sort -nk 1 | head -n 1 | awk '{print $1}')
scale_factor_spike_treat=$(echo "scale=4;"$min_spike_num_read/$treat_spike_num_read | bc)
scale_factor_spike_ctrl=$(echo "scale=4;"$min_spike_num_read/$ctrl_spike_num_read | bc)
p=$(echo "scale=4;"$scale_factor_spike_treat/$scale_factor_spike_ctrl | bc)
echo spike in scale factor=$p
if (( $(echo $p '>' 1 |bc -l) )); then
p=$(echo "scale=10;"1/$p | bc)
echo scale down control...
# scale down control
$run macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_local_bias_raw.bdg -m multiply -p $p \
-o ${ctrl_prefix}_${prefix}_filterdup_local_lambda.bdg
$run macs2 bdgopt -i ${prefix}_filterdup.pileup.bdg -m multiply -p 1 -o ${prefix}_filterdup_scale.pileup.bdg
echo scaled spike-in control= $(echo $p*$ctrl_spike_num_read | bc)
echo scaled spike-in treat= $(echo $treat_spike_num_read | bc)
else
echo scale down treatment
# scale down treatment
$run macs2 bdgopt -i ${prefix}_filterdup.pileup.bdg -m multiply -p $p -o ${prefix}_filterdup_scale.pileup.bdg
$run macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_local_bias_raw.bdg -m multiply -p 1 \
-o ${ctrl_prefix}_${prefix}_filterdup_local_lambda.bdg
echo scaled spike-in treat= $(echo $p*$treat_spike_num_read | bc)
echo scaled spike-in control= $(echo $ctrl_spike_num_read | bc)
fi
# compare chip and control and calculate q and pvalue
macs2 bdgcmp -t ${prefix}_filterdup_scale.pileup.bdg \
  -c ${ctrl_prefix}_${prefix}_filterdup_local_lambda.bdg -m qpois \
  -o ${prefix}_qvalue.bdg
macs2 bdgcmp -t ${prefix}_filterdup_scale.pileup.bdg \
  -c ${ctrl_prefix}_${prefix}_filterdup_local_lambda.bdg -m ppois \
  -o ${prefix}_pvalue.bdg
#Call peaks on score track using a cutoff
macs2 bdgpeakcall -i ${prefix}_qvalue.bdg -c $log_qval_cutoff -l $d -g $read_len \
  -o ${prefix}_qval_peaks.bed
macs2 bdgpeakcall -i ${prefix}_pvalue.bdg -c $log_pval_cutoff -l $d -g $read_len \
  -o ${prefix}_pval_peaks.bed
Edited: to fix the mixed-up scaling.
p.s: with `-f BEDPE`, you don't really need to specify `--extsize` anymore.
I noticed a further problem with this pipeline: the pileup subcommand will ignore the --extsize and -B arguments if -f is either BAMPE or BEDPE. How is it even possible, then, to get the slocal and llocal backgrounds?
@JohnMCMa
Thanks for pointing this out. I didn't even notice that pileup will ignore --extsize and -B.
I compared the .bdg output with and without these options and confirmed that they are exactly the same.
@taoliu can speak more on this. But briefly looking in IGV, it seems the filterdup step correctly generated a BED that covers the entire read pairs.
@olechnwin
I looked at the code again, and I think what is done in callpeak--which at least makes sense if we're dealing with BAMPE--is calculating lambda as if the control BAM were single-ended, and then dividing the lambda by 2.
In that case, I think an approximation in the advanced pipeline would be:
1) Convert the control BAM to a single-end BED with bedtools bamtobed.
2) Use that BED as the input to macs2 pileup during the lambda determination step.
3) Use macs2 bdgopt to halve all lambda values.
I'm not completely confident, however, that this is what @taoliu intended.
@JohnMCMa Thanks for your input! You are totally right :)
In our regular callpeak command, if PE data is used, signals for the treatment/ChIP track are computed by piling up the 'fragments' between the mapped ends of read pairs, whereas the noise for the control/input track is computed by extending each end by 1/2 of the slocal or 1/2 of the llocal value in both the left and right directions, then piling them up. The idea is based on the following assumption: the paired ends from the treatment/ChIP data surround the binding sites, so the fragments between the ends are where the factor binds. On the other hand, we assume that both ends from the control/input come from biased genomic locations (e.g. highly accessible sites), so the impact of genomic bias should center on the mapping location of each end.
So....I am confused. Which part of the pipeline should be changed?
@olechnwin Sorry for the confusion! I am discussing the ## build local background track from control part, where you pile up the control data.
# d background
macs2 pileup -f BEDPE -B -i ${ctrl_prefix}_filterdup.bed \
  -o ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg --extsize $d2
# slocal background
macs2 pileup -f BEDPE -B --extsize 500 -i ${ctrl_prefix}_filterdup.bed \
  -o ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg
# large local background default 10k (--extsize 5000)
macs2 pileup -f BEDPE -i ${ctrl_prefix}_filterdup.bed -B --extsize 5000 \
  -o ${ctrl_prefix}_${prefix}_filterdup_10k_bg.bdg
But let me state this clearly: I am not suggesting that you have to replicate the same process as callpeak, and this is the beauty of your pipeline, since you can control each step. As I mentioned, the default behavior of piling up control data as single-end is based on certain assumptions, and if you have a good reason to pile up control data as paired-end, feel free to do so. My guess is that the difference will be small.
If you want to pile up control data as single-end like callpeak, then the ${ctrl_prefix}_filterdup.bed file has to be broken down into a single-end BED file like:
awk -v OFS="\t" -v RL=$read_len '{print $1,$2,$2+RL,".",".","+\n"$1,$3-RL,$3,".",".","-"}' \
${ctrl_prefix}_filterdup.bed > ${ctrl_prefix}_filterdup.as_single_end.bed
Then:
# d background
macs2 pileup -f BED -B -i ${ctrl_prefix}_filterdup.as_single_end.bed \
-o ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg --extsize $d2
# slocal background
macs2 pileup -f BED -B -i ${ctrl_prefix}_filterdup.as_single_end.bed \
-o ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg --extsize 500
# large local background default 10k (--extsize 5000)
macs2 pileup -f BED -B -i ${ctrl_prefix}_filterdup.as_single_end.bed \
-o ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.bdg --extsize 5000
# divide by 2
awk -v OFS="\t" '{print $1,$2,$3/2}' ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg \
> ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.halved.bdg
mv ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.halved.bdg ${ctrl_prefix}_${prefix}_filterdup_d_bg.pileup.bdg
awk -v OFS="\t" '{print $1,$2,$3/2}' ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg \
> ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.halved.bdg;
mv ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.halved.bdg ${ctrl_prefix}_${prefix}_filterdup_1k_bg.pileup.bdg
awk -v OFS="\t" '{print $1,$2,$3/2}' ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.bdg \
> ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.halved.bdg;
mv ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.halved.bdg ${ctrl_prefix}_${prefix}_filterdup_10k_bg.pileup.bdg
After that, just keep using the same code in your pipeline. (btw, I didn't test my code, so there may be a bug...)
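As a quick sanity check, the BEDPE-to-single-end awk conversion above can be tried on one hypothetical fragment:

```shell
# One hypothetical fragment chr1:100-400 with 150 bp reads, to show what
# the BEDPE-to-single-end conversion produces: one "+" read anchored at
# the left end and one "-" read anchored at the right end
printf 'chr1\t100\t400\n' > pair_demo.bed
read_len=150
awk -v OFS="\t" -v RL=$read_len \
  '{print $1,$2,$2+RL,".",".","+\n"$1,$3-RL,$3,".",".","-"}' \
  pair_demo.bed > pair_demo.as_single_end.bed
cat pair_demo.as_single_end.bed
```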
@taoliu , Now I get it. Thank you so much for taking the time to clarify. Really appreciate it.
@olechnwin
Thank you for sharing your pipeline for the paired-end MACS2 subcommands and ChIP-seq spike-in normalization; this was immensely helpful.
I have a clarification question re: your spike-in normalization code:
When you end up with a lower number of spike-in reads in treat_spike_num_read vs ctrl_spike_num_read, won't the calculated p from your script be >1? In that case, shouldn't you scale down the associated control reads and not the treatment file? It looks like your loop would scale down the treatment based on the $p '>' 1 condition. Perhaps I've missed something?
Thanks in advance!
@cool-genes,
You are so right! I mixed up the scaling. I didn't fix the pipeline when I realized this; I didn't know other people were paying attention to it. I've now edited the pipeline. I do wish you had noticed this before I did, though; that would have saved a lot of trouble :-)
BTW, if you read Tao Liu's comment you'll see that you don't need --extsize with -f BEDPE since it'll be ignored.
@taoliu, as I was going to change the commands to remove --extsize, since it'll be ignored with -f BEDPE, I'm not sure what I should do with p in the following command, where it used to be d/10000:
$run macs2 bdgopt -i ${ctrl_prefix}_${prefix}_filterdup_10k_bg.bdg -m multiply -p $p -o ${ctrl_prefix}_${prefix}_filterdup_10k_bg_norm.bdg
Also, does that mean I don't have slocal and llocal anymore?
@taoliu I am using deeptools bamCoverage with a scale factor ($sfactor) to get normalised bedGraph files from the bam files of some CUT&RUN assays. Then I used the normalised files as the input to MACS2. I'm not sure whether that's right or not.
-f BEDPE didn't work; only -f BED --nomodel --extsize 200 --shift -100 -B did.
@olechnwin, thank you so much for sharing your pipeline! It helped me a lot to understand how to do this with the tutorial (Thank you @taoliu )
I have one question about your pipeline, however: where does the ecoli_filterdup.bed file come from? I assume this is your spike-in data. I'm doing ChIP-seq analysis with spike-in normalization for the first time and I have to admit I am a bit confused as to how exactly it works. What did you do with it? And did you align to a dual genome or to each genome individually? Thanks a lot!
@annsophiegironne,
ecoli is my spike-in. I aligned my reads to each genome individually. ecoli_filterdup.bed was the output from running macs2 filterdup.
Hi, thank you for providing the pipeline! I was curious whether you ever compared the output of spike-in-normalised peak calling with regular peak calling, and whether you noticed any substantial differences. That would be really interesting! Also, did you use this pipeline in any publication? Thank you!
Davide
Hello, I have been trying to figure out how to incorporate external spike-in normalisation factors into both single-end and paired-end ChIP-seq data for peak calling with MACS2.
I tried peak calling through subcommands, scaling the ChIP and control samples with the bdgopt subcommand based on a calculated spike-in normalisation factor. However, at the moment the "Advanced: Call peaks using MACS2 subcommands" tutorial provides instructions only for single-end data.
What would be the best way to perform this type of analysis for paired-end data?
Thank you very much in advance.