d calculated from paired-peaks are smaller than 2*tag length

macs3-project / MACS

MACS -- Model-based Analysis of ChIP-Seq

https://macs3-project.github.io/MACS/

BSD 3-Clause "New" or "Revised" License


d calculated from paired-peaks are smaller than 2*tag length #258

Open stevehxf opened 6 years ago

stevehxf commented 6 years ago

Hi Tao,

I used INPUT as my control and default parameters to call narrow peaks, and I got this warning: "Since the d(171) calculated from paired-peaks are smaller than 2*tag length, it may be influenced by unknown sequencing problem!"

Any idea what happened and how to fix it?

Thanks!

Best, Steve

taoliu commented 6 years ago

In most recent cases the read length from an Illumina sequencer can be fairly long, up to 150 bp, so 2*tag length can easily exceed the predicted 'd', and this warning is not a big issue for your dataset. But if you have many datasets to process, I recommend using a fixed 'd' value for all your samples to keep your analysis consistent, for example by setting '--nomodel --extsize 300'.
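To make the condition behind the warning concrete, here is a minimal sketch of the comparison as the warning message words it. The function name is hypothetical, and the strict "smaller than" comparison is taken from the message text, not from the MACS source, so the exact behaviour at equality is an assumption. The tag length of 150 is the read length mentioned above, not a value reported for Steve's data.

```python
def d_below_twice_tag_length(d: int, tag_length: int) -> bool:
    """Return True when d is smaller than 2 * tag length, as the warning words it.

    Hypothetical helper: this mirrors the warning text, not the MACS source.
    """
    return d < 2 * tag_length


# d=171 comes from the warning in this thread; 150 is an assumed tag length.
print(d_below_twice_tag_length(171, 150))  # True, because 171 < 300
```

With 150 bp reads the threshold is 300, so a predicted d anywhere in the usual 150 to 300 range would trip this check.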

knowah commented 6 years ago

Hi Tao,

Thanks for responding to Steve's question as I was having the same issue. I am wondering how much of an effect the --extsize argument will have on the results, since my predicted 'd' value is about 240 in my samples (I have 150bp PE reads)? Should I use that value or should I use something higher like 300?

Additionally, for a particular sample, the predicted 'd' from macs2 predictd is 219 while the predicted 'd' from macs2 callpeak on the same sample with its input is 234. In both cases I was using the default arguments (e.g. -m 5 50). Is this because the predictd function does not consider the control/input sample?

Thanks, Noah

taoliu commented 6 years ago

@knowah First, on your additional question: unlike callpeak, predictd won't filter out redundant reads. So, to better mimic what callpeak does, you have to run macs2 filterdup --keep-dup 1 on your ChIP sample first. Also, the fragment size prediction has nothing to do with the control/input sample.
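The two-step order described above (filter duplicates, then predict d) might be sketched as follows. This only builds and prints the command lines; it does not run MACS. The file names and the helper name are hypothetical, and options such as the input format or genome size are left at their defaults here, which may not suit every dataset.

```python
import shlex


def predictd_like_callpeak(chip_file: str, dedup_file: str) -> list[list[str]]:
    """Build the filterdup and predictd commands, in the order they should run.

    Hypothetical helper; file names are placeholders chosen for illustration.
    """
    # Step 1: keep at most one read per position, as suggested in the comment above.
    filterdup = ["macs2", "filterdup", "-i", chip_file,
                 "--keep-dup", "1", "-o", dedup_file]
    # Step 2: predict the fragment size from the de-duplicated reads.
    predictd = ["macs2", "predictd", "-i", dedup_file]
    return [filterdup, predictd]


for cmd in predictd_like_callpeak("chip.bam", "chip.dedup.bed"):
    print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.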

The extsize is just a matter of 'smoothing' your ChIP-seq data. Ideally, you won't see any big difference in the peaks called between 230 and 240. Nowadays, you can think of the fragment size prediction purely as a 'data quality control' method for evaluating your ChIP sample. If you see a much smaller number, such as 50bp, then you should worry that something went wrong with the library preparation. If you are satisfied with the data quality, I always recommend ignoring the tiny differences from predictd and using a fixed --nomodel --extsize N.
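One way to apply the same fixed extension size across several samples is sketched below. Again this only builds and prints command lines. The sample file names, the helper name, and the value 300 (the example value from earlier in this thread) are placeholders, and other callpeak options are left at their defaults.

```python
import shlex


def callpeak_fixed_extsize(chip: str, control: str, name: str, extsize: int) -> list[str]:
    """Build a callpeak command that skips model building and uses a fixed extsize.

    Hypothetical helper; all paths and names are supplied by the caller.
    """
    return ["macs2", "callpeak", "-t", chip, "-c", control, "-n", name,
            "--nomodel", "--extsize", str(extsize)]


# Placeholder samples: (ChIP file, input/control file, output name prefix).
samples = [
    ("s1_chip.bam", "s1_input.bam", "s1"),
    ("s2_chip.bam", "s2_input.bam", "s2"),
]

for chip, control, name in samples:
    print(shlex.join(callpeak_fixed_extsize(chip, control, name, 300)))
```

Keeping the extension size in one place means every sample is smoothed identically, which is the consistency argument made above.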