Q: Using MACS2 callpeak for bacteria ChIP-Seq data with --keep-up 1 reports too few paired peaks

huizhen2014 commented 4 years ago

Describe the bug Flowing the instructions, I chose the --keep-up 1 to callpeak with macs2 for bacterial ChIP-Seq data. But, it reported "Too few paired peaks So I can not build the model! ...". Then, I changed the parameter with --keep-up all, It reported the seemingly right results. So, I was wandering If I should keep the parameter --keep-up all when I analyzed the bacterial ChIP-Seq data, or I could modified any other parameters to fit the analysis. Thanks.

To Reproduce Command line: callpeak --SPMR --trackline --bdg --keep-dup 1 --mfold 5 50 --gsize 5.3e6 --format BAM --name RNAP_beta --outdir RNAP_beta_output -t ERR654293.RNAP_beta_filtered_sorted.bam ERR654291.RNAP_beta_filtered_sorted.bam -c ERR654295.untagged_filtered_sorted.bam ERR654288.untagged_filtered_sorted.bam

Screenshots

System (please complete the following information):

OS: [e.g. Mac OSX]
Python version [e.g. 3.7.5]
MACS Version [e.g. 2.2.5]

taoliu commented 4 years ago

@huizhen2014 My two comments: 1) When the genome size is small, to keep only 1 unique pair may dilute the real signals. You may consider subsample your ChIP data or set a higher number for keep-dup option. You currently have 5 million 50bps reads which provide, assuming uniformly distributed in the bacterial genome, a 50 fold coverage. I would suggest you subsample only 1/50 of your data and redo the analysis. 2) The complexity in the input data is strangely low, with 99% redudant rate. You may need to check if the experiment/data really works. In general, an input is like whole genome DNA sequencing with a more uniform distribution than ChIP data. However only 37.2K unique 50bps reads were kept which means only small proportion of bacteria genome is covered in the input.

huizhen2014 commented 4 years ago

Thanks, I will flow your suggestions and redo the analysis.

huizhen2014 commented 4 years ago

Oh, sorry, I forgot to ask If I should decrease the mfold parameter, such as --mfold 2 50 or 1 50, to get more peaks and then do some manually check to eliminate some obviously negative results, as the bacterial genes are much less then those in Eukaryotic organisms, with will present less peaks apparently. Thanks.

taoliu commented 4 years ago

@huizhen2014 My suggestion is to skip model building and set a fixed value for extsize.

huizhen2014 commented 4 years ago

Thanks, and sorry for my delayed sequel question. According to my comprehension, I could use the normal analysis info output in the screen or the results files by macs2, such as the suggested fragment length, fix the extsize parameter then re-run the analysis.

yunfei-dai commented 4 years ago

Hi Professor Liu,

Thanks for replying to my email on Friday with all the suggestions. I tried skipping the model building with an extsize of 200 and it did call the peaks. The problem now is that it called too many peaks. When sorting those peaks with p-values, however, I found the top peaks are indeed the ones we expected. Next I tried subsampling the input file. I subsampled 10% of the input file, which reduced the read numbers from about 47 million to 4.7 million. This time macs called fewer peaks -- about half to that before subsampling. Do you think I should further subsample the bam files (both treated and control) and call peaks again, or should I adjust other parameters to get more reliable peak calling? I attached below the outputs from peak calling with subsampled control file.

Thanks again for your time and advice!

Yunfei

taoliu commented 4 years ago

@yunfei-dai Your treatment file after downsampling contains 3 million reads, while the input has 4 million reads. I think you can sub-sample them further since the current genome coverage is still high -- 3million * 50 / 4 million (genome size = 4million) ~ 40 fold. In another word, if the ChIP-seq reads are uniformly distributed across the whole genome, although they shouldn't, you would see by average 40 reads covering the same position. If you check the human chip-seq data, even 50 million reads ( about 1 fold genome coverage) is already enough. So even if you downsample your library to 1 million reads, it should still give you reliable results. You can down-sample multiple times to make technical replicates and check if your results are consistent.

millerh1 commented 4 years ago

Hello -- I am also experiencing this issue with some sequencing data in the yeast genome. I expect the issue is that the genome is too small. However, I also feel like it's probably not the best practice to set an arbitrary ext-size when I don't already know that information or to be down sampling. Is there any other way to handle this issue which still being able to model d?

Here is my output:

(RSeqEnv) millerh1@cbbi16:~/Bishop.lab$ macs2 predictd -m 5 50 -i "SRX1761643_UNKNOWN.sacCer3.experiment.bam"
INFO  @ Thu, 01 Oct 2020 14:52:04: # read alignment files...
INFO  @ Thu, 01 Oct 2020 14:52:04: # read treatment tags...
INFO  @ Thu, 01 Oct 2020 14:52:04: Detected format is: BAM
INFO  @ Thu, 01 Oct 2020 14:52:04: * Input file is gzipped.
INFO  @ Thu, 01 Oct 2020 14:52:06:  1000000
INFO  @ Thu, 01 Oct 2020 14:52:08:  2000000
INFO  @ Thu, 01 Oct 2020 14:52:10:  3000000
INFO  @ Thu, 01 Oct 2020 14:52:11:  4000000
INFO  @ Thu, 01 Oct 2020 14:52:13:  5000000
INFO  @ Thu, 01 Oct 2020 14:52:15:  6000000
INFO  @ Thu, 01 Oct 2020 14:52:16:  7000000
INFO  @ Thu, 01 Oct 2020 14:52:16: 7073619 reads have been read.
INFO  @ Thu, 01 Oct 2020 14:52:16: tag size is determined as 50 bps
INFO  @ Thu, 01 Oct 2020 14:52:16: # tag size = 50
INFO  @ Thu, 01 Oct 2020 14:52:16: # total tags in alignment file: 7073619
INFO  @ Thu, 01 Oct 2020 14:52:16: # Build Peak Model...
INFO  @ Thu, 01 Oct 2020 14:52:16: #2 looking for paired plus/minus strand peaks...
INFO  @ Thu, 01 Oct 2020 14:52:17: #2 number of paired peaks: 99
WARNING @ Thu, 01 Oct 2020 14:52:17: Too few paired peaks (99) so I can not build the model! Broader your MFOLD range parameter may erase this error. If it still can't build the model, we suggest to use --nomodel and --extsize 147 or other fixed number instead.
WARNING @ Thu, 01 Oct 2020 14:52:17: Process for pairing-model is terminated!
WARNING @ Thu, 01 Oct 2020 14:52:17: # Can't find enough pairs of symmetric peaks to build model!

I noticed that even though the genome of yeast is small, there seemed to be plenty of non-redundant pairs to build from:

Either way, when I broaden the MFOLD it seems to work -- but I don't know if this is really appropriate:

(RSeqEnv) millerh1@cbbi16:~/Bishop.lab$ macs2 predictd -m 1 100 -i "SRX1761643_UNKNOWN.sacCer3.experiment.bam"
INFO  @ Thu, 01 Oct 2020 14:54:23: # read alignment files...
INFO  @ Thu, 01 Oct 2020 14:54:23: # read treatment tags...
INFO  @ Thu, 01 Oct 2020 14:54:23: Detected format is: BAM
INFO  @ Thu, 01 Oct 2020 14:54:23: * Input file is gzipped.
INFO  @ Thu, 01 Oct 2020 14:54:25:  1000000
INFO  @ Thu, 01 Oct 2020 14:54:27:  2000000
INFO  @ Thu, 01 Oct 2020 14:54:29:  3000000
INFO  @ Thu, 01 Oct 2020 14:54:30:  4000000
INFO  @ Thu, 01 Oct 2020 14:54:32:  5000000
INFO  @ Thu, 01 Oct 2020 14:54:34:  6000000
INFO  @ Thu, 01 Oct 2020 14:54:35:  7000000
INFO  @ Thu, 01 Oct 2020 14:54:36: 7073619 reads have been read.
INFO  @ Thu, 01 Oct 2020 14:54:36: tag size is determined as 50 bps
INFO  @ Thu, 01 Oct 2020 14:54:36: # tag size = 50
INFO  @ Thu, 01 Oct 2020 14:54:36: # total tags in alignment file: 7073619
INFO  @ Thu, 01 Oct 2020 14:54:36: # Build Peak Model...
INFO  @ Thu, 01 Oct 2020 14:54:36: #2 looking for paired plus/minus strand peaks...
INFO  @ Thu, 01 Oct 2020 14:54:36: #2 number of paired peaks: 237
WARNING @ Thu, 01 Oct 2020 14:54:36: Fewer paired peaks (237) than 1000! Model may not be build well! Lower your MFOLD parameter may erase this warning. Now I will use 237 pairs to build model!
INFO  @ Thu, 01 Oct 2020 14:54:36: start model_add_line...
INFO  @ Thu, 01 Oct 2020 14:54:36: start X-correlation...
INFO  @ Thu, 01 Oct 2020 14:54:36: end of X-cor
INFO  @ Thu, 01 Oct 2020 14:54:36: # finished!
INFO  @ Thu, 01 Oct 2020 14:54:36: # predicted fragment length is 49 bps
INFO  @ Thu, 01 Oct 2020 14:54:36: # alternative fragment length(s) may be 49,216,257,364,441,514,592 bps
INFO  @ Thu, 01 Oct 2020 14:54:36: # Generate R script for model : predictd

Thank you for your time,

Henry

taoliu commented 4 years ago

@millerh1 Hi Henry, it seems you didn't set -g for predictd. By default, it's the human genome size. It means that the -m MFOLD is based on the average on a genome with human genome size. It also means the -m 1 100 -g hs is equivalent to -g 12000000 -m 200 20000. It seems that, at such a high MFOLD setting, the 'phantom peak' ( https://genome.ucsc.edu/ENCODE/qualityMetrics.html#chipSeq, where the estimated extsize from cross-correlation is the same as the read length) jumps out just because that now the unique + strand and - strand reads are mostly 'uniformly' distributed. I may use the R script output from predictd to generate a PDF figure and check if the other alternative fragment lengths are of any good. Another suggestion is to further randomly subsample your data and try again.

millerh1 commented 4 years ago

Thank you for looking into this. Changing to the correct genome size didn't make any difference with respect to this issue. Only changing the MFOLD seemed to help. Considering how common it is to want to use smaller genomes, such as yeast, it seems worthwhile to have a way to deal with this within the model-based approach MACS2 takes.

This is the model I got:

This doesn't seem very good to me -- perhaps no way to pick a distance given there isn't a clear peak?

For reference -- here is the cross-correlation:

macs3-project / MACS

Q: Using MACS2 callpeak for bacteria ChIP-Seq data with --keep-up 1 reports too few paired peaks #353