macs3-project / MACS

MACS -- Model-based Analysis of ChIP-Seq
https://macs3-project.github.io/MACS/
BSD 3-Clause "New" or "Revised" License
700 stars 268 forks source link

Q: HMMRATAC producing too many peaks #636

Open igordot opened 6 months ago

igordot commented 6 months ago

I ran HMMRATAC on two samples that should be biological replicates. The quality difference between them is high, although the quality is far from ideal for both. However, the low quality sample produced more than 10x as many peaks (<10k versus >100k) with the default settings.

For comparison, the Java version of HMMRATAC produced <1k peaks for the first sample and no peaks for the second sample (score cutoff of 30).

Here is a genome browser screenshot to illustrate the situation:

image

The called peaks for the first sample look like peaks. Not so much for the second sample. Is this expected? Are there some guidelines to help avoid this?

taoliu commented 5 months ago

One thing you can try is to increase the cutoff for calling candidate regions in the macs3 version of hmmratac using the option for prescan-cutoff -c. And perhaps first run the --cutoff-analysis in hmmratac to decide the OK parameter for -c. Also it is likely that the cutoff values for -u and -l for training HMM have to be adjusted as well. Check https://macs3-project.github.io/MACS/docs/hmmratac.html#choices-of-cutoff-values

Please let me know if this can help.

igordot commented 5 months ago

I saw that, but wasn't sure how to properly use the output to determine the optimal values. Do you have an example documented somewhere?

taoliu commented 5 months ago

The cutoff-analysis for hmmratac command is discussed here: https://macs3-project.github.io/MACS/docs/hmmratac.html#choices-of-cutoff-values

And a similar cutoff-analysis for callpeak command is discussed here: https://macs3-project.github.io/MACS/docs/callpeak.html#cutoff-analysis

And if you can generate the cutoff analysis report from hmmratac --cutoff-analysis-only and share with me here, we can analyze the report together in this thread.

igordot commented 5 months ago

I am sharing the cutoff report here (I had to add .txt because curiously GitHub does not allow .tsv): test_cutoff_analysis.tsv.txt

It might even be easier to just paste it in whole:

score   npeaks  lpeaks  avelpeak
638.79  1   234 234.00
630.39  1   236 236.00
621.98  1   267 267.00
613.58  1   267 267.00
605.17  1   277 277.00
596.77  1   280 280.00
588.36  1   291 291.00
579.96  1   292 292.00
571.55  3   794 264.67
563.15  2   1972    986.00
554.74  2   2523    1261.50
546.34  3   2902    967.33
537.93  3   2913    971.00
529.53  3   2920    973.33
521.12  3   2925    975.00
512.71  3   2955    985.00
504.31  3   2989    996.33
495.90  3   3183    1061.00
487.50  3   3202    1067.33
479.09  3   3257    1085.67
470.69  3   3407    1135.67
462.28  3   3458    1152.67
453.88  3   3501    1167.00
445.47  3   3505    1168.33
437.07  3   3510    1170.00
428.66  3   3563    1187.67
420.26  3   3639    1213.00
411.85  3   3678    1226.00
403.45  3   3735    1245.00
395.04  3   4062    1354.00
386.64  3   4456    1485.33
378.23  3   4507    1502.33
369.83  3   4511    1503.67
361.42  3   4539    1513.00
353.02  3   4568    1522.67
344.61  3   4577    1525.67
336.21  3   4677    1559.00
327.80  3   4704    1568.00
319.40  3   5206    1735.33
310.99  2   6264    3132.00
302.58  3   6392    2130.67
294.18  3   6401    2133.67
285.77  3   6412    2137.33
277.37  3   6426    2142.00
268.96  3   6467    2155.67
260.56  3   6484    2161.33
252.15  3   6502    2167.33
243.75  3   6508    2169.33
235.34  4   6635    1658.75
226.94  5   6762    1352.40
218.53  5   6785    1357.00
210.13  6   6967    1161.17
201.72  6   6986    1164.33
193.32  6   7039    1173.17
184.91  6   7067    1177.83
176.51  7   7301    1043.00
168.10  7   7333    1047.57
159.70  8   7900    987.50
151.29  8   8338    1042.25
142.89  10  9056    905.60
134.48  13  9942    764.77
126.08  14  11087   791.93
117.67  18  12245   680.28
109.27  23  18535   805.87
100.86  27  25300   937.04
92.46   31  34402   1109.74
84.05   30  45057   1501.90
75.65   31  52352   1688.77
67.24   30  63203   2106.77
58.84   29  73885   2547.76
50.43   26  82657   3179.12
42.03   27  88598   3281.41
33.62   28  92070   3288.21
25.22   46  101435  2205.11
16.81   414 255592  617.37
8.40    14319   7995323 558.37
0.00    64045   2385081806  37240.72
taoliu commented 5 months ago

The fold change values for reasonable peaks are within the lower range. The default parameters for hmmratac are -u 20 -l 10 -c 1.2. It's possible that we need to increase the -c parameter a bit higher. I can't judge since the current npeak, lpeak, and avelpeak values of the score cutoff range 0.00-8.40 show huge difference and we don't have a finer resolution of the analysis between 0 and 8.4. You can try to use -u 20 -l 8 -c 5, and to see if the result can improve. We will try to add an option for fine-tuning this cutoff analysis to get better resolution in the future. At the meantime, we are adding a different emission model -- Poisson emission for HMM, as an addition to the current Gaussian emission model. In our internal test on some small dataset, it gives us more similar results as the java version HMMRATAC.

How is the current HMM from hmmratac? Could you share us the command line output as https://macs3-project.github.io/MACS/docs/hmmratac.html#tune-the-hmm-model?

Jungal10 commented 5 months ago

I ran HMMRATAC on two samples that should be biological replicates. The quality difference between them is high, although the quality is far from ideal for both. However, the low quality sample produced more than 10x as many peaks (<10k versus >100k) with the default settings.

For comparison, the Java version of HMMRATAC produced <1k peaks for the first sample and no peaks for the second sample (score cutoff of 30).

Here is a genome browser screenshot to illustrate the situation: image

The called peaks for the first sample look like peaks. Not so much for the second sample. Is this expected? Are there some guidelines to help avoid this?

Not a direct answer, but the scale on your IGV is quite different as well between the two

igordot commented 5 months ago

Not a direct answer, but the scale on your IGV is quite different as well between the two

Yes, each sample is auto-scaled independently since peak-calling is independent as well.

taoliu commented 5 months ago

Also related to #638, it's highly possible that the hidden markov model learned from the low quality data can't capture the expected chromatin structure around accessible sites. This is a common problem of "machine learning". In this case, perhaps a general peak calling process such as using "callpeak" function is more appropriate.

igordot commented 5 months ago

I am not surprised that machine learning is failing for these samples. I was primarily concerned about the large difference between the Java and MACS3 implementations.

taoliu commented 5 months ago

We are implementing a different emission model for HMM in MACS3 called the Poisson Emission model. During our internal tests, this simple model (compared with the Gaussian Emission model) generated results that were similar to those of the Java implementation. Stay tuned if you are interested in it :)