Open igordot opened 6 months ago
One thing you can try is to increase the cutoff for calling candidate regions in the macs3 version of hmmratac
using the option for prescan-cutoff -c
. And perhaps first run the --cutoff-analysis
in hmmratac
to decide the OK parameter for -c
. Also it is likely that the cutoff values for -u
and -l
for training HMM have to be adjusted as well. Check https://macs3-project.github.io/MACS/docs/hmmratac.html#choices-of-cutoff-values
Please let me know if this can help.
I saw that, but wasn't sure how to properly use the output to determine the optimal values. Do you have an example documented somewhere?
The cutoff-analysis for hmmratac
command is discussed here: https://macs3-project.github.io/MACS/docs/hmmratac.html#choices-of-cutoff-values
And a similar cutoff-analysis for callpeak
command is discussed here: https://macs3-project.github.io/MACS/docs/callpeak.html#cutoff-analysis
And if you can generate the cutoff analysis report from hmmratac --cutoff-analysis-only
and share with me here, we can analyze the report together in this thread.
I am sharing the cutoff report here (I had to add .txt because curiously GitHub does not allow .tsv): test_cutoff_analysis.tsv.txt
It might even be easier to just paste it in whole:
score npeaks lpeaks avelpeak
638.79 1 234 234.00
630.39 1 236 236.00
621.98 1 267 267.00
613.58 1 267 267.00
605.17 1 277 277.00
596.77 1 280 280.00
588.36 1 291 291.00
579.96 1 292 292.00
571.55 3 794 264.67
563.15 2 1972 986.00
554.74 2 2523 1261.50
546.34 3 2902 967.33
537.93 3 2913 971.00
529.53 3 2920 973.33
521.12 3 2925 975.00
512.71 3 2955 985.00
504.31 3 2989 996.33
495.90 3 3183 1061.00
487.50 3 3202 1067.33
479.09 3 3257 1085.67
470.69 3 3407 1135.67
462.28 3 3458 1152.67
453.88 3 3501 1167.00
445.47 3 3505 1168.33
437.07 3 3510 1170.00
428.66 3 3563 1187.67
420.26 3 3639 1213.00
411.85 3 3678 1226.00
403.45 3 3735 1245.00
395.04 3 4062 1354.00
386.64 3 4456 1485.33
378.23 3 4507 1502.33
369.83 3 4511 1503.67
361.42 3 4539 1513.00
353.02 3 4568 1522.67
344.61 3 4577 1525.67
336.21 3 4677 1559.00
327.80 3 4704 1568.00
319.40 3 5206 1735.33
310.99 2 6264 3132.00
302.58 3 6392 2130.67
294.18 3 6401 2133.67
285.77 3 6412 2137.33
277.37 3 6426 2142.00
268.96 3 6467 2155.67
260.56 3 6484 2161.33
252.15 3 6502 2167.33
243.75 3 6508 2169.33
235.34 4 6635 1658.75
226.94 5 6762 1352.40
218.53 5 6785 1357.00
210.13 6 6967 1161.17
201.72 6 6986 1164.33
193.32 6 7039 1173.17
184.91 6 7067 1177.83
176.51 7 7301 1043.00
168.10 7 7333 1047.57
159.70 8 7900 987.50
151.29 8 8338 1042.25
142.89 10 9056 905.60
134.48 13 9942 764.77
126.08 14 11087 791.93
117.67 18 12245 680.28
109.27 23 18535 805.87
100.86 27 25300 937.04
92.46 31 34402 1109.74
84.05 30 45057 1501.90
75.65 31 52352 1688.77
67.24 30 63203 2106.77
58.84 29 73885 2547.76
50.43 26 82657 3179.12
42.03 27 88598 3281.41
33.62 28 92070 3288.21
25.22 46 101435 2205.11
16.81 414 255592 617.37
8.40 14319 7995323 558.37
0.00 64045 2385081806 37240.72
The fold change values for reasonable peaks are within the lower range. The default parameters for hmmratac
are -u 20 -l 10 -c 1.2
. It's possible that we need to increase the -c
parameter a bit higher. I can't judge since the current npeak, lpeak, and avelpeak values of the score cutoff range 0.00-8.40 show huge difference and we don't have a finer resolution of the analysis between 0 and 8.4. You can try to use -u 20 -l 8 -c 5
, and to see if the result can improve. We will try to add an option for fine-tuning this cutoff analysis to get better resolution in the future. At the meantime, we are adding a different emission model -- Poisson emission for HMM, as an addition to the current Gaussian emission model. In our internal test on some small dataset, it gives us more similar results as the java version HMMRATAC.
How is the current HMM from hmmratac
? Could you share us the command line output as https://macs3-project.github.io/MACS/docs/hmmratac.html#tune-the-hmm-model?
I ran HMMRATAC on two samples that should be biological replicates. The quality difference between them is high, although the quality is far from ideal for both. However, the low quality sample produced more than 10x as many peaks (<10k versus >100k) with the default settings.
For comparison, the Java version of HMMRATAC produced <1k peaks for the first sample and no peaks for the second sample (score cutoff of 30).
Here is a genome browser screenshot to illustrate the situation:
The called peaks for the first sample look like peaks. Not so much for the second sample. Is this expected? Are there some guidelines to help avoid this?
Not a direct answer, but the scale on your IGV is quite different as well between the two
Not a direct answer, but the scale on your IGV is quite different as well between the two
Yes, each sample is auto-scaled independently since peak-calling is independent as well.
Also related to #638, it's highly possible that the hidden markov model learned from the low quality data can't capture the expected chromatin structure around accessible sites. This is a common problem of "machine learning". In this case, perhaps a general peak calling process such as using "callpeak" function is more appropriate.
I am not surprised that machine learning is failing for these samples. I was primarily concerned about the large difference between the Java and MACS3 implementations.
We are implementing a different emission model for HMM in MACS3 called the Poisson Emission model. During our internal tests, this simple model (compared with the Gaussian Emission model) generated results that were similar to those of the Java implementation. Stay tuned if you are interested in it :)
I ran HMMRATAC on two samples that should be biological replicates. The quality difference between them is high, although the quality is far from ideal for both. However, the low quality sample produced more than 10x as many peaks (<10k versus >100k) with the default settings.
For comparison, the Java version of HMMRATAC produced <1k peaks for the first sample and no peaks for the second sample (score cutoff of 30).
Here is a genome browser screenshot to illustrate the situation:
The called peaks for the first sample look like peaks. Not so much for the second sample. Is this expected? Are there some guidelines to help avoid this?