uio-bmi / graph_peak_caller

ChIP-seq peak caller for reads mapped to a graph-based reference genome
BSD 3-Clause "New" or "Revised" License
18 stars 8 forks source link

Graph peak caller and MACS2 - Different peak numbers #4

Closed cgroza closed 5 years ago

cgroza commented 5 years ago

Hi,

We generated graph genomes from NA12878 variant calls and alligned histone ChIP-seq reads.

We also succesfully called H3K4me1 peaks using both MACS2 and Graph Peak caller. However, I notice that Graph Peak Caller calls approximatively 30 000 fewer peaks than MACS2.

I tried to match the parameters as best as I could, using the same fragment length (265 bp), the same genome size, the same control sample.

I also ran MACS2 without any extra argument (generates background model, calls narrow peaks).

I still get ~137 000 peaks from MACS2 and only 106 000 peaks from Graph Peak Caller.

Could you provide me with any clues as to why this might happen? Is there a way I am supposed to run MACS2 to get similar results to Graph Peak Caller?

My thanks, Cristian Groza Bourque Lab, McGill University

ivargr commented 5 years ago

Hi, and thanks for asking!

There could be a few different reasons for this. But just to clarify one thing first: You mention fragment length 265. Is this the fragment length that MACS2 predicts, or is this a fragment length that you know from before (that the sample follows)? Am I understanding correctly that you specified this fragment length when running MACS2 and Graph Peak Caller (and did not let MACS2 predict the fragment length itself)?

cgroza commented 5 years ago

Hi Thanks for the quick reply. The fragment length is predicted by macs2. I am giving it as a parameter to graph peak caller.

ivargr commented 5 years ago

I see, then they should be using the same fragment length.

Are you following this guide (https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial) to do the peak calling? If so, are you counting unique reads the way that is suggested on that page, and sending that number in by using the -u flag on graph_peak_caller callpeaks. If you are, then I just realized that the specified unix command given on that page for counting number of unique reads will give slightly wrong uniqe read count (we have a new way of counting unique reads -- that page has unfortunately not been updated), so if you are using that way to count unique reads, that might be a problem.

cgroza commented 5 years ago

Yes. I am using that tutorial and counting reads their way. Also, do I also have to count the control reads also?

Could you point me to the correct way of dealing with this?

Thank you for your expertise.

-------- Original Message -------- On Jan 25, 2019, 4:44 PM, ivargr wrote:

I see, then they should be using the same fragment length.

Are you following this guide (https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial) to do the peak calling? If so, are you counting unique reads the way that is suggested on that page, and sending that number in by using the -u flag on graph_peak_caller callpeaks. If you are, then I just realized that the specified unix command given on that page for counting number of unique reads will give slightly wrong uniqe read count (we have a new way of counting unique reads -- that page has unfortunately not been updated), so if you are using that way to count unique reads, that might be a problem.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

ivargr commented 5 years ago

Okay, that command specified there for counting reads unfortunately gives slightly wrong read count, which in turn makes the control track a bit too high, which I think is the reason why you get fewer peaks (sorry for this mistake, we have forgotten to update that page, I'll will update it soon).

The correct command is this, can you try using that instead?:

graph_peak_caller count_unique_reads $chromosomes $graph_dir/ filtered 

This command will take some time to run, and print progress while running. The number of unique reads will be printed in the end. Could you try using this number after -u when running graph peak caller (the new number for unique reads should then be slightly different than what you have been using before). Could you give that a try and let me know if you end up with more peaks? You will probably not end up with exact same number of peaks as MACS2, but you get something closer.

PS: Let me know if you don't get the above command to work (it might simply be that I specified other file name than you have)

cgroza commented 5 years ago

Hi, So I had a chance to apply your advice. Counting the unique reads with graph peak caller gives around 43027522 unique reads (very close to MACS2 tags after filtering) as opposed to the previous ~53000000 from sort | uniq. The current number of peaks from graph peak caller rose to ~121 000. There is still a ~16K difference but this is much smaller than previously.

One thing I note in the MACS2 filtering steps is that it filters the control tags too. I notice that we do not count control unique reads or filter them in any fashion. Is there a similar counterpart in graph peak caller? Maybe this could explain the remaining difference?

ivargr commented 5 years ago

Glad to see that it made a difference!

The control alignments should also be automatically filterered by Graph Peak Caller, so that there are only unique alignments left (i.e. no two alignments starting on the same base pair on the same strand). One can choose not to filter them, but then a warning should be printed (as shown on this line https://github.com/uio-bmi/graph_peak_caller/blob/65df85d8caaa78cab83760c618801a784c0fa8ff/graph_peak_caller/multiplegraphscallpeaks.py#L101). So if you don't see a warning like that printed, they should have been filtered (which is the default option).

A difference of 16k peaks on that many peaks may be natural. We also often observed that Graph Peak Caller would call fewer peaks even though it is doing very similar things to MACS2. The reason for this is a slight difference in how the local control values are calcluated, where the local control values typically end up a bit higher than MACS2 (if I remember correctly). Ending up with fewer peaks should really not be a big problem, since Graph Peak Caller will then only miss the "weaker" peaks (with lowest log q value), so you can tune this effect by specifying the -q parameter on callpeaks_whole_genome_from_p_values a bit lower (I think default is 0.05 which is the same as MACS2).

So my bet is that 16k fewer peaks now doesn't necessarily mean that anything is wrong, or that you have done anything wrong, but it could be useful to compare the peaks you have found with those found by MACS2 to see how the two methods agree with each other. This can be done by following the steps under "Compare your peaks with peaks found by Macs2" in the peak calling guide you have followed. You will then get the number of peaks found uniquely by each and by both methods. That guide assumes you have your MACS peaks in a file called peaks.narrowPeak. You can e.g. create this file by choosing the top 121k peaks from you original macs peaks file, which I think is already sorted on q value (so you can do head -n 121000 original_peaks.narrowPeak > peaks.narrowPeak`). That way, you can see if Graph Peak Caller is missing any of the high-scoring peaks that MACS2 finds (and vice versa).

Let me know if you try doing this and get stuck, and I'll try to help!