macs3-project / MACS

MACS -- Model-based Analysis of ChIP-Seq
https://macs3-project.github.io/MACS/
BSD 3-Clause "New" or "Revised" License
698 stars 268 forks source link

Bug: variable overflow in peak column of narrowPeak file #500

Open diego-rt opened 2 years ago

diego-rt commented 2 years ago

Describe the bug Hi, for some reason we are getting negative values in the peak column of the narrowPeak file from our data (see image attached). This is probably due to the size of our chromosomes (32 gbp genome) and smells to me like a variable overflow when computing the peak summits.

To Reproduce macs3 callpeak -f BAMPE -t ${dedupBAM} -g 2.0e10 -n \${name} -B --SPMR --seed 23 -q 0.01 --call-summits --keep-dup all

Expected behavior I believe these are supposed to be positive values.

Screenshots

Screenshot 2022-02-01 at 20 11 22

System (please complete the following information): Linux, macs3a6

taoliu commented 2 years ago

Right... It's an overflow. Could you share with me the first ten rows of the narrowPeak file from MACS?

diego-rt commented 2 years ago

Sure! Here it goes:

image

And one last thing: we are really interested in using HMMRATAC with our ATAC-seq data but unfortunately it doesn't seem to work with CSI indexes, which are necessary because of the size of our chromosomes. Do you have any advice for how could we get it to work? I assume the issue comes from the htsjdk version used (I believe support for csi indexes was introduced around version 2.19.0), but I'm not really sure whether there might be additional hurdles. Are you guys planning to upgrade it to a more recent version of htsjdk? We would be happy to help with testing with huge chromosomes or with anything else that might help.

Thanks a lot for your help!

taoliu commented 2 years ago

@diego-rt The first ten rows seem fine to me. Will you see negative values in the last column of the narrowPeak file? Also, the chromosome names (C00????) seem different from the message from R (e.g. chr10p).

Regarding HMMRATAC, we are working on implementing it in MACS3 with Python in a separate branch right now. Thanks for the information of the CSI index! We will work on that.

diego-rt commented 2 years ago

Apologies for the delayed reply!

Oh that's great to hear about HMMRATAC! Do you have an estimated release timeframe in mind?

Yes, the chromosome names are different becase some are chromosome scale scaffolds and some are unplaced contigs (CXXXXXXX), but they are all part of the same genome reference: Axolotl genome v6.

Here go the lowest 1500 scores on the 10th column:

macs3_Debug_output.txt

taoliu commented 2 years ago

@diego-rt Thanks! When I wrote MACS, I have an assumption that a single chromosome can't exceed 2G... It turns out I need to modify a lot of variable types. It will take some time.

Regarding HMMRATAC in MACS, we are aiming to release it before June. But don't think it will solve the long chromosome problem unless I modify the variable types from int to long which are used to record the position of each read. Let me figure it out first...