MACS2 segmentation fault due to chromosome names

jnktsj commented 7 years ago

Hi, I just wanted to let you know that MACS2 failed to finish when reference chromosome names contain symbols (e.g. 2L52.1a:gene=WBGene00007063, ZK856.2:type=Transposon-mRNA:gene=WBGene00023448, etc). It stops right after #3 Pre-compute pvalue-qvalue table. I used BED for the run.

taoliu commented 7 years ago

Hi @jnktsj, I just tested a BED file with about 200k lines and saw no problem. I gave a weird chromosome name prefix '123abc: +-_[]{})(\/:.,?!^='. Could you share me your BED file?

(TLPython)tliu4@srv-u10-28:1$macs2 --version
macs2 2.1.1.20160309
(TLPython)tliu4@srv-u10-28:1$head test.bed
123abc: +-_[]{})(\/:.,?!^=1 564917  565017  .   .   +
123abc: +-_[]{})(\/:.,?!^=1 565533  565633  .   .   +
123abc: +-_[]{})(\/:.,?!^=1 565844  565944  .   .   +
123abc: +-_[]{})(\/:.,?!^=1 565858  565958  .   .   +
123abc: +-_[]{})(\/:.,?!^=1 566132  566232  .   .   +
123abc: +-_[]{})(\/:.,?!^=1 566212  566312  .   .   +
123abc: +-_[]{})(\/:.,?!^=1 566736  566836  .   .   +
123abc: +-_[]{})(\/:.,?!^=1 566976  567076  .   .   +
123abc: +-_[]{})(\/:.,?!^=1 567211  567311  .   .   +
123abc: +-_[]{})(\/:.,?!^=1 567320  567420  .   .   +
(TLPython)tliu4@srv-u10-28:1$macs2 callpeak -t test.bed
INFO  @ Wed, 22 Mar 2017 12:07:28:
# Command line: callpeak -t test.bed
# ARGUMENTS LIST:
# name = NA
# format = AUTO
# ChIP-seq file = ['test.bed']
# control file = None
# effective genome size = 2.70e+09
# band width = 300
# model fold = [5, 50]
# qvalue cutoff = 5.00e-02
# Larger dataset will be scaled towards smaller dataset.
# Range for calculating regional lambda is: 10000 bps
# Broad region calling is off
# Paired-End mode is off

INFO  @ Wed, 22 Mar 2017 12:07:28: #1 read tag files...
INFO  @ Wed, 22 Mar 2017 12:07:28: #1 read treatment tags...
INFO  @ Wed, 22 Mar 2017 12:07:28: Detected format is: BED
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 tag size is determined as 100 bps
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 tag size = 100
INFO  @ Wed, 22 Mar 2017 12:07:29: #1  total tags in treatment: 199977
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 user defined the maximum tags...
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s)
INFO  @ Wed, 22 Mar 2017 12:07:29: #1  tags after filtering in treatment: 199583
INFO  @ Wed, 22 Mar 2017 12:07:29: #1  Redundant rate of treatment: 0.00
INFO  @ Wed, 22 Mar 2017 12:07:29: #1 finished!
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 Build Peak Model...
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 looking for paired plus/minus strand peaks...
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 number of paired peaks: 4488
INFO  @ Wed, 22 Mar 2017 12:07:29: start model_add_line...
INFO  @ Wed, 22 Mar 2017 12:07:29: start X-correlation...
INFO  @ Wed, 22 Mar 2017 12:07:29: end of X-cor
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 finished!
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 predicted fragment length is 254 bps
INFO  @ Wed, 22 Mar 2017 12:07:29: #2 alternative fragment length(s) may be 254 bps
INFO  @ Wed, 22 Mar 2017 12:07:29: #2.2 Generate R script for model : NA_model.r
INFO  @ Wed, 22 Mar 2017 12:07:29: #3 Call peaks...
INFO  @ Wed, 22 Mar 2017 12:07:29: #3 Pre-compute pvalue-qvalue table...
INFO  @ Wed, 22 Mar 2017 12:07:30: #3 Call peaks for each chromosome...
INFO  @ Wed, 22 Mar 2017 12:07:30: #4 Write output xls file... NA_peaks.xls
INFO  @ Wed, 22 Mar 2017 12:07:30: #4 Write peak in narrowPeak format file... NA_peaks.narrowPeak
INFO  @ Wed, 22 Mar 2017 12:07:30: #4 Write summits bed file... NA_summits.bed
INFO  @ Wed, 22 Mar 2017 12:07:30: Done!
(TLPython)tliu4@srv-u10-28:1$head NA_peaks.narrowPeak
123abc: +-_[]{})(\/:.,?!^=1 919428  919740  NA_peak_1   40  .   5.20653 7.81607 4.03555 121
123abc: +-_[]{})(\/:.,?!^=1 1185654 1185908 NA_peak_2   37  .   5.09424 7.42382 3.73522 125
123abc: +-_[]{})(\/:.,?!^=1 1307407 1307661 NA_peak_3   40  .   5.20653 7.81607 4.03555 193
123abc: +-_[]{})(\/:.,?!^=1 2126579 2126833 NA_peak_4   28  .   4.33877 6.21931 2.81285 68
123abc: +-_[]{})(\/:.,?!^=1 2246572 2246949 NA_peak_5   37  .   5.09424 7.42382 3.73522 139
123abc: +-_[]{})(\/:.,?!^=1 2313278 2313532 NA_peak_6   37  .   5.09424 7.42382 3.73522 74
123abc: +-_[]{})(\/:.,?!^=1 2345913 2346167 NA_peak_7   58  .   6.64894 10.22048    5.85201 114
123abc: +-_[]{})(\/:.,?!^=1 2479551 2479806 NA_peak_8   61  .   7.03455 10.65279    6.16741 106
123abc: +-_[]{})(\/:.,?!^=1 3640928 3641354 NA_peak_9   98  .   9.41347 15.39576    9.89543 205
123abc: +-_[]{})(\/:.,?!^=1 3773721 3774031 NA_peak_10  61  .   6.79233 10.67463    6.16741 154

taoliu commented 7 years ago

My suspect is that the problem is not caused by chromosome name, but the number of "chromosomes". The current hash table used in MACS2 is not optimized for large number of "chromosomes". To save computational speed, each chromosome will have minimum 100K data points initially, so if you have 50k 'chromosomes', there will be a minimum number of 5G datapoints. If you multiply the number with the size of data structure, the final memory usage will be over the physical memory size of your computer. To tweak this, go to find the 'Constants.py' file inside MACS2 source code directory and modify the 'BUFFER_SIZE' to a small number such as 100 than install MACS2 again. Don't forget to remove the old installation entirely.

macs3-project / MACS

MACS2 segmentation fault due to chromosome names #170