Open jnktsj opened 7 years ago
Hi @jnktsj, I just tested a BED file with about 200k lines and saw no problem. I gave a weird chromosome name prefix '123abc: +-_[]{})(\/:.,?!^='. Could you share me your BED file?
(TLPython)tliu4@srv-u10-28:1$macs2 --version macs2 2.1.1.20160309 (TLPython)tliu4@srv-u10-28:1$head test.bed 123abc: +-_[]{})(\/:.,?!^=1 564917 565017 . . + 123abc: +-_[]{})(\/:.,?!^=1 565533 565633 . . + 123abc: +-_[]{})(\/:.,?!^=1 565844 565944 . . + 123abc: +-_[]{})(\/:.,?!^=1 565858 565958 . . + 123abc: +-_[]{})(\/:.,?!^=1 566132 566232 . . + 123abc: +-_[]{})(\/:.,?!^=1 566212 566312 . . + 123abc: +-_[]{})(\/:.,?!^=1 566736 566836 . . + 123abc: +-_[]{})(\/:.,?!^=1 566976 567076 . . + 123abc: +-_[]{})(\/:.,?!^=1 567211 567311 . . + 123abc: +-_[]{})(\/:.,?!^=1 567320 567420 . . + (TLPython)tliu4@srv-u10-28:1$macs2 callpeak -t test.bed INFO @ Wed, 22 Mar 2017 12:07:28: # Command line: callpeak -t test.bed # ARGUMENTS LIST: # name = NA # format = AUTO # ChIP-seq file = ['test.bed'] # control file = None # effective genome size = 2.70e+09 # band width = 300 # model fold = [5, 50] # qvalue cutoff = 5.00e-02 # Larger dataset will be scaled towards smaller dataset. # Range for calculating regional lambda is: 10000 bps # Broad region calling is off # Paired-End mode is off INFO @ Wed, 22 Mar 2017 12:07:28: #1 read tag files... INFO @ Wed, 22 Mar 2017 12:07:28: #1 read treatment tags... INFO @ Wed, 22 Mar 2017 12:07:28: Detected format is: BED INFO @ Wed, 22 Mar 2017 12:07:29: #1 tag size is determined as 100 bps INFO @ Wed, 22 Mar 2017 12:07:29: #1 tag size = 100 INFO @ Wed, 22 Mar 2017 12:07:29: #1 total tags in treatment: 199977 INFO @ Wed, 22 Mar 2017 12:07:29: #1 user defined the maximum tags... INFO @ Wed, 22 Mar 2017 12:07:29: #1 filter out redundant tags at the same location and the same strand by allowing at most 1 tag(s) INFO @ Wed, 22 Mar 2017 12:07:29: #1 tags after filtering in treatment: 199583 INFO @ Wed, 22 Mar 2017 12:07:29: #1 Redundant rate of treatment: 0.00 INFO @ Wed, 22 Mar 2017 12:07:29: #1 finished! INFO @ Wed, 22 Mar 2017 12:07:29: #2 Build Peak Model... INFO @ Wed, 22 Mar 2017 12:07:29: #2 looking for paired plus/minus strand peaks... INFO @ Wed, 22 Mar 2017 12:07:29: #2 number of paired peaks: 4488 INFO @ Wed, 22 Mar 2017 12:07:29: start model_add_line... INFO @ Wed, 22 Mar 2017 12:07:29: start X-correlation... INFO @ Wed, 22 Mar 2017 12:07:29: end of X-cor INFO @ Wed, 22 Mar 2017 12:07:29: #2 finished! INFO @ Wed, 22 Mar 2017 12:07:29: #2 predicted fragment length is 254 bps INFO @ Wed, 22 Mar 2017 12:07:29: #2 alternative fragment length(s) may be 254 bps INFO @ Wed, 22 Mar 2017 12:07:29: #2.2 Generate R script for model : NA_model.r INFO @ Wed, 22 Mar 2017 12:07:29: #3 Call peaks... INFO @ Wed, 22 Mar 2017 12:07:29: #3 Pre-compute pvalue-qvalue table... INFO @ Wed, 22 Mar 2017 12:07:30: #3 Call peaks for each chromosome... INFO @ Wed, 22 Mar 2017 12:07:30: #4 Write output xls file... NA_peaks.xls INFO @ Wed, 22 Mar 2017 12:07:30: #4 Write peak in narrowPeak format file... NA_peaks.narrowPeak INFO @ Wed, 22 Mar 2017 12:07:30: #4 Write summits bed file... NA_summits.bed INFO @ Wed, 22 Mar 2017 12:07:30: Done! (TLPython)tliu4@srv-u10-28:1$head NA_peaks.narrowPeak 123abc: +-_[]{})(\/:.,?!^=1 919428 919740 NA_peak_1 40 . 5.20653 7.81607 4.03555 121 123abc: +-_[]{})(\/:.,?!^=1 1185654 1185908 NA_peak_2 37 . 5.09424 7.42382 3.73522 125 123abc: +-_[]{})(\/:.,?!^=1 1307407 1307661 NA_peak_3 40 . 5.20653 7.81607 4.03555 193 123abc: +-_[]{})(\/:.,?!^=1 2126579 2126833 NA_peak_4 28 . 4.33877 6.21931 2.81285 68 123abc: +-_[]{})(\/:.,?!^=1 2246572 2246949 NA_peak_5 37 . 5.09424 7.42382 3.73522 139 123abc: +-_[]{})(\/:.,?!^=1 2313278 2313532 NA_peak_6 37 . 5.09424 7.42382 3.73522 74 123abc: +-_[]{})(\/:.,?!^=1 2345913 2346167 NA_peak_7 58 . 6.64894 10.22048 5.85201 114 123abc: +-_[]{})(\/:.,?!^=1 2479551 2479806 NA_peak_8 61 . 7.03455 10.65279 6.16741 106 123abc: +-_[]{})(\/:.,?!^=1 3640928 3641354 NA_peak_9 98 . 9.41347 15.39576 9.89543 205 123abc: +-_[]{})(\/:.,?!^=1 3773721 3774031 NA_peak_10 61 . 6.79233 10.67463 6.16741 154
My suspect is that the problem is not caused by chromosome name, but the number of "chromosomes". The current hash table used in MACS2 is not optimized for large number of "chromosomes". To save computational speed, each chromosome will have minimum 100K data points initially, so if you have 50k 'chromosomes', there will be a minimum number of 5G datapoints. If you multiply the number with the size of data structure, the final memory usage will be over the physical memory size of your computer. To tweak this, go to find the 'Constants.py' file inside MACS2 source code directory and modify the 'BUFFER_SIZE' to a small number such as 100 than install MACS2 again. Don't forget to remove the old installation entirely.
Hi, I just wanted to let you know that MACS2 failed to finish when reference chromosome names contain symbols (e.g.
2L52.1a:gene=WBGene00007063
,ZK856.2:type=Transposon-mRNA:gene=WBGene00023448
, etc). It stops right after#3 Pre-compute pvalue-qvalue table
. I used BED for the run.