maxplanck-ie / snakepipes

Customizable workflows based on snakemake and python for the analysis of NGS data
http://snakepipes.readthedocs.io
MIT License
374 stars 85 forks source link

MACS2 not using proper effective genome size #950

Closed sunta3iouxos closed 9 months ago

sunta3iouxos commented 9 months ago

Hello again,

This is my command:

ChIP-seq -d /scratch/tgeorgom/AP04/ --useSpikeInForNorm --getSizeFactorsFrom genome --peakCallerOptions "-f BAMPE" --keepTemp --sampleSheet /scratch/tgeorgom/AP04/pSer5POLII.tsv --windowSize 500 --plotFormat "pdf" mm10_gencodeM19_spikesTEST /scratch/tgeorgom/AP04/PolII_ChIPtype_all.yalm

and this is the command in the MACS2 log files:

callpeak -t split_bam/A006850324_209957_S1_L000_host.bam -f BAM --mfold 0 50 -g 2652783500 --nomodel --extsize 238 --keep-dup all --outdir MACS2 --name A006850324_209957_S1_L000_host.BAM -f BAMPE

The genome is set as mm10 that has an effective size of 1,6m reads and fere the -g is set to the human one

katsikora commented 9 months ago

Can you check what the genome_size is in the organism yaml in your output folder?

e.g. grep genome_size /scratch/tgeorgom/AP04/ChIP-seq_organism.yaml

According to https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.20/ , mm10 genome size is about 2.7Gb, which is similar to the number that's passed to MACS2 in the commandline you pasted above.

sunta3iouxos commented 9 months ago

Can you check what the genome_size is in the organism yaml in your output folder?

e.g. grep genome_size /scratch/tgeorgom/AP04/ChIP-seq_organism.yaml

According to https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_000001635.20/ , mm10 genome size is about 2.7Gb, which is similar to the number that's passed to MACS2 in the commandline you pasted above.

this is correct, but in the MACS2 site there is the following:

-g/--gsize

PLEASE assign this parameter to fit your needs!

It's the mappable genome size or effective genome size which is defined as the genome size which can be sequenced. Because of the repetitive features on the chromosomes, the actual mappable genome size will be smaller than the original size, about 90% or 70% of the genome size. The default hs -- 2.7e9 is recommended for human genome. Here are all precompiled parameters for effective genome size:

hs: 2.7e9
mm: 1.87e9
ce: 9e7
dm: 1.2e8

Users may want to use k-mer tools to simulate mapping of Xbps long reads to target genome, and to find the ideal effective genome size. However, usually by taking away the simple repeats and Ns from the total genome, one can get an approximate number of effective genome size. A slight difference in the number won't cause a big difference of peak calls, because this number is used to estimate a genome-wide noise level which is usually the least significant one compared with the local biases modeled by MACS.

Any idea?

katsikora commented 9 months ago

Oh I see. We pass the same genome size to MACS2 as we would to deeptools commands. The deeptools recommendations for genome sizes state:


Genome | Effective size
-- | --
GRCh37 | 2864785220
GRCh38 | 2913022398
GRCm37 | 2620345972
GRCm38 | 2652783500
dm3 | 162367812
dm6 | 142573017
GRCz10 | 1369631918
WBcel235 | 100286401
TAIR10 | 119481543

I'm not sure why the drastic difference for the mouse genome between MACS2 and deeptools. I think it would be good to cross-check with MACS2 authors.

sunta3iouxos commented 9 months ago

I'm not sure why the drastic difference for the mouse genome between MACS2 and deeptools. I think it would be good to cross-check with MACS2 authors.

I will. thank you for your comment. Your effective size is more reasonable as also described in the -gsize part.