maxplanck-ie / snakepipes

Customizable workflows based on snakemake and python for the analysis of NGS data
http://snakepipes.readthedocs.io
MIT License

spikeIn sample scaling factor calculation discrepancies #953

Closed by sunta3iouxos 4 months ago

sunta3iouxos commented 8 months ago

I am getting different scaling factors when I run multiBamSummary on my local PC than when I run the workflow on the server, although I expected them to be the same. I suspect that my local command is the issue. My command:

multiBamSummary bins --binSize 10000 --blackListFileName /home/tgeorgom/CUT-RUNTools-2.0/assemblies/mm10_gencodeM19_spikesTEST/annotation/blacklist.bed --ignoreDuplicates -p 8 --bamfiles /mnt/c/AP04/split_bam/*host*bam  -out /mnt/c/AP04/mergedBam/deeptools/multiBAM_SPIKE_bin1000.out.npz --scalingFactors /mnt/c/AP04/mergedBam/deeptools/multiBAM_spike_scaling_q2.txt  --minMappingQuality 2

Number of bins found: 273590

The output:

sample  scalingFactor
A006850324_209957_S1_L000_spikein.bam   1.0099
A006850324_209960_S2_L000_spikein.bam   1.1728
A006850324_209962_S3_L000_spikein.bam   0.8376
A006850324_209964_S4_L000_spikein.bam   1.0653
A006850324_209966_S5_L000_spikein.bam   0.8244
A006850324_209968_S6_L000_spikein.bam   0.9989
A006850324_209970_S7_L000_spikein.bam   1.0220
A006850324_209972_S8_L000_spikein.bam   1.0535
A006850324_209974_S9_L000_spikein.bam   1.1026
A006850324_209976_S10_L000_spikein.bam  1.0160
A006850324_209978_S11_L000_spikein.bam  1.0301
A006850324_209980_S12_L000_spikein.bam  1.0347
A006850324_209982_S13_L000_spikein.bam  1.0306
A006850324_209984_S14_L000_spikein.bam  0.9753
A006850324_209986_S15_L000_spikein.bam  0.9549
A006850324_209988_S16_L000_spikein.bam  0.9313
A006850324_209990_S17_L000_spikein.bam  0.8619
A006850324_209992_S18_L000_spikein.bam  1.0217

and this is what I am using on the server side:

ChIP-seq -d /scratch/tgeorgom/AP04/ --useSpikeInForNorm --getSizeFactorsFrom genome --sampleSheet /scratch/tgeorgom/AP04/pSer5POLII.tsv --windowSize 500 --plotFormat pdf mm10_gencodeM19_spikesTEST /scratch/tgeorgom/AP04/PolII_ChIPtype_all.yalm

Number of bins found: 4687

and the output is as follows:

sample  scalingFactor
A006850324_209957_S1_L000   0.9813
A006850324_209960_S2_L000   1.1281
A006850324_209962_S3_L000   0.8268
A006850324_209964_S4_L000   1.0348
A006850324_209966_S5_L000   0.8089
A006850324_209968_S6_L000   0.9749
A006850324_209970_S7_L000   0.9942
A006850324_209972_S8_L000   1.0490
A006850324_209974_S9_L000   1.0823
A006850324_209976_S10_L000  0.9764
A006850324_209978_S11_L000  0.9942
A006850324_209980_S12_L000  1.0081
A006850324_209982_S13_L000  0.9971
A006850324_209984_S14_L000  0.9630
A006850324_209986_S15_L000  0.9409
A006850324_209988_S16_L000  0.9228
A006850324_209990_S17_L000  0.8261
A006850324_209992_S18_L000  1.0030

Using a different --binSize to approximate the number of bins found on the server did not solve the issue:

multiBamSummary bins --binSize 500000 --ignoreDuplicates -p 8 --bamfiles /mnt/c/AP04/split_bam/*spikein*bam -out /mnt/c/AP04/mergedBam/deeptools/multiBAM_SPIKE_bin1000.out.npz --scalingFactors /mnt/c/AP04/mergedBam/deeptools/multiBAM_spike_scaling_q2Noblack_bin500000.txt  --minMappingQuality 2

Number of bins found: 5518

sample  scalingFactor
A006850324_209957_S1_L000_spikein.bam   1.0093
A006850324_209960_S2_L000_spikein.bam   1.1931
A006850324_209962_S3_L000_spikein.bam   0.8369
A006850324_209964_S4_L000_spikein.bam   1.0730
A006850324_209966_S5_L000_spikein.bam   0.8210
A006850324_209968_S6_L000_spikein.bam   1.0097
A006850324_209970_S7_L000_spikein.bam   1.0212
A006850324_209972_S8_L000_spikein.bam   1.0618
A006850324_209974_S9_L000_spikein.bam   1.1137
A006850324_209976_S10_L000_spikein.bam  1.0174
A006850324_209978_S11_L000_spikein.bam  1.0432
A006850324_209980_S12_L000_spikein.bam  1.0476
A006850324_209982_S13_L000_spikein.bam  1.0343
A006850324_209984_S14_L000_spikein.bam  0.9812
A006850324_209986_S15_L000_spikein.bam  0.9624
A006850324_209988_S16_L000_spikein.bam  0.9431
A006850324_209990_S17_L000_spikein.bam  0.8646
A006850324_209992_S18_L000_spikein.bam  1.0306
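
To quantify the discrepancy, the tables above can be compared sample by sample. The following is a minimal sketch, not part of snakePipes or deepTools: the function names `parse_scaling_factors` and `compare` are hypothetical, and it assumes the only naming difference between the local and server tables is the `_spikein.bam` suffix. Only three rows from the first local table and the server table are copied in for brevity.

```python
def parse_scaling_factors(text):
    """Parse a 'sample<whitespace>scalingFactor' table into a dict."""
    factors = {}
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 2 or fields[0] == "sample":
            continue  # skip the header and malformed lines
        name = fields[0]
        # Local runs label samples by file name; strip the suffix so the
        # names match the labels in the workflow output (assumption).
        if name.endswith("_spikein.bam"):
            name = name[: -len("_spikein.bam")]
        factors[name] = float(fields[1])
    return factors


def compare(local, server):
    """Return {sample: (local, server, local / server)} for shared samples."""
    return {
        s: (local[s], server[s], local[s] / server[s])
        for s in sorted(local.keys() & server.keys())
    }


# Rows copied from the tables in this thread.
local_table = """
sample  scalingFactor
A006850324_209957_S1_L000_spikein.bam   1.0099
A006850324_209960_S2_L000_spikein.bam   1.1728
A006850324_209990_S17_L000_spikein.bam  0.8619
"""
server_table = """
sample  scalingFactor
A006850324_209957_S1_L000   0.9813
A006850324_209960_S2_L000   1.1281
A006850324_209990_S17_L000  0.8261
"""

result = compare(parse_scaling_factors(local_table),
                 parse_scaling_factors(server_table))
for sample, (loc, srv, ratio) in result.items():
    print(f"{sample}\t{loc:.4f}\t{srv:.4f}\t{ratio:.4f}")
```

A ratio that is roughly constant across samples would point to a global difference in settings, while sample-specific ratios would suggest that different reads are being counted.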

katsikora commented 8 months ago

Hi, the default spikein_bin_size for calculating scaling factors from the spike-in genome is 1000. This should be visible in the workflow config.yaml in the output folder.
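
One quick way to check the value that was actually used is to look the key up in the written config file. This is a minimal sketch under assumptions: the helper name `read_config_value` is hypothetical, and it assumes the key appears on its own line as a simple `key: value` pair. It avoids a YAML library so that it runs with the standard library only.

```python
def read_config_value(config_text, key):
    """Return the value of the first 'key: value' line, or None if absent."""
    for line in config_text.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip() == key:
            # Drop a trailing comment and surrounding quotes, if any.
            return value.split("#", 1)[0].strip().strip("'\"")
    return None


# Hypothetical excerpt; the real file contains many more keys.
example = "spikein_bin_size: 1000\n"
print(read_config_value(example, "spikein_bin_size"))
```

With a real run, `config_text` would be the contents of the config.yaml in the workflow output folder, for example read with `pathlib.Path(...).read_text()`.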

In the first commandline you pasted, I've noticed you are passing host bam files to calculate spikein size factors. Then you're citing size factors from a file, in which they appear to be calculated on spikein bam files. Could you revisit that?

If you run snakePipes with --verbose, full shell commands will be returned in the log file. You can then see the full multiBamSummary command with all the parameters passed. Would that be helpful?
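
Once the full command has been copied from the log, comparing it flag by flag with the local command makes differences easy to spot. The sketch below is only an illustration: `parse_options` and `diff_options` are hypothetical helpers, and the parsing is a simple heuristic (every token starting with `-` is treated as a flag). It is demonstrated on the two local commands quoted in this thread; in practice the second command would be the one taken from the workflow log.

```python
import shlex


def parse_options(command):
    """Map each -flag/--flag to the list of values that follow it."""
    options = {}
    current = None
    for token in shlex.split(command):
        # Treat tokens such as --binSize or -p as flags, but not negative numbers.
        if token.startswith("-") and len(token) > 1 and not token[1].isdigit():
            current = token
            options[current] = []
        elif current is not None:
            options[current].append(token)
    return options


def diff_options(cmd_a, cmd_b):
    """Return {flag: (values_in_a, values_in_b)} for flags that differ."""
    a, b = parse_options(cmd_a), parse_options(cmd_b)
    return {
        flag: (a.get(flag), b.get(flag))
        for flag in sorted(a.keys() | b.keys())
        if a.get(flag) != b.get(flag)
    }


first_cmd = (
    "multiBamSummary bins --binSize 10000 --blackListFileName "
    "/home/tgeorgom/CUT-RUNTools-2.0/assemblies/mm10_gencodeM19_spikesTEST/annotation/blacklist.bed "
    "--ignoreDuplicates -p 8 --bamfiles /mnt/c/AP04/split_bam/*host*bam "
    "-out /mnt/c/AP04/mergedBam/deeptools/multiBAM_SPIKE_bin1000.out.npz "
    "--scalingFactors /mnt/c/AP04/mergedBam/deeptools/multiBAM_spike_scaling_q2.txt "
    "--minMappingQuality 2"
)
second_cmd = (
    "multiBamSummary bins --binSize 500000 --ignoreDuplicates -p 8 "
    "--bamfiles /mnt/c/AP04/split_bam/*spikein*bam "
    "-out /mnt/c/AP04/mergedBam/deeptools/multiBAM_SPIKE_bin1000.out.npz "
    "--scalingFactors /mnt/c/AP04/mergedBam/deeptools/multiBAM_spike_scaling_q2Noblack_bin500000.txt "
    "--minMappingQuality 2"
)

for flag, (a, b) in diff_options(first_cmd, second_cmd).items():
    print(f"{flag}: {a} vs {b}")
```

A flag that is present in only one command shows up with `None` on the other side, which highlights options such as a blacklist that were passed in one run but not the other.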

Best,

Katarzyna

sunta3iouxos commented 8 months ago

> In the first commandline you pasted, I've noticed you are passing host bam files to calculate spikein size factors. Then you're citing size factors from a file, in which they appear to be calculated on spikein bam files. Could you revisit that?

That was a typo on my side. I will come back after testing a few things.