Closed danledinh closed 3 years ago
Specifically, I am wondering how I can create a genomic features file to be compatible with an MNase-based protocol that does not use restriction enzymes.
we recommend using bin level GC content and mappability features, and removing the effective length feature. I'm modifying the code for genomic feature creation to generate these files. Hope to have the code along with precomputed mm10 and hg19 files later today or tomorrow. Sorry about the delay in responding.
I have added the features files for hg19, hg38, and mm10. I've also modified the code so you can use it to generate these files for other genomes. script is here
Thanks for providing the update!
I re-ran my analysis with both regular hg19 and hg19_mnase genomic features. With the regular hg19, everything goes to completion as expected. However, with the hg19_mnase, the MAPS module does not produce *.MAPS2_pospoisson files. Therefore, none of the downstream files are created, such as significant interactions bed file and the WashU track files.
I am using the Arima recommended settings:
plot=1 # "1" to generate plots, "0" to skip
generate_hic=1 # "1" to generate a .hic file for visualization with Juicer, "0" to skip
bin_size=5000 # Resolution of the loops called.
binning_range=2000000 # Maximum distance for loop calling
fdr=2 # Do not change! False Discovery Rate threshold 1 = 0.1, 2 = 0.01, 3 = 0.001, ect...
filter_file="None"
mapq=30 # Phred scaled mapping quality threshold
length_cutoff=1000 # Minimum genomic distance
model="pospoisson" # Regression Model for Loop calling. Options are: "pospoisson" and "negbinom"
sex_chroms_to_process="NA" # Which Sex Chromosomes should be processed
optical_duplicate_distance=$([ $patterned_flowcell == 1 ] && echo 2500 || echo 100) # v1.9_update
Can you also share the log files from the preprocessing (should be here:
Here is the feather log:
Sat Oct 24 09:49:21 2020 starting mapping and filtering operation
Sat Oct 24 09:49:21 2020 calling bwa for /gstore/data/dnaseq/processed_runs/R6516/results/LIB5435096_HITS5455050_R1.fastq.gz
Sat Oct 24 09:56:35 2020 calling bwa for /gstore/data/dnaseq/processed_runs/R6516/results/LIB5435096_HITS5455050_R2.fastq.gz
Sat Oct 24 10:01:47 2020 calling samtools sort for /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam storing in /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:02:02 2020 calling samtools sort for /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam storing in /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:02:17 2020 merging /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam.srtn and /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:03:31 2020 filtering and pairing reads
Sat Oct 24 10:05:28 2020 paired bam file generated. Calling fixmate
/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//LIB5435096_HITS5455050.paired
True
Sat Oct 24 10:05:44 2020 sorting by coordinates.
Sat Oct 24 10:06:02 2020 calling samtools markdup
Sat Oct 24 10:06:26 2020 calling samtools flagstat on mapped file
Sat Oct 24 10:06:39 2020 calling samtools flagstat on mapped and duplicate-removed file
Sat Oct 24 10:06:52 2020 calling samtools sort for sorting by query names
Sat Oct 24 10:07:17 2020 finishing filtering
Sat Oct 24 10:07:17 2020 starting the splitting operation
Sat Oct 24 10:07:17 2020 QC metrics will be computed from: /gnet/is6/p04/data/dnaseq/analysis/led13/inputs/chip_peaks/Rec-1_WO_H3K27ac_1kStep_peaks.narrowPeak and /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918/LIB5435096_HITS5455050.paired.rmdup.flagstat
Sat Oct 24 10:09:46 2020 generating long, intra-chromosomal bedpe file(s)
Sat Oct 24 10:10:18 2020 writing to the combined bedpe file
Sat Oct 24 10:10:45 2020 splitting completed
Sat Oct 24 09:49:21 2020 starting mapping and filtering operation
Sat Oct 24 09:49:21 2020 calling bwa for /gstore/data/dnaseq/processed_runs/R6516/results/LIB5435096_HITS5455050_R1.fastq.gz
Sat Oct 24 09:56:35 2020 calling bwa for /gstore/data/dnaseq/processed_runs/R6516/results/LIB5435096_HITS5455050_R2.fastq.gz
Sat Oct 24 10:01:47 2020 calling samtools sort for /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam storing in /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:02:02 2020 calling samtools sort for /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam storing in /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:02:17 2020 merging /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam.srtn and /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:03:31 2020 filtering and pairing reads
Sat Oct 24 10:05:28 2020 paired bam file generated. Calling fixmate
/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//LIB5435096_HITS5455050.paired
True
Sat Oct 24 10:05:44 2020 sorting by coordinates.
Sat Oct 24 10:06:02 2020 calling samtools markdup
Sat Oct 24 10:06:26 2020 calling samtools flagstat on mapped file
Sat Oct 24 10:06:39 2020 calling samtools flagstat on mapped and duplicate-removed file
Sat Oct 24 10:06:52 2020 calling samtools sort for sorting by query names
Sat Oct 24 10:07:17 2020 finishing filtering
Sat Oct 24 10:07:17 2020 starting the splitting operation
Sat Oct 24 10:07:17 2020 QC metrics will be computed from: /gnet/is6/p04/data/dnaseq/analysis/led13/inputs/chip_peaks/Rec-1_WO_H3K27ac_1kStep_peaks.narrowPeak and /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918/LIB5435096_HITS5455050.paired.rmdup.flagstat
Sat Oct 24 10:09:46 2020 generating long, intra-chromosomal bedpe file(s)
Sat Oct 24 10:10:18 2020 writing to the combined bedpe file
Sat Oct 24 10:10:45 2020 splitting completed
For the MNase features analysis, I found this error message in my slurm log:
Error in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2, :
vglm() only handles full-rank models (currently)
Calls: pospoisson_regression -> vglm -> vglm.fitter
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file '/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/MAPS_output/LIB5435096_HITS5455050_20201024_094918/LIB5435096_HITS5455050.5k.2.peaks': No such file or directory
Execution halted
bamFilesList: ['/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//LIB5435096_HITS5455050.paired.rmdup.bam']
Thanks! I can reproduce the same error. As I mentioned in the previous response we should remove the effective length from the regression, I had [mistakenly] made an assumption that if all "effective lengths" are the same, the regression modeler will automatically discard their column.
I will modify the code later to accept a new argument which will automatically fix for this, but in the meantime for a quick fix I recommend manually modifying the code in the _
before1 (line 105): fit <- vglm(count ~ logl + loggc + logm + logdist + logShortCount, family = pospoisson(), data = mm)
before2 (line 109): fit <- vglm(count ~ logl + loggc + logm + logdist + logShortCount, family = pospoisson(), data = m1)
to:
after1 (line 105): fit <- vglm(count ~ loggc + logm + logdist + logShortCount, family = pospoisson(), data = mm)
after2 (line 109): fit <- vglm(count ~ loggc + logm + logdist + logShortCount, family = pospoisson(), data = m1)
Thanks. I implemented the recommended change in the MAPS_regression_and_peak_caller.r
file. Now, *.MAPS2_pospoisson
files are generated. However, the significant interactions bed file is NOT generated. In addition, the WashU browser track files are NOT generated.
I see this error in my log:
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
cannot open file '/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/MAPS_output/LIB5435096_HITS5455050_20201026_144232/LIB5435096_HITS5455050.5k.2.peaks': No such file or directory
Execution halted
So, regression breaks b/c we set all effective lengths to the same value so there's no
variation associated with that predictor (oops, I should have thought about it earlier!)
The way to fix it is to just remove it from regression and then you can use normal
genomic feature file you would with regular MAPS (of the one Armen gave you, no difference).
The easiest way to change regression is to hack pospoisson_regression function in
MAPS_regression_and_peak_caller.r file
So, in your MAPS_regression_and_peak_caller.r file
(file looks like this: https://github.com/ijuric/MAPS/blob/master/bin/MAPS/MAPS_regression_and_peak_caller.r )
comment out your current pospoisson_regression function and replace it with this:
pospoisson_regression <- function(mm, dataset_length) {
fit <- vglm(count ~ loggc + logm + logdist + logShortCount, family =
pospoisson(), data = mm)
mm$expected = fitted(fit)
mm$p_val = ppois(mm$count, mm$expected, lower.tail = FALSE, log.p =
FALSE) / ppois(0, mm$expected, lower.tail = FALSE, log.p = FALSE)
m1 = mm[ mm$p_val > 1/length(mm$p_val),]
fit <- vglm(count ~ loggc + logm + logdist + logShortCount, family =
pospoisson(), data = m1)
coeff<-round(coef(fit),10)
mm$expected2 <- round(exp(coeff[1] + coeff[2]*mm$loggc +
coeff[3]*mm$logm +
coeff[4]mm$logdist + coeff[5]mm$logShortCount), 10)
mm$expected2 <- mm$expected2 /(1-exp(-mm$expected2))
mm$ratio2 <- mm$count / mm$expected2
mm$p_val_reg2 = ppois(mm$count, mm$expected2, lower.tail = FALSE, log.p
= FALSE) / ppois(0, mm$expected2, lower.tail = FALSE, log.p = FALSE)
mm$p_bonferroni = mm$p_val_reg2 * dataset_length
mm$fdr <- p.adjust(mm$p_val_reg2, method='fdr')
return(mm)
}
In this function I have removed logl from regression lines (and when calculating regression coefficients).
Then just run MAPS with your data as you would normally and it should work.
You'll be fitting a regression model which have gc, mappability, 1D distance and short counts,
but not effective length, and that is what you want since MNase cuts randomly.
Just be careful when doing this hack to revert to your old pospoisson_regression function when analyzing
PLAC-seq/HiChIP data that relies on effective length.
Hope this helps.
Best,
Ivan
On Mon, Oct 26, 2020 at 6:23 PM danledinh notifications@github.com wrote:
Thanks. I implemented the recommended change in the MAPS_regression_and_peak_caller.r file. Now, *.MAPS2_pospoisson files are generated. However, the significant interactions bed file is NOT generated. In addition, the WashU browser track files are NOT generated.
I see this error in my log:
Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file '/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/MAPS_output/LIB5435096_HITS5455050_20201026_144232/LIB5435096_HITS5455050.5k.2.peaks': No such file or directory Execution halted
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ijuric/MAPS/issues/25#issuecomment-716856397, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZ6SZMR4H5F5KAKIFO6OSTSMXZETANCNFSM4PWZVWLA .
That did the trick! Thanks.
ref: https://dovetailgenomics.com/hichip-product-page/
Specifically, how can I create genome features files for MNase-based protocols?