ijuric / MAPS

18 stars 11 forks source link

is MAPS compatible with MNase-based HiChIP data? #25

Closed danledinh closed 3 years ago

danledinh commented 3 years ago

ref: https://dovetailgenomics.com/hichip-product-page/

Specifically, how can I create genome features files for MNase-based protocols?

danledinh commented 3 years ago

Specifically, I am wondering how I can create a genomic features file to be compatible with an MNase-based protocol that does not use restriction enzymes.

armenabnousi commented 3 years ago

we recommend using bin level GC content and mappability features, and removing the effective length feature. I'm modifying the code for genomic feature creation to generate these files. Hope to have the code along with precomputed mm10 and hg19 files later today or tomorrow. Sorry about the delay in responding.

armenabnousi commented 3 years ago

I have added the features files for hg19, hg38, and mm10. I've also modified the code so you can use it to generate these files for other genomes. script is here

danledinh commented 3 years ago

Thanks for providing the update!

I re-ran my analysis with both regular hg19 and hg19_mnase genomic features. With the regular hg19, everything goes to completion as expected. However, with the hg19_mnase, the MAPS module does not produce *.MAPS2_pospoisson files. Therefore, none of the downstream files are created, such as significant interactions bed file and the WashU track files.

I am using the Arima recommended settings:

plot=1 # "1" to generate plots, "0" to skip
generate_hic=1 # "1" to generate a .hic file for visualization with Juicer, "0" to skip
bin_size=5000 # Resolution of the loops called.
binning_range=2000000 # Maximum distance for loop calling
fdr=2 # Do not change! False Discovery Rate threshold 1 = 0.1, 2 = 0.01, 3 = 0.001, ect...
filter_file="None"
mapq=30 # Phred scaled mapping quality threshold
length_cutoff=1000 # Minimum genomic distance
model="pospoisson" # Regression Model for Loop calling. Options are: "pospoisson" and "negbinom"
sex_chroms_to_process="NA" # Which Sex Chromosomes should be processed

optical_duplicate_distance=$([ $patterned_flowcell == 1 ] && echo 2500 || echo 100) # v1.9_update
armenabnousi commented 3 years ago

Can you also share the log files from the preprocessing (should be here: /feather_output/*_current/*.feather.log If there is no error in this file then the problem should be on loop calling itself, and would most likely throw an error message during the run.

danledinh commented 3 years ago

Here is the feather log:

Sat Oct 24 09:49:21 2020 starting mapping and filtering operation
Sat Oct 24 09:49:21 2020 calling bwa for /gstore/data/dnaseq/processed_runs/R6516/results/LIB5435096_HITS5455050_R1.fastq.gz
Sat Oct 24 09:56:35 2020 calling bwa for /gstore/data/dnaseq/processed_runs/R6516/results/LIB5435096_HITS5455050_R2.fastq.gz
Sat Oct 24 10:01:47 2020 calling samtools sort for /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam storing in /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:02:02 2020 calling samtools sort for /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam storing in /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:02:17 2020 merging /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam.srtn and /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:03:31 2020 filtering and pairing reads
Sat Oct 24 10:05:28 2020 paired bam file generated. Calling fixmate
/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//LIB5435096_HITS5455050.paired
True
Sat Oct 24 10:05:44 2020 sorting by coordinates.
Sat Oct 24 10:06:02 2020 calling samtools markdup
Sat Oct 24 10:06:26 2020 calling samtools flagstat on mapped file
Sat Oct 24 10:06:39 2020 calling samtools flagstat on mapped and duplicate-removed file
Sat Oct 24 10:06:52 2020 calling samtools sort for sorting by query names
Sat Oct 24 10:07:17 2020 finishing filtering
Sat Oct 24 10:07:17 2020 starting the splitting operation
Sat Oct 24 10:07:17 2020 QC metrics will be computed from: /gnet/is6/p04/data/dnaseq/analysis/led13/inputs/chip_peaks/Rec-1_WO_H3K27ac_1kStep_peaks.narrowPeak and /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918/LIB5435096_HITS5455050.paired.rmdup.flagstat
Sat Oct 24 10:09:46 2020 generating long, intra-chromosomal bedpe file(s)
Sat Oct 24 10:10:18 2020 writing to the combined bedpe file
Sat Oct 24 10:10:45 2020 splitting completed
Sat Oct 24 09:49:21 2020 starting mapping and filtering operation
Sat Oct 24 09:49:21 2020 calling bwa for /gstore/data/dnaseq/processed_runs/R6516/results/LIB5435096_HITS5455050_R1.fastq.gz
Sat Oct 24 09:56:35 2020 calling bwa for /gstore/data/dnaseq/processed_runs/R6516/results/LIB5435096_HITS5455050_R2.fastq.gz
Sat Oct 24 10:01:47 2020 calling samtools sort for /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam storing in /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:02:02 2020 calling samtools sort for /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam storing in /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:02:17 2020 merging /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R1.fastq.gz.bwa.sam.srtn and /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//tempfiles/LIB5435096_HITS5455050_R2.fastq.gz.bwa.sam.srtn
Sat Oct 24 10:03:31 2020 filtering and pairing reads
Sat Oct 24 10:05:28 2020 paired bam file generated. Calling fixmate
/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//LIB5435096_HITS5455050.paired
True
Sat Oct 24 10:05:44 2020 sorting by coordinates.
Sat Oct 24 10:06:02 2020 calling samtools markdup
Sat Oct 24 10:06:26 2020 calling samtools flagstat on mapped file
Sat Oct 24 10:06:39 2020 calling samtools flagstat on mapped and duplicate-removed file
Sat Oct 24 10:06:52 2020 calling samtools sort for sorting by query names
Sat Oct 24 10:07:17 2020 finishing filtering
Sat Oct 24 10:07:17 2020 starting the splitting operation
Sat Oct 24 10:07:17 2020 QC metrics will be computed from: /gnet/is6/p04/data/dnaseq/analysis/led13/inputs/chip_peaks/Rec-1_WO_H3K27ac_1kStep_peaks.narrowPeak and /gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918/LIB5435096_HITS5455050.paired.rmdup.flagstat
Sat Oct 24 10:09:46 2020 generating long, intra-chromosomal bedpe file(s)
Sat Oct 24 10:10:18 2020 writing to the combined bedpe file
Sat Oct 24 10:10:45 2020 splitting completed
danledinh commented 3 years ago

For the MNase features analysis, I found this error message in my slurm log:

Error in vglm.fitter(x = x, y = y, w = w, offset = offset, Xm2 = Xm2,  : 
  vglm() only handles full-rank models (currently)
Calls: pospoisson_regression -> vglm -> vglm.fitter
Execution halted
Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/MAPS_output/LIB5435096_HITS5455050_20201024_094918/LIB5435096_HITS5455050.5k.2.peaks': No such file or directory
Execution halted
bamFilesList: ['/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/feather_output/LIB5435096_HITS5455050_20201024_094918//LIB5435096_HITS5455050.paired.rmdup.bam']
armenabnousi commented 3 years ago

Thanks! I can reproduce the same error. As I mentioned in the previous response we should remove the effective length from the regression, I had [mistakenly] made an assumption that if all "effective lengths" are the same, the regression modeler will automatically discard their column.

I will modify the code later to accept a new argument which will automatically fix for this, but in the meantime for a quick fix I recommend manually modifying the code in the _/bin/MAPS/MAPS_regression_and_peakcaller.r on lines 105 and 109, by removing the "logl" variable: i.e.:

before1 (line 105): fit <- vglm(count ~ logl + loggc + logm + logdist + logShortCount, family = pospoisson(), data = mm) 
before2 (line 109): fit <- vglm(count ~ logl + loggc + logm + logdist + logShortCount, family = pospoisson(), data = m1)

to:

after1 (line 105): fit <- vglm(count ~ loggc + logm + logdist + logShortCount, family = pospoisson(), data = mm) 
after2 (line 109): fit <- vglm(count ~ loggc + logm + logdist + logShortCount, family = pospoisson(), data = m1)
danledinh commented 3 years ago

Thanks. I implemented the recommended change in the MAPS_regression_and_peak_caller.r file. Now, *.MAPS2_pospoisson files are generated. However, the significant interactions bed file is NOT generated. In addition, the WashU browser track files are NOT generated.

I see this error in my log:

Error in file(file, "rt") : cannot open the connection
Calls: read.table -> file
In addition: Warning message:
In file(file, "rt") :
  cannot open file '/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/MAPS_output/LIB5435096_HITS5455050_20201026_144232/LIB5435096_HITS5455050.5k.2.peaks': No such file or directory
Execution halted
ijuric commented 3 years ago

So, regression breaks b/c we set all effective lengths to the same value so there's no

variation associated with that predictor (oops, I should have thought about it earlier!)

The way to fix it is to just remove it from regression and then you can use normal

genomic feature file you would with regular MAPS (of the one Armen gave you, no difference).

The easiest way to change regression is to hack pospoisson_regression function in

MAPS_regression_and_peak_caller.r file

So, in your MAPS_regression_and_peak_caller.r file

(file looks like this: https://github.com/ijuric/MAPS/blob/master/bin/MAPS/MAPS_regression_and_peak_caller.r )

comment out your current pospoisson_regression function and replace it with this:

pospoisson_regression <- function(mm, dataset_length) {

fit <- vglm(count ~ loggc + logm + logdist + logShortCount, family =

pospoisson(), data = mm)

mm$expected = fitted(fit)

mm$p_val = ppois(mm$count, mm$expected, lower.tail = FALSE, log.p =

FALSE) / ppois(0, mm$expected, lower.tail = FALSE, log.p = FALSE)

m1 = mm[ mm$p_val > 1/length(mm$p_val),]

fit <- vglm(count ~ loggc + logm + logdist + logShortCount, family =

pospoisson(), data = m1)

coeff<-round(coef(fit),10)

mm$expected2 <- round(exp(coeff[1] + coeff[2]*mm$loggc +

coeff[3]*mm$logm +

coeff[4]mm$logdist + coeff[5]mm$logShortCount), 10)

mm$expected2 <- mm$expected2 /(1-exp(-mm$expected2))

mm$ratio2 <- mm$count / mm$expected2

mm$p_val_reg2 = ppois(mm$count, mm$expected2, lower.tail = FALSE, log.p

= FALSE) / ppois(0, mm$expected2, lower.tail = FALSE, log.p = FALSE)

mm$p_bonferroni = mm$p_val_reg2 * dataset_length

mm$fdr <- p.adjust(mm$p_val_reg2, method='fdr')

return(mm)

}

In this function I have removed logl from regression lines (and when calculating regression coefficients).

Then just run MAPS with your data as you would normally and it should work.

You'll be fitting a regression model which have gc, mappability, 1D distance and short counts,

but not effective length, and that is what you want since MNase cuts randomly.

Just be careful when doing this hack to revert to your old pospoisson_regression function when analyzing

PLAC-seq/HiChIP data that relies on effective length.

Hope this helps.

Best,

Ivan

On Mon, Oct 26, 2020 at 6:23 PM danledinh notifications@github.com wrote:

Thanks. I implemented the recommended change in the MAPS_regression_and_peak_caller.r file. Now, *.MAPS2_pospoisson files are generated. However, the significant interactions bed file is NOT generated. In addition, the WashU browser track files are NOT generated.

I see this error in my log:

Error in file(file, "rt") : cannot open the connection Calls: read.table -> file In addition: Warning message: In file(file, "rt") : cannot open file '/gnet/is6/p04/data/dnaseq/analysis/led13/outputs/R6516_shallow_MAPS_external_1kBin_rerun_mnase/MAPS_output/LIB5435096_HITS5455050_20201026_144232/LIB5435096_HITS5455050.5k.2.peaks': No such file or directory Execution halted

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ijuric/MAPS/issues/25#issuecomment-716856397, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACZ6SZMR4H5F5KAKIFO6OSTSMXZETANCNFSM4PWZVWLA .

danledinh commented 3 years ago

That did the trick! Thanks.