Closed arodel21 closed 9 months ago
Can you share the html reports?
Hmm...something doesn't look right.
(1) Can you share a browser session with the observed bigwig and peaks? (2) Can you share some stats about your input data, such as read depth, fraction of reads in peaks, and, if you have multiple replicates, some concordance metrics? (3) What is the content of your peaks? Can you share a few rows of your peak file?
Also, for completeness, please share the commands you are using to train the models.
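For reference, the fraction of reads in peaks requested above is the number of reads falling inside any peak divided by the total number of reads. Below is a minimal sketch in plain Python, assuming reads are reduced to a single 0-based position and peaks are half-open intervals. The function names and the toy data are hypothetical, and a real analysis would read the BAM and peak files with dedicated tools.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(peaks):
    """Group (chrom, start, end) peaks by chromosome and merge overlapping ones."""
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        merged = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        index[chrom] = ([s for s, _ in merged], [e for _, e in merged])
    return index


def frip(read_positions, peaks):
    """Fraction of (chrom, pos) reads that fall inside any peak."""
    index = build_index(peaks)
    total = 0
    in_peaks = 0
    for chrom, pos in read_positions:
        total += 1
        if chrom not in index:
            continue
        starts, ends = index[chrom]
        i = bisect_right(starts, pos) - 1  # last merged peak starting at or before pos
        if i >= 0 and pos < ends[i]:
            in_peaks += 1
    return in_peaks / total if total else 0.0


# Toy data, purely illustrative (not taken from the reporter's experiment).
toy_peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
toy_reads = [("chr1", 120), ("chr1", 299), ("chr1", 300), ("chr2", 10), ("chr3", 5)]
print(frip(toy_reads, toy_peaks))  # 2 of the 5 toy reads fall in a peak
```

Merging overlapping peaks first keeps the lookup to one binary search per read and avoids counting a read twice.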
Thanks for the reply.
Just to confirm, and considering the parameters used in the chrombpnet pipeline command: by bigwig, do you mean the bigwig generated from the BAM file passed via -ibam?
Once you confirm, I will get you the stats.
I have been thinking that the correlation difference might be due to the background of the reads and the peaks. The peaks are genomic regions specific to a cell type, while the reads contain multiple cell types, including the one used for the peaks. Do you think this could influence the Pearson correlation score at all?
The commands I used are:
For creating nonpeaks:
bedtools slop -i $blacklist -g $chrom_sizes -b 1057 > temp.bed
bedtools intersect -v -a $peaks -b temp.bed > peaks_no_blacklist.bed
chrombpnet prep nonpeaks -g $genome -p peaks_no_blacklist.bed -c $chrom_sizes -fl $splits -br $blacklist -o output -il 2114
where $genome is GRCz11.fa.
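The 1057 passed to bedtools slop equals half of the 2114 given to -il. A plausible reading, stated here as an assumption rather than something the thread confirms, is that padding the blacklist by half the input length keeps any 2114 bp window centered inside a retained peak from reaching a blacklisted region. The sketch below mimics the slop and intersect -v steps on hypothetical intervals. The helper names and coordinates are made up for illustration.

```python
INPUT_LEN = 2114        # value passed to -il in the nonpeaks command
FLANK = INPUT_LEN // 2  # 1057, the -b value given to bedtools slop


def slop(interval, chrom_sizes, flank=FLANK):
    """Extend (chrom, start, end) by `flank` bp on each side, clamped to the chromosome."""
    chrom, start, end = interval
    return chrom, max(0, start - flank), min(chrom_sizes[chrom], end + flank)


def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


# Hypothetical chromosome size, blacklist entry, and peaks (illustration only).
sizes = {"chr1": 10_000}
blacklist_region = ("chr1", 5_000, 5_100)
padded = slop(blacklist_region, sizes)

peak_near = ("chr1", 5_900, 6_200)  # starts within 1057 bp of the blacklist entry
peak_far = ("chr1", 7_000, 7_300)

# Like `bedtools intersect -v`: keep only peaks with no overlap at all.
kept = [p for p in (peak_near, peak_far) if not overlaps(p, padded)]
print(padded)
print(kept)
```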
For training the bias model:
chrombpnet bias pipeline -ibam $bam -d "ATAC" -g $genome -c $chrom_sizes -p peaks_no_blacklist.bed -n output_negatives.bed -fl $fold -b 0.5 -o bias_model/ -fp k562
For training ChromBPNet model:
chrombpnet pipeline -ibam $bam -d "ATAC" -g $genome -c $chrom_sizes -p peaks_no_blacklist.bed -n output_negatives.bed -fl $fold -b bias_model/models/k562_bias.h5 -o $output
The peaks are zebrafish embryo enhancer regions for a specific cell type. Here are some of them:
chr1 5410234 5410901 . . . . . . 333
chr1 7893549 7894421 . . . . . . 436
chr1 59180459 59181068 . . . . . . 304
chr2 40107174 40107673 . . . . . . 249
chr4 129992 130402 . . . . . . 205
chr4 5254977 5255424 . . . . . . 223
chr4 8496741 8497364 . . . . . . 311
chr7 29864191 29864685 . . . . . . 247
chr7 43838799 43839225 . . . . . . 213
chr8 30442994 30443692 . . . . . . 349
chr8 38527582 38528079 . . . . . . 248
chr9 29665717 29666279 . . . . . . 281
chr9 42978827 42979288 . . . . . . 230
chr14 33296076 33297494 . . . . . . 709
chr15 2799590 2800235 . . . . . . 322
chr15 9888201 9888589 . . . . . . 194
chr15 31109078 31109710 . . . . . . 316
chr19 18991133 18991665 . . . . . . 266
chr23 23042932 23043267 . . . . . . 167
chr25 2641708 2642175 . . . . . . 233
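A quick sanity check on rows like these is to confirm that each has ten columns and to compare the last column with the peak width. The sketch below assumes the narrowPeak convention, in which column 10 is the summit offset from the start coordinate. It uses four rows copied from the list above, and the function names are hypothetical.

```python
ROWS = """\
chr1 5410234 5410901 . . . . . . 333
chr4 129992 130402 . . . . . . 205
chr14 33296076 33297494 . . . . . . 709
chr23 23042932 23043267 . . . . . . 167
"""


def parse_peak(line):
    """Return (chrom, start, end, summit_offset) from a 10-column peak row."""
    fields = line.split()  # works for tab- or space-separated rows
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
    return fields[0], int(fields[1]), int(fields[2]), int(fields[9])


def summarize(text):
    """For each row, report width, summit offset, and whether the summit is the midpoint."""
    out = []
    for line in text.splitlines():
        chrom, start, end, summit = parse_peak(line)
        width = end - start
        if not 0 <= summit < width:
            raise ValueError(f"summit offset {summit} outside peak of width {width}")
        out.append((chrom, width, summit, summit == width // 2))
    return out


for row in summarize(ROWS):
    print(*row)
```

For these four rows the last column equals the peak width divided by two, rounded down. That suggests the summit column was filled with the region midpoint rather than a summit reported by a peak caller. Whether that matters for training is not something this thread establishes.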
When you say "The peaks are genomic regions specific to a cell type, while the reads contain multiple cell types", are you merging reads across multiple cell types (which ones?) while the peaks themselves are specific to one cell type (again, which one)?
What is the goal of your model?
Closing this due to inactivity. Feel free to reopen it if you continue to see issues.
Hello again
I hope you have all enjoyed the winter break.
I just wanted to ask for advice on how to improve the model performance, specifically ChromBPNet's Pearson correlation score in peaks. The overall report shows a pearsonr of 0.197, which is below the 0.5 threshold that a well-performing model should exceed. Do you have any thoughts on what could cause the Pearson correlation score to be this low?
Thanks in advance.
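To make the metric concrete, here is a minimal sketch of a Pearson correlation between observed and predicted per-peak counts. It assumes the reported pearsonr compares log-transformed total counts per region, which is an assumption about the report and not something stated in this thread. All counts below are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def log_counts(counts):
    """log(1 + count), so that zero-count regions stay finite."""
    return [math.log1p(c) for c in counts]


# Made-up per-peak read counts (illustration only).
observed = [120, 340, 95, 800, 410, 60]
predicted_tracking = [130, 300, 110, 700, 450, 75]  # follows the observed ranking
predicted_flat = [300, 310, 305, 295, 302, 298]     # nearly the same count everywhere

r_tracking = pearson_r(log_counts(observed), log_counts(predicted_tracking))
r_flat = pearson_r(log_counts(observed), log_counts(predicted_flat))
print(f"tracking: {r_tracking:.3f}  flat: {r_flat:.3f}")
```

In this toy case, predictions that barely vary across peaks give a correlation near zero even though their average count is reasonable. So one thing worth checking is whether the predicted counts vary across peaks at all.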