Model performance: Pearson correlation score

kundajelab / chrombpnet

Bias factorized, base-resolution deep learning models of chromatin accessibility (chromBPNet)

https://github.com/kundajelab/chrombpnet/wiki

MIT License


Model performance: Pearson correlation score #173

Closed arodel21 closed 9 months ago

arodel21 commented 9 months ago

Hello again

I hope you have all enjoyed the winter break.

I wanted to ask for advice on improving model performance, specifically ChromBPNet's Pearson correlation score in peaks. The overall report shows a pearsonr of 0.197, which is below the 0.5 threshold that a well-performing model should exceed. Do you have any thoughts on what could cause the Pearson correlation to be this low?

Thanks in advance.
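For context, here is a minimal sketch of how a Pearson correlation over peaks can be computed. It assumes the metric compares log-transformed observed read counts per peak against the model's predicted log counts; the exact transformation used in the report is not shown in this thread, and the numbers below are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: observed reads per peak and predicted log counts per peak.
observed_counts = [120, 45, 300, 80, 15]
predicted_log_counts = [4.6, 3.9, 5.5, 4.4, 3.0]

# Assumption: observed counts are compared on a log(1 + count) scale.
observed_log_counts = [math.log1p(c) for c in observed_counts]
r = pearson_r(observed_log_counts, predicted_log_counts)
print(f"Pearson r in peaks: {r:.3f}")
```

A value near 0.2 on real data would mean the predicted log counts barely track the observed coverage across peaks.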

panushri25 commented 9 months ago

Can you share the html reports?

arodel21 commented 9 months ago

Here it goes!

overall_report.pdf

Thank you!

panushri25 commented 9 months ago

Hmm...something doesn't look right.

(1) Can you share a browser session with the observed bigwig and peaks?
(2) Can you share some stats about your input data: read depth, fraction of reads in peaks and, if you have multiple replicates, some concordance metrics?
(3) What is the content of your peaks? Can you share a few rows from your peak file?
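To make the "fraction of reads in peaks" statistic concrete, here is a small pure-Python sketch. It assumes reads are reduced to a single (chromosome, position) pair and that peaks use half-open BED coordinates; the function names and toy data are illustrative only. On real data this is usually computed from the BAM file with tools such as samtools or bedtools.

```python
from bisect import bisect_right

def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals so no read is counted twice."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged

def fraction_of_reads_in_peaks(read_positions, peaks):
    """read_positions: [(chrom, pos)]; peaks: [(chrom, start, end)], half-open."""
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    merged = {c: merge_intervals(iv) for c, iv in by_chrom.items()}
    starts = {c: [s for s, _ in iv] for c, iv in merged.items()}

    in_peaks = 0
    for chrom, pos in read_positions:
        if chrom not in merged:
            continue
        # Find the last merged interval starting at or before this position.
        i = bisect_right(starts[chrom], pos) - 1
        if i >= 0 and pos < merged[chrom][i][1]:
            in_peaks += 1
    return in_peaks / len(read_positions) if read_positions else 0.0

# Toy data (hypothetical): two overlapping chr1 peaks and one chr2 peak.
peaks = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
reads = [("chr1", 120), ("chr1", 299), ("chr1", 300), ("chr2", 10), ("chr3", 5)]
frip = fraction_of_reads_in_peaks(reads, peaks)
print(f"FRiP: {frip:.2f}")
```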

panushri25 commented 9 months ago

Also for completeness share the command you are using to train the models.

arodel21 commented 9 months ago

Thanks for the reply.

Just to confirm, given the parameters used in the chrombpnet pipeline command: by bigwig, do you mean the bigwig generated from the -ibam file?

After you confirm I will get you the stats.

I have been thinking that the low correlation might be due to a mismatch between the cell-type background of the reads and that of the peaks. The peaks are genomic regions specific to one cell type, while the reads come from multiple cell types, including the one used for the peaks. Do you think this could influence the Pearson correlation score at all?

arodel21 commented 9 months ago

The commands I used are

For creating nonpeaks

bedtools slop -i $blacklist -g $chrom_sizes -b 1057 > temp.bed
bedtools intersect -v -a $peaks -b temp.bed > peaks_no_blacklist.bed
chrombpnet prep nonpeaks -g $genome -p peaks_no_blacklist.bed -c $chrom_sizes -fl $splits -br $blacklist -o output -il 2114

Where genome is GRCz11.fa.

For training the bias model

chrombpnet bias pipeline -ibam $bam -d "ATAC" -g $genome -c $chrom_sizes -p peaks_no_blacklist.bed -n output_negatives.bed -fl $fold -b 0.5 -o bias_model/ -fp k562

For training the ChromBPNet model

chrombpnet pipeline -ibam $bam -d "ATAC" -g $genome -c $chrom_sizes -p peaks_no_blacklist.bed -n output_negatives.bed -fl $fold -b bias_model/models/k562_bias.h5 -o $output

arodel21 commented 9 months ago

The peaks are enhancer regions from zebrafish embryos for a specific cell type. Here are some of them:

chr1    5410234 5410901 .   .   .   .   .   .   333
chr1    7893549 7894421 .   .   .   .   .   .   436
chr1    59180459    59181068    .   .   .   .   .   .   304
chr2    40107174    40107673    .   .   .   .   .   .   249
chr4    129992  130402  .   .   .   .   .   .   205
chr4    5254977 5255424 .   .   .   .   .   .   223
chr4    8496741 8497364 .   .   .   .   .   .   311
chr7    29864191    29864685    .   .   .   .   .   .   247
chr7    43838799    43839225    .   .   .   .   .   .   213
chr8    30442994    30443692    .   .   .   .   .   .   349
chr8    38527582    38528079    .   .   .   .   .   .   248
chr9    29665717    29666279    .   .   .   .   .   .   281
chr9    42978827    42979288    .   .   .   .   .   .   230
chr14   33296076    33297494    .   .   .   .   .   .   709
chr15   2799590 2800235 .   .   .   .   .   .   322
chr15   9888201 9888589 .   .   .   .   .   .   194
chr15   31109078    31109710    .   .   .   .   .   .   316
chr19   18991133    18991665    .   .   .   .   .   .   266
chr23   23042932    23043267    .   .   .   .   .   .   167
chr25   2641708 2642175 .   .   .   .   .   .   233
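A quick sanity check on rows like these can be sketched as follows. It assumes the file follows the 10-column narrowPeak layout, with the last column holding the summit offset from the start, and that the model input is a 2114 bp window centred on the summit (matching the -il 2114 value above). The helper names are hypothetical and not part of chrombpnet.

```python
def parse_narrowpeak_line(line):
    """Parse one narrowPeak row and validate its coordinates and summit offset."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}: {line!r}")
    chrom, start, end, summit = fields[0], int(fields[1]), int(fields[2]), int(fields[9])
    if not 0 <= start < end:
        raise ValueError(f"invalid interval: {line!r}")
    if not 0 <= summit < end - start:
        raise ValueError(f"summit offset falls outside the peak: {line!r}")
    return chrom, start, end, summit

def input_window(chrom, start, summit, input_len=2114):
    """Window of input_len bp centred on the summit (assumed input layout)."""
    center = start + summit
    left = center - input_len // 2
    return chrom, left, left + input_len

# Two rows taken from the peak listing above.
rows = [
    "chr1\t5410234\t5410901\t.\t.\t.\t.\t.\t.\t333",
    "chr23\t23042932\t23043267\t.\t.\t.\t.\t.\t.\t167",
]
for row in rows:
    chrom, start, end, summit = parse_narrowpeak_line(row)
    print(chrom, "width:", end - start, "window:", input_window(chrom, start, summit))
```

Peaks close to a chromosome start would give a negative left coordinate here, so a real check should also compare windows against the chromosome sizes file.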

panushri25 commented 9 months ago

When you say "The peaks are genomic regions specific to a cell type, while the reads contain multiple cell types", do you mean that you are merging reads across multiple cell types (which ones?) while the peaks themselves are specific to one cell type (again, which one)?

What is the goal of your model?

panushri25 commented 9 months ago

Closing this due to inactivity. Feel free to reopen it if you continue to see issues.