robinweide / GENOVA

GENome Organisation Visual Analytics
GNU General Public License v3.0
70 stars 15 forks source link

calculating insulation score get inconsistant results #350

Open jiangshan529 opened 8 months ago

jiangshan529 commented 8 months ago

Hi, I am using GENOVA to calculate insulation score. the code is

hic <- load_contacts(signal_path = './hela_s3_insitu_hic_hg38_4DNFIAS8LV1C_5k.cool') insulation <- insulation_score(hic, window = 25 ) data <- setNames( insulation$insula_score[-4], c("chrom", "start", "end", "value") ) data <- data[is.finite(data$value),] gr <- with(data, GRanges(chrom, IRanges(start + 1, end))) score(gr) <- data$value export.bedGraph(gr, here::here("my_bedgraph.bedGraph")).

Previously, when I use window 25 for resolution of 5k, it always gives me narrow windows of insulation score switching between negative and positive values. But this time when I use the same code for another file of 5k insulation(around 500M reads), it gives me very wide window, and the most negative values cannot align to TAD boundaries in hic map. Could you please give me some hints on what's the problem here?

image the first track is the insulation score this time, the second track is previous calculated insulation score. Thanks!

teunbrand commented 8 months ago

Is this specific to this region or genome wide? Also have you looked at the contact maps themselves and do they noticably differ near the diagonal?

jiangshan529 commented 8 months ago

Is this specific to this region or genome wide? Also have you looked at the contact maps themselves and do they noticably differ near the diagonal?

It's genome-wide. And I extracted matrix from some regions and plot the hic heatmap, it looks alright.

teunbrand commented 8 months ago

So the only difference between the top and bottom is the data itself, right? No differences in data resolution, preprocessing, analysis code such as parameters, versions of software, etc?

jiangshan529 commented 8 months ago

So the only difference between the top and bottom is the data itself, right? No differences in data resolution, preprocessing, analysis code such as parameters, versions of software, etc?

yeah, for the ins calculation itself, I also tested the code with my previous data, there's no difference. However, I also cannot find what' the difference for the data preprocessing itself. I use bwa mem to map, then pairtools for contact matrics, and then juicertools to generate .hic file, then hic2cool to extract 5kb cool file, then cooler balance to normalize the data, then use GENOVA to calculate ins. I also tried not to use cooler balance, but it still gives me the same wired ins value.

teunbrand commented 8 months ago

I sympathise with the detective work, but if this is a data issue and not a software issue, there is little I can do. There isn't anything obvious in the tracks that I recognise as a particular problem with a clear solution.

jiangshan529 commented 1 month ago

I sympathise with the detective work, but if this is a data issue and not a software issue, there is little I can do. There isn't anything obvious in the tracks that I recognise as a particular problem with a clear solution.

Hi, Teun. Now I have a hic file with 1.7 billion reads, it repots error when i calculate ins score:

insulation <- insulation_score(hic, window = 25) Error in vecseq(f, len, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in 772086250 rows; more than 46325175 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice. hic A GENOVA contacts object named 'wt_merge_all_1.48B_UU_dedup.pairs.mcool' at a resolution of 5 kb. Contains the following slots:

  • MAT : Triplet format matrix containing 676777675 informative bins.
  • IDX : 617669 genomic indices in BED format.
  • CHRS : A vector of 24 chromosome names.
  • CENTROMERES: Locations of 24 centromeres. This object is assigned the colour 'black' 0 bins are masked. Some chromosomes have been removed. The data have not been Z-score normalised. The original data were loaded in as balanced data.