omegahh / DeepHiC

A GAN-based method for Enhancing Hi-C data
MIT License

Input-Output Dimension #2

Open ekokrek opened 4 years ago

ekokrek commented 4 years ago

Hello,

I tried to predict a high-resolution matrix using the processed file that you shared.

My assumptions were:

I placed the above file into the specified directory. Then, choosing the 1/16 model parameters, I ran the following command:

python data_predict.py -lr 40kb -ckpt save/deephic_raw_16.pth -c GM12878

The chromosome I am interested in is chromosome 12. Its size is 133851895 bases, so when it is binned at 10kb it should have 13,386 bins. However, the predicted chromosome 12 matrix has dimensions of 13,398 x 13,398. When I checked the input file, I saw that the 'sizes' key in the dictionary holds this same value of 13398 for chromosome 12. The same discrepancy occurs in other chromosomes too.
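For reference, the expected bin count above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; `expected_bins` is a hypothetical helper and not part of DeepHiC.

```python
def expected_bins(chrom_length_bp: int, resolution_bp: int) -> int:
    """Number of fixed-width bins needed to cover a chromosome (ceiling division)."""
    return (chrom_length_bp + resolution_bp - 1) // resolution_bp


chr12_length = 133_851_895      # chr12 length quoted above
n_expected = expected_bins(chr12_length, 10_000)
n_reported = 13_398             # value found under the 'sizes' key

print(n_expected)               # 13386
print(n_reported - n_expected)  # 12 extra bins
```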

So the question is: how are these shapes/sizes calculated?

Thanks in advance!

omegahh commented 4 years ago

Sorry for the delayed reply. The Hi-C data were downloaded from GSE63525, and only .tar.gz files were available when we downloaded them.

I checked the raw data from GSE63525 (e.g. GSE63525_GM12878_primary_intrachromosomal_contact_matrices.tar.gz). The largest binned coordinate in the three-column tab-separated file (chr12_10kb.RAWobserved) is 133840000, but there are 13398 values in the bias files (chr12_10kb.KRnorm/SQRTVCnorm/VCnorm). The processed matrix is expanded to 13398 to match the bias file. However, the values in the bias file are NaN when the row index is larger than 13384, so the corresponding values in the Hi-C matrix are all zeros.
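The comparison between the two files can be sketched roughly as follows. The data here are made-up toy values, and the sketch assumes that each RAWobserved line holds `start_i`, `start_j`, and a count separated by whitespace, and that the norm file holds one value per line with missing entries written as `NaN`. The helper names are hypothetical.

```python
import math


def max_bin_index(raw_lines, resolution_bp):
    """Largest 0-based bin index referenced by three-column (start_i, start_j, count) lines."""
    largest = -1
    for line in raw_lines:
        start_i, start_j, _count = line.split()
        largest = max(largest,
                      int(start_i) // resolution_bp,
                      int(start_j) // resolution_bp)
    return largest


def trailing_nan_count(norm_lines):
    """Number of consecutive NaN entries at the end of a normalization vector."""
    count = 0
    for line in reversed(norm_lines):
        if not math.isnan(float(line)):
            break
        count += 1
    return count


# Toy stand-ins for chr12_10kb.RAWobserved and chr12_10kb.KRnorm (made-up values).
raw = ["0\t0\t5.0", "10000\t30000\t2.0", "30000\t30000\t7.0"]
norm = ["NaN", "1.1", "0.9", "1.0", "NaN", "NaN"]

print(max_bin_index(raw, 10_000))  # 3: largest bin with any contact
print(len(norm))                   # 6: length of the bias vector
print(trailing_nan_count(norm))    # 2: padding entries at the end
```

In the toy data the bias vector is longer than the highest bin seen in the contact file, and the surplus entries at the end are NaN, which mirrors the situation described for chr12.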

ekokrek commented 4 years ago

Yes, I realized that the dimensions are taken from the KRnorm vector. However, I saw that there are NaNs in both the initial rows and the final rows. I couldn't decide from which end I should trim the predicted matrix, since I don't know the normalization procedure very well.

So, I guess the final rows and columns are the "extra/trimmable" NaN values. Would you agree with that? My main aim is to compare the final chromosome matrix with other predicted matrices, so I don't want to shift the values in any way and end up with a low similarity value.

Thanks again for the reply :)

omegahh commented 4 years ago

Yes, I agree with you. According to the description in the README file (GSE63525_GM12878_primary_README.rtf):

To normalize this entry using the KR normalization vector, one would divide 59.0 by the 8001st line ((40000000/5000)+1=8001) and the 8021st line ((40100000/5000)+1=8021) of GM12878_primary/5kb_resolution_intrachromosomal/chr1/MAPQGE30/chr1_5kb.KRnorm.
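The conversion in the quoted README passage can be written out as a small sketch. `line_number` and `normalize` are hypothetical helpers, and the norm vector below is made-up toy data; the only thing taken from the README is the rule "line = coordinate / resolution + 1" and the division by both norm factors.

```python
def line_number(position_bp: int, resolution_bp: int) -> int:
    """1-based line in the norm vector for a bin start coordinate."""
    return position_bp // resolution_bp + 1


def normalize(raw_value, norm_vector, start_i, start_j, resolution_bp):
    """Divide a raw contact by the norm factors of its row bin and column bin."""
    i = line_number(start_i, resolution_bp) - 1  # back to a 0-based list index
    j = line_number(start_j, resolution_bp) - 1
    return raw_value / (norm_vector[i] * norm_vector[j])


print(line_number(40_000_000, 5_000))  # 8001
print(line_number(40_100_000, 5_000))  # 8021

# Toy vector: bins 1 and 2 have factors 2.0 and 4.0.
toy_norm = [1.0, 2.0, 4.0]
print(normalize(8.0, toy_norm, 5_000, 10_000, 5_000))  # 1.0
```

Because coordinate 0 maps to line 1, the first entry of the vector corresponds to the first bin of the chromosome.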

We can see that genome locations are converted to line numbers in the bias vector with no shift at the beginning. So I think the final rows and columns can be omitted.
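A minimal sketch of that trimming, assuming the padding sits only at the end of each axis as discussed above. `trim_to_chromosome` is a hypothetical helper, not a DeepHiC function, and the matrix is a small nested list rather than a real prediction.

```python
def trim_to_chromosome(matrix, chrom_length_bp, resolution_bp):
    """Keep the top-left block covering the chromosome; drop trailing padding rows/columns."""
    n_bins = (chrom_length_bp + resolution_bp - 1) // resolution_bp
    return [row[:n_bins] for row in matrix[:n_bins]]


# Toy 5 x 5 "padded" matrix for a 25 kb chromosome at 10 kb resolution (3 real bins).
padded = [[r * 5 + c for c in range(5)] for r in range(5)]
trimmed = trim_to_chromosome(padded, 25_000, 10_000)

print(len(trimmed), len(trimmed[0]))   # 3 3
print(trimmed[0][0] == padded[0][0])   # True: the origin is not shifted
```

With a NumPy array the same trim is `matrix[:n_bins, :n_bins]`. Rows at the start of the matrix are left in place even if they are NaN or zero, since removing them would shift every remaining bin relative to its genomic coordinate.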