kimmo1019 / hicGAN

Hi-C data super-resolution with generative adversarial networks
MIT License

Bad Prediction #5

Open ekokrek opened 4 years ago

ekokrek commented 4 years ago

Hello there,

I am using your model. Even with pre-trained model parameters, I obtain pretty bad predictions.

I want to combine the subregions into one matrix per chromosome and compare at that level, instead of comparing subregions one-to-one. Below is the reverse of your code, which recombines the subregions into the whole matrix:

import numpy as np
import hickle as hkl
import pandas as pd
import math

# We need the chromosome sizes and the submatrix indices from the original test_data

df = pd.read_csv("../../../chromosome.txt", sep="\t", header=None)
chrsizes = df.values[0:25,1] # sex and mitochondrial chrs included hg19
re_mat = hkl.load("../test_data.hkl")
dists = re_mat[2]
distc = []
for i in range(0,len(dists)):
    distc.append(dists[i][1])
predchr = pd.unique(distc)
sub_inds = np.load("test_allchr_subregion_inds.npy")
thred = 200
size = 40
c = 1
for cname in predchr:
    pr_mat = np.load('%s/sr_mats_pre.npy'%cname)
    remat_ind = sum(sub_inds[:c])
    c +=1
    rematCond = re_mat[2][remat_ind:sum(sub_inds[:c])]
    pp = 0
    cnum = int(cname.split("chr")[1])
    bin = int(math.ceil(chrsizes[cnum-1]/10000.0)) # ceil returns float ! 
    row,col = bin,bin
    sr_mat = -1*np.ones((row,col))

    # Recombine the predicted submatrices into the original dimensions

    for idx1 in range(0,row-size,size):
        for idx2 in range (0,col-size,size):
            my_cond = rematCond[pp][:]==[idx1-idx2,cname]
            if (abs(idx1-idx2)<thred) & (my_cond):
                sr_mat[idx1:idx1+size,idx2:idx2+size] = pr_mat[pp].reshape(40,40)
                pp+=1           
            if pp==pr_mat.shape[0]:
                break;      
        if pp==pr_mat.shape[0]:
            break;

    np.save("./pred_%s_hicGAN.npy"%cname,sr_mat)

I have been working with your model for a while now and cannot find a mistake on my side, if there is one.

Do you think the recent update to the model could have introduced a problem?

Thank you,
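One way to make the recombination step less fragile is sketched below. The loop above matches each prediction to a position using only the stored offset `idx1-idx2` and the chromosome name, so several positions on the same diagonal share one key. The sketch assumes instead that the absolute start indices of every block were recorded when the matrix was split. The `starts` list and the `assemble_blocks` helper are hypothetical and are not part of hicGAN or of the `test_data.hkl` layout shown in this thread.

```python
import numpy as np

def assemble_blocks(blocks, starts, n_bins, size=40, fill=-1.0):
    """Place square blocks into an n_bins x n_bins matrix.

    blocks: array of shape (n_blocks, size, size) or (n_blocks, size, size, 1)
    starts: list of (row_start, col_start) pairs, one per block
            (hypothetical metadata recorded at split time)
    Cells not covered by any block keep the `fill` value.
    """
    blocks = np.asarray(blocks).reshape(len(starts), size, size)
    full = np.full((n_bins, n_bins), fill, dtype=float)
    for block, (r, c) in zip(blocks, starts):
        full[r:r + size, c:c + size] = block
    return full

# Toy example: a 6x6 matrix cut into 2x2 blocks, with one block missing
size, n_bins = 2, 6
truth = np.arange(n_bins * n_bins, dtype=float).reshape(n_bins, n_bins)
starts = [(r, c) for r in range(0, n_bins, size) for c in range(0, n_bins, size)]
starts.remove((2, 2))  # pretend this block never reached the prediction step
blocks = np.stack([truth[r:r + size, c:c + size] for r, c in starts])

rebuilt = assemble_blocks(blocks, starts, n_bins, size=size)
```

Because each block carries its own coordinates, a missing block leaves a gap filled with `-1` and does not shift the blocks that follow it.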

kimmo1019 commented 4 years ago

Hi ekokrek, I'm sorry to hear that you cannot get good results. Before I go through your code, I need to confirm two things: 1) did you train hicGAN with the data provided in our study or with your own custom data? 2) did you preprocess the data as demonstrated in data_split.py? Thanks!

ekokrek commented 4 years ago

Thank you for the quick reply.

import hickle as hkl
import pandas as pd
import re
import numpy as np

mat = hkl.load("../test_data.hkl")
dists = mat[2]
distc = []

for i in range(0,len(dists)):
    distc.append(dists[i][1])

predchr = pd.unique(distc)

tind = 0
subs = [0]

for cname in predchr:
    initiate_ind =  sum(subs)
    str2join = ["\'",cname,"\'"]
    chrPattern = "".join(str2join)
    subs.append(len(re.findall(chrPattern, str(dists)))) # number of submatrices in a chromosome
    print(subs[tind])
    z = mat[1][initiate_ind:initiate_ind+subs[tind+1],:,:,:]
    print(z.shape)
    tind+=1
    np.save('test_%s_input.npy'%cname,z)

np.save("test_allchr_subregion_inds.npy",subs)

Finally, I obtained an sr_mats_pre.npy file for each chromosome.

I asked about the update because the model previously created 2901 submatrices for chromosome 12, and now it creates 2899. What could be the reason for this change?
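As a cross-check on the per-chromosome counts, the number of submatrices can be tallied directly from the metadata instead of searching the string form of `dists`. Whether `dists` is a list or a NumPy array is not shown in this thread. If it is a large NumPy array, `str()` abbreviates it with an ellipsis by default, which would make a regex count too low. The sketch below assumes each entry is an `[offset, chrom]` pair and that the entries of one chromosome are stored contiguously. `chromosome_slices` is a hypothetical helper name.

```python
from collections import Counter
from itertools import accumulate

def chromosome_slices(dists):
    """Return {chrom: (start, stop)} index ranges, one per chromosome.

    dists: sequence of [offset, chrom] pairs, as in mat[2] above.
    Assumes all entries of a chromosome are stored contiguously.
    """
    # Counter keeps chromosomes in first-seen order (Python 3.7+)
    counts = Counter(str(entry[1]) for entry in dists)
    stops = list(accumulate(counts.values()))
    starts = [0] + stops[:-1]
    return {chrom: (a, b) for chrom, a, b in zip(counts, starts, stops)}

# Toy metadata: 'chr1' and 'chr10' must not be confused with each other
dists = [[0, "chr1"], [-40, "chr1"], [40, "chr10"], [0, "chr10"], [0, "chr2"]]
slices = chromosome_slices(dists)
```

Each `(start, stop)` pair can then be used to slice the input array, for example `mat[1][start:stop]`.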

kimmo1019 commented 4 years ago

The model training should not be so slow. Which kind of GPU are you using? I also suggest that you use hicGAN_evaluate.py, which accepts test_data.hkl. When you run hicGAN_evaluate.py, it reports several metrics such as MSE and PSNR. If you post the MSE and PSNR you obtain, I can tell whether the model is well trained or not. Thanks.

ekokrek commented 3 years ago

I used the GM12878 primary data. I downloaded and preprocessed it using your bash scripts raw_data_download_script.sh and preprocess.sh.

Then I used data_split.py to create my own input hkl file lowResInput_data.hkl out of chr18.

    in_c18, tar_c18, dist_c18 = data_split(["chr18"])
    hkl.dump([in_c18, tar_c18, dist_c18],'data/%s/lowResInput_data.hkl'%cell)

I used that hkl in hicGAN_evaluate.py along with the pretrained weights that you share (./pretrain/g_hicgan_GM12878_weights.npz) as follows:

    lr_mats_test,hr_mats_test, _ = hkl.load('data/%s/lowResInput_data.hkl'%cell)

The result is:

    mse_hicGAN_norm:0.73867
    psnr_hicGAN_norm:-2.02605

So, do you think these results are OK? Is a negative PSNR expected?
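For reference, PSNR is commonly defined as 10·log10(peak² / MSE), so it goes below zero whenever the MSE exceeds the squared peak value. The sketch below is a minimal illustration of that definition, assuming a peak of 1.0 and a median taken over per-sample values. How hicGAN_evaluate.py normalizes the data and aggregates its metrics is not shown in this thread, so both choices are assumptions.

```python
import math
from statistics import median

def mse(pred, target):
    """Mean squared error between two equal-length flat sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr(pred, target, peak=1.0):
    """PSNR in dB; `peak` is the assumed maximum signal value."""
    err = mse(pred, target)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)

# Two flattened toy samples with target values in [0, 1]
target = [[0.0, 0.5, 1.0, 0.5], [0.2, 0.4, 0.6, 0.8]]
good = [[0.1, 0.5, 0.9, 0.5], [0.2, 0.5, 0.6, 0.7]]      # small errors
bad = [[1.5, -1.0, 2.5, -0.5], [1.2, 1.6, -0.6, 2.0]]    # errors larger than the peak

good_psnr = median(psnr(p, t) for p, t in zip(good, target))  # positive
bad_psnr = median(psnr(p, t) for p, t in zip(bad, target))    # negative
```

Under the same peak-of-1 assumption, a single overall MSE of 0.73867 would correspond to about 1.3 dB. The reported -2.03 therefore suggests either a different peak value or a per-sample aggregation, which would be worth checking in the evaluation script.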

kimmo1019 commented 3 years ago

Hi, I don't think this is a reasonable result. Another user emailed me today and reported that, using the pretrained model, they achieved a median MSE of 0.02075, PSNR of 15.40483, and SSIM of 0.18497. Since the two of you obtained different results, I'll check this out soon. Thanks.