ekokrek opened 4 years ago
Hi ekokrek, I'm sorry to hear that you cannot get good results. Before I go through your code, I need to confirm: 1) did you train hicGAN with the data provided in our study or with your own data? 2) did you preprocess the data as demonstrated in data_split.py? Thanks!
Thank you for the quick reply.
I used the GM12878 primary cell line.
I downloaded, preprocessed, and split the data using your scripts.
The only difference is the chromosome choice.
Training: 1, 2, 3, 6, 7, 9, 11, 13, 15, 16, 17, 19, 20
hr_mats_train, lr_mats_train, distance_train = data_split(['chr%d' % idx for idx in (1, 2, 3, 6, 7, 9, 11, 13, 15, 16, 17, 19, 20)])
Validation: 4, 8, 14, 21
hr_mats_valid, lr_mats_valid, distance_valid = data_split(['chr%d' % idx for idx in (4, 8, 14, 21)])
Test: 5, 10, 12, 18, 22
hr_mats_test, lr_mats_test, distance_test = data_split(['chr%d' % idx for idx in (5, 10, 12, 18, 22)])
I trained the model using run_hicGAN.py; it took ~35 hours.
Then I divided test_data.hkl into per-chromosome .npy inputs for hicGAN_predict.py, because hicGAN_predict.py doesn't accept test_data.hkl:
import hickle as hkl
import pandas as pd
import re
import numpy as np

mat = hkl.load("../test_data.hkl")
dists = mat[2]
distc = []
for i in range(len(dists)):
    distc.append(dists[i][1])
predchr = pd.unique(distc)

tind = 0
subs = [0]
for cname in predchr:
    initiate_ind = sum(subs)
    chrPattern = "'%s'" % cname  # quoted so 'chr1' does not also match 'chr10'
    subs.append(len(re.findall(chrPattern, str(dists))))  # number of submatrices in this chromosome
    print(subs[tind])
    z = mat[1][initiate_ind:initiate_ind + subs[tind + 1], :, :, :]
    print(z.shape)
    tind += 1
    np.save('test_%s_input.npy' % cname, z)
np.save("test_allchr_subregion_inds.npy", subs)
Finally, I obtained an sr_mat_pre.npy for each chromosome.
I asked about the update because, previously, the model created 2901 submatrices for chromosome 12, but now it creates 2899. What could be the reason for this change?
The model training should not be so slow. What kind of GPU are you using?
I also suggest referring to hicGAN_evaluate.py, which does accept test_data.hkl.
When you use hicGAN_evaluate.py, several metrics will be reported, such as MSE and PSNR.
If you report the MSE and PSNR you achieved, I can tell whether the model is well trained or not.
Thanks.
I used GM12878 primary data, downloaded and preprocessed using your bash scripts raw_data_download_script.sh and preprocess.sh.
Then I used data_split.py to create my own input hkl file, lowResInput_data.hkl, from chr18.
in_c18, tar_c18, dist_c18 = data_split(["chr18"])
hkl.dump([in_c18, tar_c18, dist_c18], 'data/%s/lowResInput_data.hkl' % cell)
I used that hkl in hicGAN_evaluate.py along with the pretrained weights you shared (./pretrain/g_hicgan_GM12878_weights.npz) as follows:
lr_mats_test, hr_mats_test, _ = hkl.load('data/%s/lowResInput_data.hkl' % cell)
the result is: mse_hicGAN_norm:0.73867 psnr_hicGAN_norm:-2.02605
So, do you think these results are OK? A negative PSNR?
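For context, under the textbook definition a negative PSNR simply means the MSE exceeds the square of the peak value used for normalization. A minimal sketch, assuming that standard formula (hicGAN_evaluate.py may normalize differently, so this is an illustration, not the script's exact computation):

```python
import numpy as np

def psnr(mse, max_val=1.0):
    """Peak signal-to-noise ratio (in dB) from a mean squared error.

    Textbook definition: PSNR = 10 * log10(max_val**2 / MSE).
    Assumes matrices are normalized to [0, max_val].
    """
    return 10.0 * np.log10(max_val ** 2 / mse)

# PSNR drops below 0 dB exactly when the MSE exceeds max_val**2:
print(psnr(0.01))  # small error -> 20.0 dB
print(psnr(2.0))   # error above the squared peak -> negative PSNR
```

So whether -2.02605 dB is plausible depends on how the matrices were scaled before the MSE was computed.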
Hi, I don't think this is a reasonable result. Another user emailed me today and reported that, using the pretrained model, they achieved median MSE: 0.02075, PSNR: 15.40483, SSIM: 0.18497. Since you two achieved different results, I'll look into this soon. Thanks.
Hello there,
I am using your model. Even with the pre-trained model parameters, I obtain pretty bad predictions.
I want to combine the subregions into a per-chromosome matrix and compare that, instead of doing a one-to-one subregion comparison. Below is the reverse of your code, combining subregions back into the whole matrix:
# we'll need chromosome sizes and indices of submatrices from the original test_data
# recombine the predicted matrix into original dimensions
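The recombination step those comments describe could be sketched as below. This assumes the chromosome matrix was cut into non-overlapping 40x40 blocks stored in row-major order; the `combine_subregions` helper is hypothetical, for illustration only, and the actual layout produced by data_split.py may differ (e.g. it may keep only blocks near the diagonal):

```python
import numpy as np

def combine_subregions(subs, n_bins, tile=40):
    """Stitch predicted sub-matrices back into one chromosome matrix.

    Illustrative helper (not part of hicGAN): assumes non-overlapping
    tile x tile blocks in row-major order.

    subs:   array of shape (n_blocks, tile, tile)
    n_bins: number of bins along one side of the full matrix
    """
    n = n_bins // tile
    full = np.zeros((n * tile, n * tile), dtype=subs.dtype)
    for k in range(n * n):
        i, j = divmod(k, n)
        full[i * tile:(i + 1) * tile, j * tile:(j + 1) * tile] = subs[k]
    return full

# toy check: 4 constant 40x40 blocks reassemble into an 80x80 matrix
blocks = np.stack([np.full((40, 40), v) for v in range(4)])
whole = combine_subregions(blocks, n_bins=80)
print(whole.shape)  # (80, 80)
```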
I have been working with your model for a while now, and I couldn't find a mistake on my side, if there is one.
So, do you think a problem may have been introduced when you updated the model?
Thank you,