MaoFuyou opened this issue 4 months ago
Hi, can you show some visual examples so we can see what is going on in the generated PET?
I have found an issue: because of my limited GPU memory, img_size can only be set very small, such as (64,64,4) or (24,24,24), and with batch_size only 2. Obviously, this makes the final image quality very bad.
Look! This is obviously 64x64x4, and it is blurry because the pixel resolution is too low.
Good to know. Don't forget the patch_size (not the img_size) is there for you to adjust, which can help avoid the GPU memory issue. Try maybe patch_size = (256,256,4) or something similar.
Hi! I don't get how to adjust the patch_size. I can only see this relevant parameter!
Ah, I just committed a new version to the Jupyter notebook; it's in the second block:
The thing is, I automatically pad or crop all images to the same size img_size, then randomly crop several patches of patch_size for training. In testing, the generation also automatically slides through the whole image volume, from top to bottom and from left to right.
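Roughly, the pipeline looks like this (a minimal sketch using MONAI transforms, not the exact notebook code; img_size, patch_size, patch_num, and the "image"/"label" keys here are placeholder assumptions):

```python
# Sketch only: pad/crop every volume to img_size, randomly crop patch_num patches
# of patch_size for training, and slide a window over the whole volume at test time.
from monai.transforms import (Compose, LoadImaged, EnsureChannelFirstd,
                              ResizeWithPadOrCropd, RandSpatialCropSamplesd)
from monai.inferers import SlidingWindowInferer

img_size = (128, 128, 128)   # every volume is padded/cropped to this size
patch_size = (64, 64, 8)     # the patches actually fed to the network
patch_num = 2                # random patches per volume per iteration

train_transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    ResizeWithPadOrCropd(keys=["image", "label"], spatial_size=img_size),
    # returns a list of patch_num cropped dictionaries
    RandSpatialCropSamplesd(keys=["image", "label"], roi_size=patch_size,
                            num_samples=patch_num, random_size=False),
])

test_transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    ResizeWithPadOrCropd(keys=["image", "label"], spatial_size=img_size),
])

# At test time, generation slides patch-by-patch across the whole volume.
inferer = SlidingWindowInferer(roi_size=patch_size, sw_batch_size=1, overlap=0.25)
```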
Hi, did you only modify these two parts, patch_size and the transforms? I don't know how to fix it. By the way, my input data is 64x64x64.
Traceback (most recent call last):
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/main.py", line 509, in
Oh, I think maybe the commit needs some time? Try it again. I checked the code; it should have no problem. Just put 64x64x64 as img_size, and set patch_size to whatever you want (make it as large as your GPU can handle). Btw, PET with size 64x64x64? Looks too low haha.
I don't know what can lead to this issue. Yes, my PET data is 64x64x64, and in the past this MRI-PET dataset has been used to produce results in some papers.
Can you send me a screenshot of this part?
```python
def __len__(self):
    return len(self.data)

def __getitem__(self, idx):
    img_path, class_name = self.data[idx]
    label_path, class_name = self.label[idx]
    cao = {"image": img_path, 'label': label_path}
    if not self.train_flag:
        # Testing: transform the whole volume
        affined_data_dict = self.test_transforms(cao)
        img_tensor = affined_data_dict['image'].to(torch.float)
        label_tensor = affined_data_dict['label'].to(torch.float)
    else:
        # Training: the random-crop transform returns a list of patch dictionaries
        affined_data_dict = self.train_transforms(cao)
        img = np.zeros([patch_num, img_size[0], img_size[1], img_size[2]])
        label = np.zeros([patch_num, img_size[0], img_size[1], img_size[2]])
        for i, after_l in enumerate(affined_data_dict):
            img[i, :, :, :] = after_l['image']
            label[i, :, :, :] = after_l['label']
        img_tensor = torch.unsqueeze(torch.from_numpy(img.copy()), 1).to(torch.float)
        label_tensor = torch.unsqueeze(torch.from_numpy(label.copy()), 1).to(torch.float)
    return img_tensor, label_tensor
```
And by the way, my input data is nii.
Well, you see, my version and your version differ in the nii reader.
Ok! I see it and fixed it, and now it runs successfully in the training phase, but in the evaluation phase:
Traceback (most recent call last):
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/main.py", line 517, in
I can deal with this issue!
I can confirm the training phase is correct, but something is wrong in the evaluation phase:
Exception occurred: AssertionError
input feature has wrong size
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/network/nnFormer.py", line 350, in forward
assert L == S * H * W, "input feature has wrong size"
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/network/Diffusion_model_transformer.py", line 274, in _forward
h = blk(h)
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/network/util_network.py", line 138, in checkpoint
return func(*inputs)
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/network/Diffusion_model_transformer.py", line 251, in forward
return checkpoint(
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/network/Diffusion_model_transformer.py", line 44, in forward
x = layer(x, emb)
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/network/Diffusion_model_transformer.py", line 598, in forward
h = module(h, emb)
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/diffusion/respace.py", line 123, in call
return self.model(x, new_ts, kwargs)
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/diffusion/GaussianDiffusion.py", line 322, in p_mean_variance
model_output = model(x_input, self._scale_timesteps(t), **model_kwargs)
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/diffusion/respace.py", line 92, in p_mean_variance
return super().p_mean_variance(self._wrap_model(model), *args, **kwargs)
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/diffusion/GaussianDiffusion.py", line 481, in p_sample
out = self.p_mean_variance(
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/diffusion/GaussianDiffusion.py", line 586, in p_sample_loop_progressive
out = self.p_sample(
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/diffusion/GaussianDiffusion.py", line 533, in p_sample_loop
for sample in self.p_sample_loop_progressive(
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/main.py", line 406, in diffusion_sampling
sampled_images = diffusion.p_sample_loop(model,(condition.shape[0], 1,
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/main.py", line 429, in evaluate
sampled_images = inferer(condition,diffusion_sampling,model)
File "/root/commonfiles/Maoeyu/MRI-PET-diffu/main.py", line 517, in
Weird, this error occurs when your input image size (patch_size) is not large enough. But if training works, testing should also work. Can you try patch_size = (64,64,8)?
Ok, I'll try it!
Oh no! 64x64x8 with batch_size 2 runs out of CUDA memory. Is there any other option?
So 64x64x8 doesn't fit? Then I suggest you use batch_size = 1, and you can also try patch_size = (32,32,8).
Ok, I am trying!
Oh no, something is still wrong!
I know what happened. The network is too deep, so downsampling produces feature sizes that are no longer divisible. In the network-building block in the Jupyter notebook, try:
num_channels=64
attention_resolutions="16,8"
channel_mult = (1, 2, 3)
num_heads=[4,4,8]
window_size = [[4,4,4],[4,4,4],[4,4,2]]
num_res_blocks = [2,2,2]
sample_kernel=([2,2,2],[2,2,1],[2,2,1]),
Here I reduce the network depth from 4 to 3. Try this with a 32x32x4 patch size. When you have a larger GPU, increase the depth again.
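As a quick sanity check (a sketch, not code from the repo; the helper name is made up), you can verify that a candidate patch_size stays integer-divisible through every downsampling stage of sample_kernel:

```python
# Sketch: does patch_size survive all downsampling stages without fractional sizes?
def check_patch_size(patch_size, sample_kernel):
    size = list(patch_size)
    for stage, kernel in enumerate(sample_kernel):
        for axis, k in enumerate(kernel):
            if size[axis] % k != 0:
                return False, f"stage {stage}: {tuple(size)} not divisible by {kernel}"
            size[axis] //= k
    return True, f"final feature size {tuple(size)}"

print(check_patch_size((32, 32, 4), ([2, 2, 2], [2, 2, 1], [2, 2, 1])))  # ok, ends at (4, 4, 2)
print(check_patch_size((32, 32, 4), ([2, 2, 2], [2, 2, 2], [2, 2, 2])))  # fails in the depth axis
```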
It is sad! I have reduced the network depth from 4 to 3, but it still fails.
Maybe, if you have time, I can send my code and part of the data to you!
How about trying this one? Let's do one last try.
num_channels=64
attention_resolutions="16,8"
channel_mult = (1, 2, 3)
num_heads=[4,4,8]
window_size = [[4,4,4],[4,4,4],[4,4,2]]
num_res_blocks = [2,2,2]
sample_kernel=([1,1,1],[2,2,1],[2,2,1]),
Oh, it still failed.
Let me do it tomorrow haha. I will make one for patch size = 32x32x8. Btw, have you tried batch_size = 1 for 64x64x8?
When I use batch_size = 1 and 64x64x8, a new error occurs:
x1 and y1 are the outputs from my debugging. I cannot figure out why they have different dimensions.
I see. The error tells you that your network output is 1x1x84x102x84 (I think because your MRI has that size), but your ground-truth PET image is 1x1x91x102x91, so the MAE cannot be calculated. That should be a data error.
Another thing: I suggest you enable this code in your reader, to make sure your images have a consistent size that is divisible by 2^4 (since you have four layers).
But be careful: the above error may also be telling you that your MRI and PET have different voxel spacing. That is not good. You should resample them to the same spacing first, then pad/crop them.
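Something like this would do it (a minimal MONAI sketch, not the repo's reader; the 2 mm spacing, the 128x128x128 target size, and the "image"/"label" keys are example assumptions):

```python
# Sketch: resample MRI ("image") and PET ("label") to the same voxel spacing,
# then pad/crop both to one fixed size that is divisible by 2^4 = 16.
from monai.transforms import (Compose, LoadImaged, EnsureChannelFirstd,
                              Spacingd, ResizeWithPadOrCropd)

preprocess = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    Spacingd(keys=["image", "label"], pixdim=(2.0, 2.0, 2.0),
             mode=("bilinear", "bilinear")),             # same spacing for both volumes
    ResizeWithPadOrCropd(keys=["image", "label"],
                         spatial_size=(128, 128, 128)),  # 128 = 16 * 8
])

sample = preprocess({"image": "subject01_mri.nii", "label": "subject01_pet.nii"})
print(sample["image"].shape, sample["label"].shape)  # both (1, 128, 128, 128)
```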
Yes, it may need some time to run. Also, could I get your Telegram or X account so we can communicate in more detail?
Since I enabled ResizeWithPadOrCropd, the phase seems to hang and the GPU memory is full. Is that normal?
It is still holding the GPU. I want to ask: is this ResizeWithPadOrCropd suited for training, testing, or both?
You should use this resize function in both training and testing. Well, it is normal for the GPU to be full. Is it still running and showing the loss and evaluation MAE at different epochs?
Ok! It is still running, but it stays at epoch 0.
You mean it is still in epoch 0 and still evaluating?
Yes;
/root/userfolder/tonggan/anaconda3/envs/Xiong_poj/lib/python3.8/site-packages/torch/nn/functional.py:3657: UserWarning: The default behavior for interpolate/upsample with float scale_factor changed in 1.6.0 to align with other frameworks/libraries, and now uses scale_factor directly, instead of relying on the computed output size. If you wish to restore the old behavior, please set recompute_scale_factor=True. See the documentation of nn.Upsample for details. warnings.warn(
optimization time: 5.730208873748779
[ 0/ 147 ( 0%)] A_to_B_Loss: 1.0015516
optimization time: 0.9895610809326172
[ 30/ 147 ( 20%)] A_to_B_Loss: 1.5312134
optimization time: 1.0017342567443848
[ 60/ 147 ( 41%)] A_to_B_Loss: 1.2967397
optimization time: 1.0088763236999512
[ 90/ 147 ( 61%)] A_to_B_Loss: 1.0642425
optimization time: 1.0104715824127197
[ 120/ 147 ( 82%)] A_to_B_Loss: 0.9033197
Total time per sample is: 151.69771456718445
Averaged loss is: 0.7906994
Execution time: 169.63 seconds
This code has been running for nearly 4 hours with the GPU full.
Can you try commenting out the training and running the evaluation using the untrained DDPM? Just need to confirm it is working.
Ok I will try it. It's already late at night in China. I'll tell you about the test tomorrow morning.
Unfortunately, when I comment out the training part, the GPU memory immediately maxes out. The relevant parameter settings and the evaluation code are:

```python
def evaluate(model, epoch, path, data_loader1, best_loss):
    model.eval()
    prediction = []
    true = []
    img = []
    loss_all = []
    mae_all = []
    psnr_all = []
    ssim_all = []
    ncc_all = []
    aa = time.time()
    with torch.no_grad():
        for i, (x1, y1) in enumerate(data_loader1):
            target = y1.to(device)
            condition = x1.to(device)
            with torch.cuda.amp.autocast():
                sampled_images = inferer(condition, diffusion_sampling, model)
            # x1:             [1, 1, 84, 102, 84]
            # y1:             [1, 1, 91, 108, 91]
            # sampled_images: [1, 1, 84, 102, 84]
            # target:         [1, 1, 91, 108, 91]
            loss = metric(sampled_images, target)
            # print('L1 loss: ' + str(loss))
            img.append(x1.cpu().numpy())
            true.append(target.cpu().numpy())
            prediction.append(sampled_images.cpu().numpy())
            loss_all.append(loss.cpu().numpy())
            mae = compute_mae(true[-1], prediction[-1])
            psnr = compute_psnr(true[-1], prediction[-1])
            ssim = compute_ssim(true[-1], prediction[-1])
            ncc = compute_ncc(true[-1], prediction[-1])
            mae_all.append(mae)
            psnr_all.append(psnr)
            ssim_all.append(ssim)
            ncc_all.append(ncc)
            print('optimization time: ' + str(1 * (time.time() - aa)))
    # The save code, you can replace it by your code for other files, e.g. nii or dicom
    data = {"img": img,
            'label': true,
            'prediction': prediction,
            'loss': np.mean(loss_all),
            'mae': np.mean(mae_all),
            'psnr': np.mean(psnr_all),
            'ssim': np.mean(ssim_all),
            'ncc': np.mean(ncc_all)}
    scipy.io.savemat(path + 'test_example_epoch' + str(epoch) + '.mat', data)
    if np.mean(loss_all) < best_loss:
        scipy.io.savemat(path + 'all_final_test_another.mat', data)
    return np.mean(loss_all), np.mean(mae_all), np.mean(psnr_all), np.mean(ssim_all), np.mean(ncc_all)


num_channels = 64
attention_resolutions = "16,8"
channel_mult = (1, 2, 3)
num_heads = [4, 4, 8]
window_size = [[4, 4, 4], [4, 4, 4], [4, 4, 2]]
num_res_blocks = [2, 2, 2]
sample_kernel = ([1, 1, 1], [2, 2, 1], [2, 2, 1]),
BATCH_SIZE_TRAIN = 1 * 1
img_size = (128, 128, 128)
patch_size = (64, 64, 8)
spacing = (2, 2, 2)
patch_num = 2
channels = 1
metric = torch.nn.L1Loss()
```
I think the problem is in the evaluation: it keeps occupying the GPU but never actually progresses.
I don't have a Chinese input method on the lab computer, so please let me do it in English. The code works fine on my computer. Let's debug it step by step: add the code "print(target.shape)".
Let's see if the error happens in the data loader.
Ok, I will try it!
Yes, the target shape is as you see. But the code is still running.
OK. The reader is correct. Then let's move to the next step:
Comment out this line, unindent the next line, and put print('aaaaaaaaaaaaa') below the line "sampled_images = ......".
Yes, it seems to run smoothly.
OK. Now make your testing dataset only have 1 MRI and 1 PET. Run the whole evaluation.
Ok. So it is just too slow?
Haha yes. I can see your network takes 171 seconds to run 147 samples for one timestep. Let's say about 1 second per patch per timestep, and each patch needs 50 timesteps. If one patient has size 128x128x128, then we have 2x2x32 = 128 patches (with patch size 64x64x4), so each patient should need about 2 hours. It looks like your first patient is small, so it only takes 500 seconds.
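Spelled out as a quick calculation (same assumptions as above, nothing new):

```python
# Back-of-the-envelope runtime estimate for one patient.
img_size = (128, 128, 128)
patch_size = (64, 64, 4)
sec_per_patch_step = 1.0   # ~1 second per patch per diffusion timestep
timesteps = 50             # sampling timesteps per patch

patches = 1
for full, patch in zip(img_size, patch_size):
    patches *= full // patch            # 2 * 2 * 32 = 128 patches

total_sec = patches * timesteps * sec_per_patch_step
print(f"{patches} patches -> ~{total_sec / 3600:.1f} hours per patient")  # ~1.8 h
```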
I used your model to train on paired MRI-PET nii data, but I get an extremely bad output! I want to know why and how to deal with it. training_log.csv