vigsivan / RWCNet

Official implementation of Recurrence with Correlation Network for Medical Image Registration
MIT License

Retrain RWCNet using Oasis Data set from Learn2Reg #2

Closed bhatrana closed 1 year ago

bhatrana commented 1 year ago

Hi Vignesh,

@vigsivan @mrhardisty

Thank you for this nice work on the registration problem in medical imaging. I was trying to retrain this model on the OASIS data to reproduce the results. The number of steps in each stage was set according to the paper.

I have two questions.

1) I used an NVIDIA TITAN RTX with 24 GB of memory, and training the whole model took around 4 weeks. How did you manage to train on an NVIDIA A100 (32 GB) in around 30 hours? Could you please share the configuration and the hardware allocation you requested?

2) After 4 weeks of training, I was not able to reach the DSC value reported in the paper.

Here are some of the TensorBoard graphs and validation results:

[3 images]

Please feel free to point out any errors I may be making in training and validation.

FYI,

The training command I used for this repo is:

python l2r_train_eval.py /home/rranabhat/OASIS/OASIS_dataset.json train_config.json

and for validation:

python eval.py oasis_files_paper/checkpoints/data.json eval_config.json

Afterwards, all metrics were computed with the Learn2Reg evaluation repo.

Thank you.

Best regards,
Raj

vigsivan commented 1 year ago

Hi @bhatrana , can you please share your config?

vigsivan commented 1 year ago

Re: GPU and training speed: if you trained on Bender, it could be that it has slow filesystem access (or maybe it's a different issue). I was also able to use a faster GPU (A100 64GB), which made training a lot faster for me.
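One rough way to check the slow-filesystem hypothesis is to time raw reads of a few training volumes from the storage in question. The following is only a minimal sketch using the Python standard library; the function name `read_throughput_mb_s` and the paths passed to it are hypothetical and not part of this repo.

```python
import time


def read_throughput_mb_s(paths, chunk_size=1 << 20):
    """Read every file in `paths` once and return the throughput in MiB/s.

    Hypothetical helper for illustration; it is not part of RWCNet.
    """
    total_bytes = 0
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as f:
            # Read in fixed-size chunks so large volumes do not need to fit in memory.
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    # Guard against a zero elapsed time on very small inputs.
    return total_bytes / (1024 * 1024) / max(elapsed, 1e-9)
```

Note that a second run over the same files may be served from the operating system's page cache and look much faster, so the first run on files that have not been read recently is the more informative number.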

animesh-007 commented 1 year ago

Hi @bhatrana, How did you get the DSC value? I also used the same command for validation, but I am not able to see any logging of the DSC value for the OASIS dataset.

bhatrana commented 1 year ago

@vigsivan: [image] It seems I changed the default at https://github.com/vigsivan/RWCNet/blob/d90b48f7377b0b69dd655c625772be0328bf166b/networks.py#L354 to resolutions: List[int]=[4,4] because of a recurring CUDA error on Compute Canada, and later forgot to restore the default value when training on Bender.

I am training the network again with the default value and will post the results by next week.

Is the A100 64GB GPU allocation on Compute Canada?

Thanks!

bhatrana commented 1 year ago

@animesh-007: This repo is based on Learn2Reg (https://learn2reg.grand-challenge.org/). You can run evaluation/testing with eval.py, which can store the displacement field of each pair.

To quantify the results with several metrics, you can use the following repo, which is also based on L2R:

https://github.com/MDL-UzL/L2R/tree/main/evaluation

Thanks

animesh-007 commented 1 year ago

Thanks @bhatrana. The link you shared was very helpful. I was able to calculate metrics offline. But the evaluation script doesn’t give the Dice metric as reported in the RWCNet paper. Can you guide me on how I can calculate the Dice metric?
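For reference, the Dice similarity coefficient (DSC) for one label is 2|A ∩ B| / (|A| + |B|), where A and B are the voxel sets carrying that label in the fixed and the warped moving segmentation. Below is a minimal NumPy sketch of that formula. It assumes both segmentations are integer label maps of identical shape and that the moving segmentation has already been warped with nearest-neighbour interpolation. The function names are hypothetical, and this is not necessarily how the paper or the L2R evaluation code computes the metric.

```python
import numpy as np


def dice_per_label(fixed_seg, warped_seg, labels):
    """Return {label: DSC} for two integer label maps of identical shape.

    Hypothetical helper for illustration; labels absent from both maps get NaN.
    """
    fixed_seg = np.asarray(fixed_seg)
    warped_seg = np.asarray(warped_seg)
    if fixed_seg.shape != warped_seg.shape:
        raise ValueError("segmentations must have the same shape")
    scores = {}
    for label in labels:
        a = fixed_seg == label
        b = warped_seg == label
        denom = int(a.sum()) + int(b.sum())
        if denom == 0:
            # Label is missing from both maps, so the DSC is undefined.
            scores[label] = float("nan")
        else:
            scores[label] = 2.0 * int(np.logical_and(a, b).sum()) / denom
    return scores


def mean_dice(fixed_seg, warped_seg, labels):
    """Average the per-label DSC, ignoring labels that are undefined (NaN)."""
    values = list(dice_per_label(fixed_seg, warped_seg, labels).values())
    return float(np.nanmean(values))
```

The background label (usually 0) is normally left out of `labels`, and the reported value is typically the mean over labels and then over all validation pairs.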

vigsivan commented 1 year ago

Regarding reproducibility, please see the weights in the train_oasis folder. A100s are available on Narval.

I'm closing this issue, but feel free to follow up if you come across anything else.