mueller-franzes / medfusion

Implementation of Medfusion - A latent diffusion model for medical image synthesis.
MIT License

Memorizing training images? #3

Open prash-p opened 1 year ago

prash-p commented 1 year ago

I was wondering if any tests were done to check whether the generated/sampled images were resized copies of the training images? Is this a potential problem with the VAE decoder or diffuser overfitting during training?

mueller-franzes commented 1 year ago

Yes, this is indeed a well-known problem. With GANs, the problem can occur as a "mode collapse" (https://developers.google.com/machine-learning/gan/problems#mode-collapse). I have tried to measure this using various metrics in the paper.

I also evaluated the VAE with independent test data (never seen before by the model) and could not detect any overfitting.

Does that answer your question?

prash-p commented 1 year ago

Thanks for answering! I am training Medfusion on my own dataset, which is smaller than the ones in your experiments (~13,000 images, 256x256 px), and I am concerned that the VAE might learn latent embeddings which, when sampled, would produce data too close to the training set. How can I control the regularization? I was expecting to be able to do something like adjusting the weight of the KL loss in the VAE.

mueller-franzes commented 1 year ago

You can simply train the VAE on a public dataset or use the pre-trained VAE I uploaded. If you are training from scratch, I would recommend setting embedding_loss_weight=0 (it seems beneficial both for the performance of the VAE and for training the diffusion model later).
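To make concrete what that weight controls, here is a generic sketch of a VAE training loss with a weighted KL (embedding) term; this is not the exact loss implementation in this repository, just an illustration of the idea:

```python
import torch
import torch.nn.functional as F

def vae_loss(recon, target, mu, logvar, embedding_loss_weight: float = 0.0):
    """Reconstruction loss plus a weighted KL (embedding) term.

    Setting embedding_loss_weight=0 drops the KL regularization entirely;
    a small positive value re-enables it.
    """
    recon_loss = F.l1_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + embedding_loss_weight * kl
```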

You can check with sample_latent_embedder.py or evaluate_latent_embedder.py how the image quality is before and after the VAE.

If you have time, let me know if it works, I'm curious.

prash-p commented 1 year ago

Sounds good, I'll play with the embedding_loss_weight parameter and see what happens. So far I have managed to train from scratch and see good performance; however, ~30-40% of the generated images seem to be very similar to images in the training set.

prash-p commented 1 year ago

Small update: training a VAE and diffusion pipeline from scratch on a dataset of 13,000 images (class 0: 11,000; class 1: 2,000) results in FID: 90, Precision: 0.719, Recall: 0.022. As mentioned above, a large proportion of the generated images are very similar to training set images. How would I control the training to make the recall larger? You mentioned that

> I have tried to measure this using various metrics in the paper.

Which metrics exactly would measure this overfitting? Here are some examples where the generated data is very similar to the training data: [attached image pairs: compare_test_1_0, compare_test_None_4]

mueller-franzes commented 1 year ago

Thanks for sharing the results, it's very interesting! As you say yourself, you want to increase the recall (which measures diversity). Your recall of 0.02 is extremely low (my lowest value was 0.32). If you use classifier-free guidance, you should try to reduce the guidance scale: as you can see in Table 4 of https://arxiv.org/abs/2105.05233, classifier guidance improves precision at the cost of recall. I hope this helps :)
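For context, classifier-free guidance mixes the conditional and unconditional noise predictions with a guidance scale; lowering the scale moves samples toward the unconditional prediction, which typically recovers diversity (recall) at the cost of precision. A minimal sketch of that mixing step (not the exact sampling code of this repository):

```python
import torch

def cfg_noise(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, guidance_scale: float) -> torch.Tensor:
    """Classifier-free guidance mixing of the two noise predictions.

    guidance_scale = 1.0 is plain conditional sampling; larger values sharpen
    the conditioning (higher precision, lower recall), while values closer to
    0 favour the unconditional prediction (more diversity).
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```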

prash-p commented 1 year ago

Thanks! I appreciate your responses. I will take a look at the code and parameters in more detail to see how the training/overfitting can be improved. I trained on another much larger dataset (144k images), and found that the recall was much better without changing any parameters and there was no obvious image regurgitation.

fengchuanpeng commented 1 year ago

Hello, why do I get the following error when importing my own dataset? ValueError: num_samples should be a positive integer value, but got num_samples=0

Shame-fight commented 1 year ago

> Small update: Training a VAE and diffusion pipeline from scratch on a dataset of 13000 images [...] Which metrics exactly would measure this overfitting?

Hello, is your dataset public and available? I noticed your data is thyroid ultrasound?

prash-p commented 1 year ago

> Hello, is your dataset public and available? I noticed your data is thyroid ultrasound?

Sorry, this is bone surfaces in ultrasound, not thyroid. The dataset will be made public shortly (it is currently being reviewed for publication).

Shame-fight commented 1 year ago

> Sorry, this is bone surfaces in ultrasound, not thyroid. The dataset will be made public shortly (currently being reviewed for publication).

Thank you for your reply. May I ask what the overall process is for training and generating data? I trained on my own dataset following the instructions, but the final generation quality is very poor and there is a significant gap between it and the original dataset. I am not sure which code needs to be modified.

prash-p commented 1 year ago

@Shame-fight where are you getting stuck?

I followed the instructions the authors provided in the readme.md file, and the VAE and diffusion model training and sampling worked as expected.

taranrai commented 1 year ago

> Small update: Training a VAE and diffusion pipeline from scratch on a dataset of 13000 images [...] Which metrics exactly would measure this overfitting?

Hi, how many epochs and what other hyperparameters did you use to train this model?

Has anyone used the provided pretrained weights for a pathology dataset? My dataset is around 17,000 samples from one tissue type (512x512 pixels), and I am wondering if there is any way to optimise for my use case. I'm also wondering if there is a minimum number of epochs required for training.

prash-p commented 1 year ago

@taranrai Just start with the default hyperparameters and monitor performance as the model trains; the best checkpoints are saved along the way. For the VAE I have found 50-100 epochs to be sufficient, and for the diffusion model itself 20-50 epochs.

taranrai commented 11 months ago

> Small update: Training a VAE and diffusion pipeline from scratch on a dataset of 13000 images [...] Which metrics exactly would measure this overfitting?

Many thanks for this evaluation. What method did you use to check whether images are copies from the training set? I've used perceptual hashing for some random samples (and with different orientations as augmentations were applied). However, I'm aware that differences in brightness and contrast can sometimes generate very different hash values between two similar images. I'm just wondering if there's a quicker alternative for checking copies that is more efficient than using a feature extraction approach...

prash-p commented 11 months ago

It was a very simple brute-force search: I computed the SSIM between randomly sampled generated images and every training set image, then visualized the image pairs with the highest SSIM.
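Roughly like this (folder paths and the image size are placeholders, not my exact script):

```python
import numpy as np
from pathlib import Path
from PIL import Image
from skimage.metrics import structural_similarity as ssim

def load_gray(path, size=(256, 256)):
    # load as grayscale and resize so all images are directly comparable
    return np.asarray(Image.open(path).convert("L").resize(size), dtype=np.float64)

train = [(p, load_gray(p)) for p in Path("data/train").glob("*.png")]

for gen_path in Path("results/samples").glob("*.png"):
    gen = load_gray(gen_path)
    # brute force: best-matching training image by SSIM
    best_path, best_score = max(
        ((p, ssim(gen, img, data_range=255.0)) for p, img in train),
        key=lambda t: t[1],
    )
    print(f"{gen_path.name}: closest training image {best_path.name} (SSIM={best_score:.3f})")
```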

I think a more efficient approach would be to save encodings of all training images using an ImageNet-pretrained (or other) convolutional encoder and then find the shortest distance from a Medfusion-generated image's encoding to the set of training image encodings.
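A minimal sketch of that idea, assuming a torchvision ResNet-18 as the encoder and placeholder folder paths:

```python
import torch
import torchvision.transforms as T
from torchvision import models
from PIL import Image
from pathlib import Path

device = "cuda" if torch.cuda.is_available() else "cpu"

# ImageNet-pretrained encoder with the classification head removed
encoder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
encoder.fc = torch.nn.Identity()
encoder = encoder.eval().to(device)

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    feats = []
    for p in paths:
        x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0).to(device)
        feats.append(torch.nn.functional.normalize(encoder(x), dim=-1))
    return torch.cat(feats)

train_paths = sorted(Path("data/train").glob("*.png"))        # placeholder folders
gen_paths = sorted(Path("results/samples").glob("*.png"))

train_feats = embed(train_paths)
gen_feats = embed(gen_paths)

# cosine similarity of each generated image to its nearest training image;
# values close to 1.0 are candidates for memorized/near-duplicate samples
sims = gen_feats @ train_feats.T
best_sim, best_idx = sims.max(dim=1)
for g, s, i in zip(gen_paths, best_sim.tolist(), best_idx.tolist()):
    print(f"{g.name}: closest training image {train_paths[i].name} (cosine={s:.3f})")
```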