perceivelab / PLAN


Training, Validation dataset partition for auxiliary classifier ϕ_class #2

Closed Huiyu-Li closed 7 months ago

Huiyu-Li commented 10 months ago

Hello,
As you said, the auxiliary classifier ϕ_class performs identity classification with a single image per identity. In this case, it means every single image has a unique identity. So, I am curious about the training and validation dataset partitioning, together with their corresponding labels.

Thanks a lot!

matteo-pennisi commented 9 months ago

Hello, since your question regards identity classification, I suppose that you are referring to ϕ_id (the identity classifier) and not ϕ_class, which is the label classifier.

> As you said, the auxiliary classifier ϕ_class performs identity classification with a single image per identity. In this case, it means every single image has a unique identity.

You are correct: in our case, we have a dataset with one image for each identity. But as stated in the paper, these are augmented by classic medical imaging augmentations and multiple GAN projections, so in practice for each identity we have the original image + N GAN projections, and during training they are randomly augmented.
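Concretely, this means the ϕ_id dataset can be built as one class per patient, where each class contains the original image plus its GAN projections, all randomly augmented at load time. Below is a minimal sketch of such a dataset in PyTorch; the directory layout, file names, and choice of augmentations are assumptions for illustration, not the authors' actual code.

```python
# Minimal sketch (not the authors' code) of the dataset described above:
# one identity = one original image plus its N GAN projections, with a
# random augmentation applied on every access. The directory layout,
# file names, and augmentations are assumptions for illustration.
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class IdentityDataset(Dataset):
    def __init__(self, root: str, n_projections: int):
        # Hypothetical layout: root/<patient_id>/original.png and
        # root/<patient_id>/proj_{0..N-1}.png (the GAN projections).
        self.samples = []  # list of (image_path, identity_label)
        ids = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
        for label, pid in enumerate(ids):
            self.samples.append((Path(root) / pid / "original.png", label))
            for k in range(n_projections):
                self.samples.append((Path(root) / pid / f"proj_{k}.png", label))
        # Stand-ins for "classic medical imaging augmentations".
        self.transform = transforms.Compose([
            transforms.RandomHorizontalFlip(),
            transforms.RandomAffine(degrees=10, translate=(0.05, 0.05)),
            transforms.ColorJitter(brightness=0.2, contrast=0.2),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        img = Image.open(path).convert("RGB")
        return self.transform(img), label
```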

> So, I am curious about the training and validation dataset partitioning, together with their corresponding labels.

The train/val/test split reported in the paper is done for the evaluation of the downstream classifier ϕ_down and for the GAN training. The goal of the identity classifier is to recognize identities inside the training set, which are the same ones seen by the Generator, in order to give signals to avoid them. For this reason, there is no validation or test set for ϕ_id; you can simply train it until the loss stops decreasing.

So in practice you can train your identity classifier with a standard classification training procedure, where the labels are just the patient IDs and the images are augmented as stated above. We adopted a ResNet18 whose last layer has as many outputs as the number of identities. Regarding your other issue #1, for now I think you have all the elements to train your identity classifier, since it is a standard classification training procedure, but I hope to add some mockup code in the future to make this point clearer.
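In that spirit, here is a hedged sketch of such a standard training loop (not the repository's code): a ResNet18 whose final layer is resized to the number of training identities, trained with cross-entropy on patient-ID labels until the loss plateaus. It reuses the hypothetical `IdentityDataset` from the snippet above; all paths and hyperparameters are illustrative.

```python
# Hedged sketch of the standard training procedure described above, not the
# repository's code: a ResNet18 with one output per training identity,
# cross-entropy on patient-ID labels, trained until the loss stops
# decreasing. Reuses the hypothetical IdentityDataset from the snippet
# above; all hyperparameters are illustrative.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision.models import resnet18

num_identities = 7000  # e.g., 70% of a 10k single-image dataset (assumption)
dataset = IdentityDataset("data/train_identities", n_projections=4)
loader = DataLoader(dataset, batch_size=64, shuffle=True, num_workers=4)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, num_identities)  # one output per identity
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# No val/test split for phi_id: monitor the training loss and stop when it plateaus.
for epoch in range(100):
    running = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running += loss.item() * images.size(0)
    print(f"epoch {epoch}: mean loss {running / len(dataset):.4f}")
```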

matteo-pennisi commented 9 months ago

I'll close the issue for now, but feel free to reopen it if you need further clarification :)

Huiyu-Li commented 7 months ago

Hello, great thanks for your earlier reply. I was wondering about the number of identities in identity classification: if it is treated as a multi-class classification problem, the number of identities should be equal to the size of the network's predictions. Also, when you partition the dataset into trainSet, validSet, and testSet, how do you make sure each subset has overlapping identities? If you randomly separate them into three subsets, it seems like the validSet/testSet may have new identities that are not present in the trainSet. Thanks in advance!

matteo-pennisi commented 7 months ago

You are correct, there are no overlapping identities in the train/val/test split. The identity classifier (trained on the training set) is never evaluated against val and test, as its only objective is to provide information about how to stay away from training-set images when sampling from the GAN. The val and test splits are used for the evaluation of the downstream classifier (the classifier trained on the generated images) and for the Membership Inference Attack.
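For readers wondering how such an identity-disjoint split can be produced: since there is one image per identity, shuffling the list of patient IDs and slicing it 70/10/20 already guarantees that no identity crosses subsets. A minimal sketch (the IDs and seed are hypothetical; only the percentages come from the paper):

```python
# Minimal sketch of an identity-disjoint 70/10/20 split. With one image per
# identity, shuffling the patient IDs and slicing the list guarantees that
# no identity appears in more than one subset. IDs and seed are hypothetical;
# only the 70/10/20 percentages come from the paper.
import random

patient_ids = [f"patient_{i:05d}" for i in range(10_000)]  # hypothetical IDs
rng = random.Random(42)
rng.shuffle(patient_ids)

n = len(patient_ids)
n_train, n_val = int(0.7 * n), int(0.1 * n)
train_ids = patient_ids[:n_train]                # phi_id + GAN training
val_ids = patient_ids[n_train:n_train + n_val]   # downstream classifier eval
test_ids = patient_ids[n_train + n_val:]         # downstream eval + MIA

assert set(train_ids).isdisjoint(val_ids) and set(train_ids).isdisjoint(test_ids)
```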

Huiyu-Li commented 7 months ago

Great thanks for your answer. Could you please give more information about the number of identities in the training set? If identity classification is formulated as a multi-class classification problem, the number of identities should be equal to the size of the network's predictions. So I was just wondering about the size of the identity classifier's outputs. Thanks in advance.

matteo-pennisi commented 7 months ago

As you said, the size of the network's predictions is equal to the number of identities in the training set. As stated in the paper, 70% of the original dataset is used for training, 10% for validation, and 20% for test. In our specific case we had 1 image for each identity, so the number of identities is 70% of the number of images.
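As a worked example (the dataset size here is illustrative, not from the paper): with 10,000 single-image identities, the training split would contain 7,000 identities, so ϕ_id's final layer would have 7,000 outputs, with 1,000 images held out for validation and 2,000 for test.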