About the cluster size analysis and the uniform distribution assumption for each cluster

salesforce / PCL

PyTorch code for "Prototypical Contrastive Learning of Unsupervised Representations"

MIT License

About the cluster size analysis and the uniform distribution assumption for each cluster #9

Closed ChangLee0903 closed 3 years ago

ChangLee0903 commented 3 years ago

I very much appreciate this work, and thank you for providing the implementation code. In your derivation of the proto loss term, the cluster size P(c; theta) is assumed to be 1/k, but your balance analysis mentions that the clusters might be imbalanced. I am curious why you made this assumption instead of using each cluster's sample count. Have you tried such a setting in your experiments?

BTW, could you tell me how the performance is affected when the imbalance problem happens?

best, Chi-Chang Lee.

LiJunnan1992 commented 3 years ago

Hi, thanks for the question! The uniform assumption is a prior we have on the distribution of prototypes. It does not necessarily correlate with the cluster size.
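
To make the role of the prior concrete, here is a short sketch of how a prior over clusters enters an EM-style objective. The notation is assumed for illustration and may differ from the paper's: $\pi_c$ denotes a general prior $p(c; \theta)$, $Q(c_i)$ the posterior over the cluster assignment of sample $x_i$, $n$ the number of samples, and $k$ the number of clusters.

```latex
% The joint likelihood factorizes into a conditional term and a prior over clusters
p(x_i, c_i; \theta) = p(x_i \mid c_i; \theta)\, p(c_i; \theta)

% With a general prior \pi_c = p(c; \theta), the expected log-likelihood is
\sum_{i=1}^{n} \sum_{c_i} Q(c_i) \left[ \log p(x_i \mid c_i; \theta) + \log \pi_{c_i} \right]

% With the uniform prior \pi_c = 1/k, and since \sum_{c_i} Q(c_i) = 1,
% the prior term sums to a constant that does not affect the maximization
\sum_{i=1}^{n} \sum_{c_i} Q(c_i) \log \frac{1}{k} = -n \log k
```

With a non-uniform prior the term $\log \pi_c$ would differ across clusters and would enter the posterior over assignments, since that posterior is proportional to $p(x_i \mid c_i; \theta)\, \pi_{c_i}$. Whether such a prior helps is not reported in this thread.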

I have observed the performance to drop by a few percent if the clusters are imbalanced.
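
One way to quantify how imbalanced a clustering is, as a minimal sketch: it assumes the cluster assignments are available as an integer array with values in `[0, k)`. The function name `cluster_balance` and the input format are hypothetical and are not taken from the PCL repository.

```python
import numpy as np


def cluster_balance(assignments, k):
    """Return per-cluster counts, empirical frequencies, and normalized entropy.

    A normalized entropy of 1.0 means perfectly balanced clusters;
    values closer to 0 mean most samples fall into a few clusters.
    """
    assignments = np.asarray(assignments)
    # Number of samples assigned to each of the k clusters (empty clusters get 0)
    counts = np.bincount(assignments, minlength=k)
    # Empirical cluster frequencies, i.e. a data-driven alternative to 1/k
    freqs = counts / counts.sum()
    # Entropy over non-empty clusters (0 * log 0 is treated as 0)
    nonzero = freqs[freqs > 0]
    entropy = -(nonzero * np.log(nonzero)).sum()
    # Divide by the maximum possible entropy, log(k), reached by a uniform split
    normalized_entropy = entropy / np.log(k) if k > 1 else 1.0
    return counts, freqs, normalized_entropy


# Toy example: 6 samples, 3 clusters, unevenly split
counts, freqs, h = cluster_balance([0, 0, 0, 1, 2, 2], k=3)
print(counts, freqs, h)
```

Tracking the normalized entropy across clustering runs would give a single number to compare against downstream accuracy, which is one way to study the effect described above.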

Hzzone commented 3 years ago

Maybe it is used to avoid the trivial solution of clustering. Please see "Unsupervised Learning of Visual Features by Contrasting Cluster Assignments".

ChangLee0903 commented 3 years ago

@Hzzone Thanks for your explanation! I am just curious whether balanced cluster sizes help the performance. Intuitively, I would imagine the clusters as some high-level attributes. Assuming that imbalanced cluster sizes wouldn't lead to the trivial solution, I don't think each attribute really needs to have the same size; what I want to know is how the cluster sizes affect the performance. But anyway, my question has been answered, thanks a million!