salesforce / PCL

PyTorch code for "Prototypical Contrastive Learning of Unsupervised Representations"
MIT License

about run-kmeans #6

Closed: Hiiragi0107 closed this issue 3 years ago

Hiiragi0107 commented 3 years ago

When I look at run_kmeans, I see that it uses L2 distance for clustering. Is L2 distance better than cosine distance?

index = faiss.GpuIndexFlatL2(res, d, cfg)

LiJunnan1992 commented 3 years ago

Hi, since the features are normalized, L2 distance and cosine distance give the same results.

Hiiragi0107 commented 3 years ago

I see. I'm sorry for my ignorance. T_T

Hzzone commented 3 years ago

As far as I know, if the cluster centers are not normalized, L2 distance is not equivalent to cosine distance. The faiss documentation says we should use the spherical parameter. Is that right?

Hzzone commented 3 years ago

Dear authors, first, thank you for your paper, which provides a simple yet effective approach. I have reproduced it and improved the results a lot on CIFAR10 with a MoCo architecture. However, I have another issue: the dataset is distributed across the processes and the features are all-reduced. As in https://github.com/salesforce/PCL/blob/018a929c53fcb93fd07041b1725185e1237d2c0e/main_pcl.py#L304, the features of the repeated samples are divided by 2, but the two copies are not identical (they come from different processes), so after averaging they are no longer unit-normalized.

Moreover, when extracting the features and running k-means, there is label leakage, since the dataset has not been shuffled. Evidence for this is that performance drops by about 0.5% in clustering accuracy if I shuffle the dataset before running k-means.

What I mean is that, when the dataset is not shuffled, the repeated samples always fall at the boundaries between classes.

LiJunnan1992 commented 3 years ago

Hi, thanks for raising this issue!

You are right that the few repeated features should be re-normalized. However, do you observe a big difference between the features that different processes produce for the same image? We expect them to be mostly the same, because the models on different processes differ only slightly in their BN stats.

It is unclear to me why the repeated samples would usually appear at the boundary of each class. This seems unlikely for ImageNet, because there are only 8 processes but 1000 classes.

Hzzone commented 3 years ago

@LiJunnan1992 Thanks for your response.

It is unclear to me why the repeated samples would usually appear at the boundary of each class. This seems unlikely for ImageNet, because there are only 8 processes but 1000 classes.

The dataset is not shuffled, so the duplicated items are always located at the boundary between different classes. This can be seen from https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler

The effect is not very significant: for CIFAR10, it was about 0.5% in clustering accuracy.

However, I have tried my own idea and even achieved 95% accuracy on CIFAR10 with ResNet-18, gathering the features in the same way.

Anyway, thanks for your answer.

MaxTorop commented 2 years ago

@Hzzone I'm really curious which hyperparameters you used to get good results on CIFAR10. Did you make any other changes to the code? (I assume you used a CIFAR10-style ResNet and the common CIFAR10 augmentations.) Also, was your 95% accuracy measured as Acc@Proto, or with a linear/kNN probe?

This would all be super helpful for me :-) !


SUNziwei0527 commented 1 year ago


@Hzzone I'm also curious about how you implemented this. Could you share some insights?