jforjohn / canopyKmeans_improved

This is an implementation of the paper on "Improved K-means algorithm based on density Canopy".

FutureWarning: jaccard_similarity_score has been deprecated and replaced with jaccard_score. It will be removed in version 0.23. This implementation has surprising behavior for binary and multiclass classification tasks. --- #2

Closed zuihu closed 3 years ago

zuihu commented 3 years ago

Hi, may I ask what these warnings are and what impact they have? I've only just started working with the k-means algorithm. What do ERR, ARI, Rep, JC, AMI, TB, BD, and SC mean? Thanks!

jforjohn commented 3 years ago

Hello,

ERR is the same error metric as the one in sklearn (the sum of squared distances). REP is the number of repetitions needed. The rest are abbreviations of the external and internal validation metrics. Check Validation.py to see which ones were used.
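For readers new to k-means, here is a minimal sketch of what a sum-of-squared-distances error looks like. The function name and the toy data are made up for illustration and are not taken from the repository; sklearn exposes the same quantity as `KMeans.inertia_`.

```python
def sum_of_squared_distances(points, labels, centers):
    """Sum, over all points, of the squared Euclidean distance
    between each point and the center of its assigned cluster."""
    total = 0.0
    for point, label in zip(points, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Toy data (illustrative only): two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centers = [(0.0, 1.0), (10.0, 1.0)]

err = sum_of_squared_distances(points, labels, centers)
print(err)  # 4.0: every point is at distance 1 from its center
```

A lower value means the points sit closer to their assigned centers, which is why it is useful for comparing runs with the same k.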

jforjohn commented 3 years ago

Concerning the warning, it is what it says: the sklearn version used in the code and the one you have installed are different. For now, the results should not be affected.
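If the warning is only noise in your output, one option is to silence it around the call that triggers it. This is a minimal sketch using Python's standard `warnings` module; `legacy_metric` is a hypothetical stand-in for the deprecated call, not a function from this repository or from sklearn.

```python
import warnings


def legacy_metric():
    # Hypothetical stand-in for a library call that emits a FutureWarning.
    warnings.warn("old_name is deprecated; use new_name", FutureWarning)
    return 1.0


# Suppress FutureWarning only inside this block; filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    score = legacy_metric()

print(score)
```

Silencing only hides the message. Once the deprecated function is actually removed from sklearn, the code would have to move to the replacement named in the warning.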

zuihu commented 3 years ago

Thank you very much for your answer. The canopy algorithm can give the best K value, so do I still need to specify the number of K values in k-Means.cfg? And are MyCanopyKmeans, MyKmeans++, StandCanopyKmeans, and so on the eight new algorithms that result from optimizing the canopy and K-means algorithms?

zuihu commented 3 years ago

And do you need to use the k-means algorithm to calculate the anchor sizes? I only see the number of K values being calculated.

jforjohn commented 3 years ago

This code compares the algorithms mentioned in the readme. The k-means that uses canopy as its initialization doesn't need the k from the config file; that k is used for the rest of the algorithms. Canopy could also be applied as the initialization for the other algorithms (either their custom implementation or one from any other package). This is not implemented in this code, but it wouldn't need many adjustments. I don't get what you mean by anchors size here.
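As a rough illustration of using canopy output to initialize k-means from another package, the sketch below passes precomputed centers to sklearn's `KMeans` through its `init` parameter. The `canopy_centers` array is a hypothetical stand-in for whatever the canopy step returns, and the data is synthetic; none of this is code from the repository.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (illustrative only): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in true_centers])

# Hypothetical stand-in for the centers found by a canopy step.
canopy_centers = np.array([[0.5, 0.5], [9.0, 1.0], [1.0, 9.0]])

# k comes from the number of canopy centers, not from a config file.
# n_init=1 because the starting centers are given explicitly.
km = KMeans(n_clusters=len(canopy_centers), init=canopy_centers, n_init=1).fit(X)

print(km.cluster_centers_.round(1))
```

The number of clusters is taken from the canopy centers themselves, which matches the point above that the canopy-initialized variant does not need a k from the config file.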

zuihu commented 3 years ago

Thanks. By anchor sizes I mean the 9 anchors in YOLOv3 (10*13, 16*30, 33*23, etc.). How can I get these candidate boxes from my dataset?

zuihu commented 3 years ago

If my dataset is large enough to cause a memory error, how can I modify the code? (There is no problem with my Python version; I only have 16 GB of memory, and I can't run it in Colab.)

jforjohn commented 3 years ago

Well, Canopy needs a distance matrix. First try it on a much smaller dataset to see how it works. Then you may need to calculate the distance matrix separately: try to do it in a small script on your machine, or otherwise on another machine, and have the Canopy code read it instead of calculating it. With 16GB, a full float64 matrix of roughly 46,000x46,000 is the most that fits in memory (roughly 65,000x65,000 with int32), and less in practice because everything else needs memory too. I think that without the distance matrix the algorithm will be much slower, but this is something you need to experiment with, and check the paper it was derived from again.
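A back-of-the-envelope way to check whether a full distance matrix fits in memory is sketched below. It assumes a dense n x n matrix and ignores the dataset itself and any temporary copies, so real limits are lower; the function names are made up for illustration.

```python
import math

GIB = 2 ** 30  # bytes in one GiB


def distance_matrix_bytes(n_points, itemsize=8):
    """Bytes needed for a dense n x n matrix (itemsize=8 for float64)."""
    return n_points * n_points * itemsize


def max_points_for_budget(budget_bytes, itemsize=8):
    """Largest n such that a dense n x n matrix fits in budget_bytes."""
    return math.isqrt(budget_bytes // itemsize)


print(distance_matrix_bytes(100_000) / GIB)          # GiB needed for 100,000 points
print(max_points_for_budget(16 * GIB, itemsize=8))   # 46340 (float64)
print(max_points_for_budget(16 * GIB, itemsize=4))   # 65536 (int32 or float32)
```

Memory grows with the square of the number of points, so halving the element size only raises the maximum n by a factor of about 1.4. Beyond that, the matrix has to be computed in pieces or kept on disk.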

zuihu commented 3 years ago

OK, thanks.