jmonlong / Hippocamplus

A place to gather useful information I keep on forgetting.
http://jmonlong.github.io/Hippocamplus/
GNU General Public License v3.0

Implementation in other programming languages for ClusterEqualSize #3

Open xiaogu-space opened 5 years ago

xiaogu-space commented 5 years ago

Hi, I'm interested in ClusterEqualSize (https://github.com/jmonlong/Hippocamplus/blob/master/content/post/2018-06-09-ClusterEqualSize.Rmd) but I don't use R. Is there an implementation in another language, like Python, Java, or Node?

jmonlong commented 5 years ago

Hello, I haven't implemented this in other languages, but I'm sure there are ways to reproduce the methods. For example, Python must have modules for hierarchical clustering or nearest-neighbor distance computation. I've always done this kind of thing alongside data visualization, so I use R for it.

xiaogu-space commented 5 years ago

OK, I will keep following this project.

tomicapretto commented 5 years ago

Hi, I really like what you did! For one of the tasks in my thesis I'm working on spatial clustering with equal sizes, similar to what is done here: https://statistical-research.com/index.php/2013/11/04/spatial-clustering-with-equal-sizes/

However, I am creating sampling areas in each city of Texas, and some cities are very big and consequently time-consuming. I tried your approach, but a 100k x 100k distance matrix is intractable. Worse, one of the cities has 1M households, so it would produce a 1M x 1M matrix.

My question is the following: have you thought of another way of calculating or storing the distances that consumes less memory? Thank you.

jmonlong commented 5 years ago

Nice! Good to know it might be useful to someone else.

I don't know much about better ways to store distances. Maybe R or other languages can use disk-stored rather than memory-stored distance objects, but I've never used them. That might help with your memory issue, but the computation might still take too long.
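
As a hedged illustration of that disk-backed idea (untested, and the bigmemory package is an assumption here, not something from the post; the ff package is a similar alternative), a file-backed distance matrix could look like this. Note that a 100k x 100k double matrix still means roughly 80 GB on disk:

```r
## Untested sketch: file-backed pairwise distance matrix with the
## bigmemory package, filled one row at a time to keep RAM flat.
library(bigmemory)

## toy coordinates; replace with the real 2-column coordinate matrix
xy <- cbind(runif(5000), runif(5000))
n <- nrow(xy)

d <- filebacked.big.matrix(n, n, type='double',
                           backingfile='dist.bin',
                           descriptorfile='dist.desc')
for(i in 1:n){
  ## Euclidean distances from point i to all points
  d[i, ] <- sqrt(colSums((t(xy) - xy[i, ])^2))
}
```
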

A more feasible solution for large datasets could be to use either the k-means or kNN approach, as they don't necessarily require computing all the distances. The three easiest things I would try:

  1. Try using the k-means function kmvar. It didn't perform well in my tests, but at least it scales well and should run on your data.
  2. Use kmeans to cut your big cities into smaller blocks (still much bigger than your final cluster size) and then use your favorite approach within each block (see the second sketch after this list).
  3. Tweak the nnit function to compute only the distances it needs. In that case you can only use the "random" strategy: pick a point randomly, compute distances between this point and all other points, define a cluster with the nearest points, and repeat (see the first sketch below). Let me know if you need help with that.
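
To make these concrete, here is a minimal, untested sketch of idea 3. It is a from-scratch illustration of the "random" strategy, not the original nnit function, and `nnRandomClust` is a made-up name: it computes one row of distances per cluster instead of the full distance matrix.

```r
## Equal-size clusters via the "random" strategy: pick a seed point,
## cluster it with its nearest unassigned neighbors, repeat.
## xy: 2-column coordinate matrix; clsize: target cluster size.
nnRandomClust <- function(xy, clsize){
  lab <- rep(NA, nrow(xy))
  unassigned <- seq_len(nrow(xy))
  cl <- 1
  while(length(unassigned) > 0){
    ## pick a seed among the unassigned points (sample.int avoids the
    ## sample() pitfall when a single point remains)
    seed <- xy[unassigned[sample.int(length(unassigned), 1)], ]
    ## distances from the seed to the unassigned points only
    dists <- sqrt(colSums((t(xy[unassigned, , drop=FALSE]) - seed)^2))
    ## the seed (distance 0) plus its nearest unassigned neighbors
    members <- unassigned[order(dists)[seq_len(min(clsize, length(unassigned)))]]
    lab[members] <- cl
    unassigned <- setdiff(unassigned, members)
    cl <- cl + 1
  }
  lab
}
```

And a sketch of idea 2, using kmeans to pre-split the points into blocks and then clustering within each block (here with the function above, but any equal-size method would do):

```r
set.seed(1)
xy <- cbind(runif(1e5), runif(1e5))
## pre-split 100k points into ~50 blocks of ~2,000 points each
blocks <- kmeans(xy, centers=50, iter.max=50)$cluster
lab <- rep(NA, nrow(xy))
for(bl in unique(blocks)){
  ii <- which(blocks == bl)
  ## equal-size clusters of 20 within each block, labeled "block-cluster"
  lab[ii] <- paste(bl, nnRandomClust(xy[ii, , drop=FALSE], 20), sep='-')
}
table(table(lab))  ## distribution of cluster sizes
```
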
tomicapretto commented 5 years ago

Thanks a lot! I find your suggestions very useful and will try to implement some of them. I will let you know if it works :)

tomicapretto commented 5 years ago

@jmonlong Thanks a lot for the ideas. I am running a program that uses the second idea you suggested and it is working perfectly. If you are interested, I can share the results and more about the thesis.