tlk00 / BitMagic

BitMagic Library
http://bitmagic.io
Other
412 stars 48 forks source link

Binary self organizing map #53

Open fsicre opened 4 years ago

fsicre commented 4 years ago

Hi,

First I want to thank you for this awesome library.

I'm experimenting on binary NLP technics and I was wondering if you could publish a bitmagic version of SOM (self organizing map) using about ~ 10 millions of extremely sparse (< 5 / 100 000) binary vectors of same size ~ 1Mbits. The map could be for instance 128 x 128 locations.

It could be a huge step froward for me to see a correct use of bitmagic on this particular subject.

Thanks for your help.

tlk00 commented 4 years ago

Are you asking for an example on how to use BM? Could you provide more info on the case, I am not sure I understand the 128x128 locations problem really well.

Thank you!

fsieduc commented 4 years ago

Are you asking for an example on how to use BM?

yes, exactly, on the use case of a Self Organizing Map of binary vectors.

Usually, SOM are used on integers or float vectors as you can see on these to pages : https://en.wikipedia.org/wiki/Self-organizing_map https://www.google.com/search?q=self+organizing+map&tbm=isch

In my case I need to do this kind of classification on sparse binary vectors.

BitMagic seems to be a very very good candidate for that.

The naïve SOM algorithm is very simple :

The map will progressively evolved and finally each vector to classify will have a best location identified and all similar vectors in the ~ 10 millions will also be associated to locations nearby on the map.

Now imaging you have ~ 10 millions of extremely sparse binary vectors (same size ~ 1Mbits) and you want to place them on a 2D map of 128 x 128 locations. Hamming Distance can be used. The map will be a 2D matrix of binary vectors of ~ 1Mbits. The algorithm on the picture below will be used in order to "hack" the locations' vectors.

As you can see, a lot of vectors will be loaded in RAM, a lot of BV comparisons will be performed, a lot of hamming distance will be calculated, a lot of bit modification will be done, so the question is

how to correctly use BitMagic for best / maximum speed ? (32/64/128Go RAM available)

https://www.semanticscholar.org/paper/An-alternative-approach-for-binary-and-categorical-Santana-Morais/f1f079640218b72822911ef3dc7394765703d62a/figure/0 https://d3i71xaburhd42.cloudfront.net/f1f079640218b72822911ef3dc7394765703d62a/3-Figure1-1.png

tlk00 commented 4 years ago

Yes, i understand the problem. The closest example publicly available now is: https://github.com/tlk00/BitMagic/tree/master/samples/xsample07a

It creates binary fingerprints and implements primitive (but relatively fast) clusterization using similarity measure (COUNT_AND). Not quite well documented yet, but the code should be instructive as a use case for how to compute binary distances.

I am planning to create a separate example on how to do a more efficient distance computations for Hamming, Jacquard distances using cache locking and SIMD. Current distance compute is fast but can be even faster actually to facilitate binary SOMs.

fsieduc commented 4 years ago

Awesome, many thanks for the hint.

fsieduc commented 4 years ago

xsample07a is very interesting, although it's difficult to understand exactly why you've choosen some technics. never mind.

this project is a side project for me.

once I have written something (in few weeks) would you mind my showing to you ? in order to have your advices concrete bitmagic usage.

tlk00 commented 4 years ago

Surest thing. I would certainly appreciate your example, please share it.