neurodata / SPORF

This is the implementation of Sparse Projection Oblique Randomer Forest
https://neurodata.io/forests/

Short Question About Implementation #359

Open adam2392 opened 2 years ago

adam2392 commented 2 years ago

Hi @MrAE, @jbrowne6 and @falkben

Just pinging the people who seem to have touched these specific lines of code.

I know you guys don't maintain this code anymore and have moved on, but I had a quick question about what a specific line is doing. If you happened to write this part, could you give a quick answer so I can make sure I'm interpreting it correctly? FYI: I have ported the code to Cython, and once this issue is resolved, I think we can safely move on :)

In https://github.com/neurodata/SPORF/blob/a7a3c7e6df457b722de86d7254f8a7724b27978f/packedForest/src/forestTypes/binnedTree/processingNodeBin.h#L99-L113 are you sampling the feature indices without replacement? It looks like rndFeature = randNum->gen(fpSingleton::getSingleton().returnNumFeatures()); generates a random feature index, but can it produce a duplicate?

For example, say you have data with 4 columns, then maybe SPORF will sample a projection of:

indices = [0, 2, 0]
weights = [1, -1, 1]

Note that this is no longer a sparse linear combination with only +/- 1 weights: because index 0 appears twice, the effective weights are +2 on feature 0 and -1 on feature 2. Or is this function guaranteed not to produce duplicates when it samples the projection matrix?
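
To make the concern concrete, here is a minimal sketch in plain Python (not the SPORF code; `dense_projection` is a hypothetical helper written only for this illustration). It sums the weights per feature index, which is what a linear combination effectively does when an index is repeated:

```python
def dense_projection(indices, weights, n_features):
    """Sum the weights per feature index to get the effective projection vector.

    Hypothetical helper for illustration only; not part of SPORF.
    """
    dense = [0] * n_features
    for idx, w in zip(indices, weights):
        dense[idx] += w  # a repeated index accumulates instead of overwriting
    return dense


indices = [0, 2, 0]
weights = [1, -1, 1]
projection = dense_projection(indices, weights, n_features=4)
print(projection)  # [2, 0, -1, 0]
```

With the duplicate index, feature 0 ends up with weight 2 rather than 1, so the projection is no longer restricted to entries in {-1, 0, +1}.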

MrAE commented 2 years ago

Hey @adam2392, I worked mainly in the R part of things although I do remember having a similar issue with this chunk (lots of whiteboarding). I never did figure out if this block was sampling in accordance with the SPORF paper -- and given your example, I'd say it's not.

Going from memory, the indices should in that case be sampled without replacement.
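
For reference, a rough sketch of what sampling without replacement could look like, written in plain Python rather than the C++ in the repo. The function name `sample_sparse_projection` and its parameters are made up for this example, and drawing each weight uniformly from +1/-1 is an assumption, not something confirmed against the SPORF source:

```python
import random


def sample_sparse_projection(n_features, n_nonzeros, rng):
    """Draw distinct feature indices and a +1/-1 weight for each.

    Hypothetical helper for illustration; assumes n_nonzeros <= n_features.
    """
    # random.sample draws without replacement, so no index can repeat.
    indices = rng.sample(range(n_features), n_nonzeros)
    # Assumption: each weight is +1 or -1 with equal probability.
    weights = [rng.choice((1, -1)) for _ in indices]
    return indices, weights


rng = random.Random(0)  # seeded only to make the sketch reproducible
indices, weights = sample_sparse_projection(n_features=4, n_nonzeros=3, rng=rng)
print(indices, weights)
```

Because every index appears at most once, each feature's effective weight stays in {-1, 0, +1}. Drawing each index independently would instead allow repeats like the [0, 2, 0] example.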

I did tinker around in the C++ code, but the base functions came from James. I know James had some code in his own repo, which may have some tests in it 🤷🏼‍♂️ -- he'd be the one with the most knowledge about how it works.