Open adam2392 opened 2 years ago
Hey @adam2392, I worked mainly on the R side of things, although I do remember having a similar issue with this chunk (lots of whiteboarding). I never did figure out whether this block samples in accordance with the SPORF paper -- and given your example, I'd say it doesn't.
In that case the indices should be sampled without replacement -- going from memory.
I did tinker around in the C++ code, but the base functions came from James. I know James had some code in his own repo, which may have some tests in it 🤷🏼‍♂️ -- he'd be the one with the most knowledge about how it works.
Hi @MrAE, @jbrowne6 and @falkben
Just pinging the ppl that seemed to touch these specific LOC.
I know you guys don't maintain this code anymore and have moved on, but I had a quick question about what a specific line is doing. I was wondering if you could provide a quick answer (if you happened to write this part) to make sure I'm interpreting it correctly. FYI: I have ported the code to Cython, and once this issue is resolved, I think we can safely move on :)
In https://github.com/neurodata/SPORF/blob/a7a3c7e6df457b722de86d7254f8a7724b27978f/packedForest/src/forestTypes/binnedTree/processingNodeBin.h#L99-L113 are you sampling the feature index without replacement? It looks like
rndFeature = randNum->gen(fpSingleton::getSingleton().returnNumFeatures());
can generate a random feature index, but is it possible to have a duplicate? For example, say you have data with 4 columns, then maybe SPORF will sample a projection of:
Note that this in turn isn't a sparse linear combination with only +/- 1's, but now has a +2, -1 weight when doing the linear combination. Or is this function guaranteed to not have duplicates in its sampling of the projection matrix?
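To make the concern concrete, here is a minimal Python sketch of the two sampling schemes. It is not the SPORF implementation: `sample_projection` and `has_non_unit_weight` are hypothetical names, and the sketch assumes that each nonzero entry is drawn as one feature index plus a random +/-1 weight, with repeated indices accumulating into the same feature.

```python
import random
from collections import Counter


def sample_projection(n_features, n_nonzeros, rng, replace):
    """Return {feature_index: weight} for one hypothetical sparse projection.

    Assumption (not taken from the SPORF source): each nonzero entry is a
    feature index paired with a random +/-1 weight, and repeated indices
    add their weights together.
    """
    if replace:
        # Independent draws, so the same index can come up more than once.
        indices = [rng.randrange(n_features) for _ in range(n_nonzeros)]
    else:
        # Draws without replacement, so every index is distinct.
        indices = rng.sample(range(n_features), n_nonzeros)

    weights = Counter()
    for idx in indices:
        weights[idx] += rng.choice((-1, 1))
    return dict(weights)


def has_non_unit_weight(projection):
    """True if any feature ended up with a weight other than +1 or -1."""
    return any(abs(w) != 1 for w in projection.values())


if __name__ == "__main__":
    rng = random.Random(0)
    with_repl = [sample_projection(4, 3, rng, replace=True) for _ in range(1000)]
    without_repl = [sample_projection(4, 3, rng, replace=False) for _ in range(1000)]
    print("with replacement, non-unit weights seen:",
          sum(has_non_unit_weight(p) for p in with_repl))
    print("without replacement, non-unit weights seen:",
          sum(has_non_unit_weight(p) for p in without_repl))
```

Under these assumptions, sampling without replacement always yields weights of exactly +/-1, while sampling with replacement can yield +/-2 (same sign drawn twice) or 0 (opposite signs cancel) for a repeated feature. Whether the C++ loop behaves like the `replace=True` branch is exactly what I'd like confirmed.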