outbrain / outrank

A Python library for efficient feature ranking and selection on sparse data sets.
https://dl.acm.org/doi/10.1145/3604915.3610636
BSD 3-Clause "New" or "Revised" License
19 stars, 3 forks

Rare-first sampling of combinations #39

SkBlaz closed 1 year ago

SkBlaz commented 1 year ago

By default, a random subspace was considered in each batch. A more efficient algorithm considers the least-sampled combinations in each batch, which increases overall sampling efficiency: only |F| / k samples (|F| = number of features, k = number of batches) are required to cover all features. The guarantee for uniform sampling is much worse and can be derived from the harmonic series (see image).
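
For reference, a sketch of where the harmonic series comes from, under the simplest uniform setting. The assumptions here are mine and not stated in this PR: one feature is drawn per step, uniformly and with replacement, and T is the number of draws until every feature has been seen at least once. This is the classic coupon collector argument and may differ from the exact derivation in the image.

```latex
\mathbb{E}[T]
  = \sum_{i=1}^{|F|} \frac{|F|}{|F| - i + 1}
  = |F| \sum_{j=1}^{|F|} \frac{1}{j}
  = |F| \, H_{|F|}
  \approx |F| \ln |F|
```

Each term is the expected wait for the i-th new feature: once i - 1 distinct features have been seen, a draw hits an unseen one with probability (|F| - i + 1) / |F|. Uniform sampling therefore needs roughly a ln |F| factor more draws than a scheme that never repeats a feature before all have been covered.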

[image]
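
A minimal sketch of the idea described above, comparing rare-first and uniform sampling by the number of batches needed to touch every item. This is not the code from this PR: the function names (`rare_first_batch`, `uniform_batch`, `batches_to_cover`) are hypothetical, and integer ids stand in for features or feature combinations.

```python
import random


def rare_first_batch(counts, batch_size, rng):
    """Pick the batch_size least-sampled items, breaking ties at random."""
    items = list(counts)
    rng.shuffle(items)  # randomize order so ties are broken at random
    # sort() is stable, so items with equal counts keep their shuffled order
    items.sort(key=lambda item: counts[item])
    chosen = items[:batch_size]
    for item in chosen:
        counts[item] += 1
    return chosen


def uniform_batch(counts, batch_size, rng):
    """Baseline: pick batch_size distinct items uniformly at random."""
    chosen = rng.sample(list(counts), batch_size)
    for item in chosen:
        counts[item] += 1
    return chosen


def batches_to_cover(sampler, num_items, batch_size, seed=0):
    """Count batches until every item has been sampled at least once."""
    rng = random.Random(seed)
    counts = {item: 0 for item in range(num_items)}
    batches = 0
    while min(counts.values()) == 0:
        sampler(counts, batch_size, rng)
        batches += 1
    return batches


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    print("rare-first:", batches_to_cover(rare_first_batch, 100, 10))
    print("uniform:   ", batches_to_cover(uniform_batch, 100, 10))
```

With 100 items and batches of 10, the rare-first sampler covers everything in exactly 10 batches, because every batch is filled with not-yet-seen items while any remain. The uniform baseline can never need fewer batches and typically needs considerably more.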