hoxo-m / densratio_py

A Python Package for Density Ratio Estimation
https://github.com/hoxo-m/densratio_py

Unstable results #13

Open immaryw opened 3 years ago

immaryw commented 3 years ago

Hi there,

I'm using the package to calculate density ratios for multi-dimensional data. I run the program many times on the same training and test datasets, but the estimated density ratios are often slightly different. Is there a way to make the results more stable?

Another question concerns how to set the sigma and lambda search ranges. These hyperparameters affect the estimates a lot!

Thanks in advance!!

mierzejk commented 3 years ago

Hi @immaryw,

I am not sure if the repo is still being maintained by the owner: over a year ago I submitted a pull request (#9) and an issue (#10), but there has been no reaction whatsoever. Still, you might be interested in my branch, which can possibly alleviate your concern. The branch lacks an updated README, but its only configurable feature is described thoroughly in #9, and by default it offers a significant performance improvement by applying proper numpy vectorization.

The instability stems from the way kernel centers are randomly picked from the x vector, namely with the numpy.random.randint method, which amounts to a simple variation with repetition (sampling with replacement). Your options are:

  1. Use numpy.random.seed to set the pseudo-random number generator seed to a constant value. I know the function belongs to the legacy API and its use is discouraged in favour of numpy.random.Generator, but this is how densratio_py is implemented.
  2. Go for my branch, where numpy.random.randint has been superseded by numpy.random.choice without replacement, so the picked centers are unique. Furthermore, by applying numpy.percentile, the choice is stratified with respect to the (possibly multivariate) x values. Please refer to the semi_stratified_sample function. This approach should improve result stability.
  3. Set the number of kernels equal to the length of x (or greater, which is effectively the same as equal). In my branch this guarantees that all x values become kernel centers, while in the original repo the with-replacement factor still applies.

Obviously you can combine any two, or even all, of the above points; a minimal sketch of options 1 and 3 together follows.
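For reference, a caller-side sketch of that combination. It assumes the usual `densratio(x, y, ...)` entry point with a `kernel_num` argument and the `compute_density_ratio` method on the result object, as in the package README; the toy data below are only placeholders for your own training and test sets.

```python
import numpy as np
from densratio import densratio

# Toy 2-D stand-ins for your own training (x) and test (y) samples.
data_rng = np.random.RandomState(0)
x = data_rng.multivariate_normal(mean=[1.0, 1.0], cov=np.eye(2), size=300)
y = data_rng.multivariate_normal(mean=[1.0, 1.0], cov=2.0 * np.eye(2), size=300)

# Option 1: pin the global NumPy seed right before the call, so the kernel
# centers drawn internally (via numpy.random.randint in the original repo)
# come out the same on every run.
np.random.seed(42)

# Option 3: request as many kernels as there are rows in x, so every sample
# can serve as a kernel center.
result = densratio(x, y, kernel_num=len(x))

# Evaluate the estimated density ratio at the training points.
density_ratio = result.compute_density_ratio(x)
print(density_ratio[:5])
```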

Best regards, Chris

mierzejk commented 3 years ago

As regards lambda and sigma, right now you simply provide lists of candidate values whose Cartesian product is evaluated, and the pair that yields the smallest error is selected (a grid-search approach). Perhaps an automated machine learning (AutoML) approach could be adopted to expedite the process, but it would definitely require some modification of the core source code to get it working.
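For example, under the same assumed signature, passing explicit candidate lists puts the grid fully under your control; every (sigma, lambda) pair in their Cartesian product is scored and the best one should be reported in the result summary.

```python
import numpy as np
from densratio import densratio

# Toy data again, just to keep the snippet self-contained.
data_rng = np.random.RandomState(0)
x = data_rng.multivariate_normal(mean=[1.0, 1.0], cov=np.eye(2), size=300)
y = data_rng.multivariate_normal(mean=[1.0, 1.0], cov=2.0 * np.eye(2), size=300)

# Explicit candidate grids; widen or refine them around whichever values the
# automatic search has been picking for your data.
sigma_candidates = [0.1, 0.3, 0.5, 1.0, 3.0]
lambda_candidates = [1e-3, 1e-2, 1e-1, 1.0]

result = densratio(x, y,
                   sigma_range=sigma_candidates,
                   lambda_range=lambda_candidates)

# The printed summary reports which sigma and lambda were selected.
print(result)
```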

immaryw commented 3 years ago

@mierzejk I appreciate your awesome work and very helpful reply!! I installed your branch and the results look more stable now 👍