meszlili96 / NaturalComputing

Tune bandwidth parameter of KernelDensity #11

Closed EvgeniyaMartynova closed 4 years ago

EvgeniyaMartynova commented 4 years ago

The KDE for a generator sample is currently obtained with the KernelDensity class from sklearn.neighbors (see save_kde in utils.py). The bandwidth parameter is essential for an accurate distribution estimate. I used the standard deviation of the Gaussians in the mixture as the bandwidth and got quite good results. Apparently, though, the optimal value should be selected with cross-validation. That took too long to run, so I did not finish it.
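For reference, a minimal sketch of the current approach, with the bandwidth set to the mixture stdev. The `sample_ring_mixture` helper and its ring layout (8 modes on a unit circle) are assumptions for illustration; this is not the actual code from utils.py.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def sample_ring_mixture(n_samples, n_modes=8, radius=1.0, stdev=0.05, seed=0):
    """Draw 2D points from Gaussians placed evenly on a circle (assumed layout)."""
    rng = np.random.default_rng(seed)
    # Pick a mode for every sample, then convert the mode index to an angle.
    angles = 2 * np.pi * rng.integers(0, n_modes, size=n_samples) / n_modes
    centers = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    return centers + rng.normal(scale=stdev, size=(n_samples, 2))


stdev = 0.05
samples = sample_ring_mixture(2000, stdev=stdev)

# Use the stdev of the mixture components as the KDE bandwidth.
kde = KernelDensity(kernel="gaussian", bandwidth=stdev).fit(samples)

# score_samples returns the log-density at each query point.
log_density = kde.score_samples(samples[:5])
```

Note that `score_samples` returns log-densities, so they need `np.exp` before being used as plain density values.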

EvgeniyaMartynova commented 4 years ago

We probably should not worry about this too much: the estimate is good when the standard deviation is used as the bandwidth.

EvgeniyaMartynova commented 4 years ago

Although the generator's distribution will be different at each step, we won't determine the bandwidth with cross-validation each time. Instead, we determine it for the target distributions and use the best target-distribution bandwidth for the generator KDE. For each true distribution, 5000 samples were used.

The results of cross-validation: For 8 Gaussians and 25 Gaussians with stdev 0.05, we ran CV on 41 linearly spaced values in the range [0, 0.5], i.e. [0, 0.0125, 0.025, ...]. For both, a bandwidth of 0.025 was chosen.

For 8 Gaussians with stdev 0.02, we ran CV on 41 linearly spaced values in the range [0, 0.4], i.e. [0, 0.01, 0.02, ...]. A bandwidth of 0.01 was chosen.

For 8 Gaussians with stdev 0.2, we ran CV on 21 linearly spaced values in the range [0, 1], i.e. [0, 0.05, 0.1, ...]. A bandwidth of 0.1 was chosen.

Interestingly, in all cases the optimal bandwidth equals stdev/2.
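The search described above can be sketched roughly as follows, using GridSearchCV from sklearn.model_selection. The 5-fold split, the ring layout of the mixture, and the `select_bandwidth` helper are assumptions rather than the exact setup that produced the numbers above. The sketch also drops the 0 endpoint from the grid, since KernelDensity requires a positive bandwidth, and uses 1000 samples instead of 5000 to keep the run short.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def select_bandwidth(samples, candidates, cv=5):
    """Return the candidate bandwidth with the best held-out log-likelihood."""
    # GridSearchCV scores each candidate with KernelDensity.score, i.e. the
    # total log-likelihood of the held-out fold under the fitted KDE.
    search = GridSearchCV(
        KernelDensity(kernel="gaussian"), {"bandwidth": candidates}, cv=cv
    )
    search.fit(samples)
    return search.best_params_["bandwidth"]


# Assumed target: 8 Gaussians with stdev 0.05 on a unit circle.
rng = np.random.default_rng(0)
stdev = 0.05
angles = 2 * np.pi * rng.integers(0, 8, size=1000) / 8
samples = np.column_stack([np.cos(angles), np.sin(angles)])
samples = samples + rng.normal(scale=stdev, size=samples.shape)

# 41 linearly spaced values in [0, 0.5]; the 0 endpoint is skipped (assumption).
candidates = np.linspace(0, 0.5, 41)[1:]
best_bandwidth = select_bandwidth(samples, candidates)
print(best_bandwidth)
```

The selected value depends on the sample size and the random seed, so it may not match the bandwidths reported above exactly.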

EvgeniyaMartynova commented 4 years ago

I found that with a bandwidth of half the standard deviation, the distribution is hardly visible on the KDE plots, because the stdev is very small, especially for 8 Gaussians. I tried a few options, and a bandwidth of 3 * standard deviation looks good, so we will use it for plotting. For the data log-likelihood calculation, the optimal bandwidth will be used.
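A minimal sketch of using the two bandwidths side by side. The sample arrays here are stand-ins (a single Gaussian blob), and averaging the per-point log-density is one possible way to report the data log-likelihood; both are assumptions, not the project's actual code.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

stdev = 0.05
rng = np.random.default_rng(0)
# Stand-in for generator samples (assumption: one 2D Gaussian blob).
generated = rng.normal(loc=0.0, scale=stdev, size=(1000, 2))
# Stand-in for the data whose log-likelihood is evaluated.
data = rng.normal(loc=0.0, scale=stdev, size=(200, 2))

# Optimal bandwidth (stdev / 2) for the log-likelihood calculation.
kde_eval = KernelDensity(kernel="gaussian", bandwidth=stdev / 2).fit(generated)
data_log_likelihood = kde_eval.score_samples(data).mean()

# Wider bandwidth (3 * stdev) so the density is visible on plots.
kde_plot = KernelDensity(kernel="gaussian", bandwidth=3 * stdev).fit(generated)
xs = np.linspace(-0.5, 0.5, 50)
xx, yy = np.meshgrid(xs, xs)
grid = np.column_stack([xx.ravel(), yy.ravel()])
# Convert log-density to density and reshape to the grid for plotting.
density = np.exp(kde_plot.score_samples(grid)).reshape(xx.shape)
```

The `density` array can then be passed to a plotting call such as matplotlib's `contourf(xx, yy, density)`.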