Closed EvgeniyaMartynova closed 4 years ago
We probably should not worry about it too much: the estimate is already good when the standard deviation of the Gaussians is used as the bandwidth.
Although the generator distribution will be different at each step, we won't determine the bandwidth with cross-validation each time. Instead, we determine it for the target distributions and use the best target-distribution bandwidth for the generator KDE. For each true distribution, 5000 samples were used.
The results of cross-validation: For 8 Gaussians and 25 Gaussians with 0.05 stdev we ran CV on 41 linearly spaced values in the range [0, 0.5], so [0, 0.0125, 0.025, ...]. For both, a bandwidth of 0.025 was chosen.
For 8 Gaussians with 0.02 stdev we ran CV on 41 linearly spaced values in the range [0, 0.4], so [0, 0.01, 0.02, ...]. A bandwidth of 0.01 was chosen.
For 8 Gaussians with 0.2 stdev we ran CV on 21 linearly spaced values in the range [0, 1], so [0, 0.05, 0.1, ...]. A bandwidth of 0.1 was chosen.
Interestingly, in all cases the optimal bandwidth equals stdev/2.
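For reference, a bandwidth search of this kind could be sketched roughly as below. This is not the code that was actually run: the ring layout and radius, the 5-fold split, the helper names `sample_ring_of_gaussians` and `select_bandwidth`, and the smaller sample size (1000 instead of 5000, to keep it quick) are all assumptions. The zero bandwidth is skipped because `KernelDensity` needs a positive value.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity


def sample_ring_of_gaussians(n_samples, n_modes=8, radius=2.0, stdev=0.05, seed=0):
    """Draw 2-D points from Gaussians placed evenly on a circle (assumed layout)."""
    rng = np.random.default_rng(seed)
    angles = 2 * np.pi * rng.integers(n_modes, size=n_samples) / n_modes
    centers = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    return centers + rng.normal(scale=stdev, size=(n_samples, 2))


def select_bandwidth(samples, candidates, n_folds=5):
    """Return the candidate with the highest mean held-out log-likelihood."""
    # With no explicit scoring, GridSearchCV uses KernelDensity.score,
    # which is the total log-likelihood of the held-out fold.
    search = GridSearchCV(
        KernelDensity(kernel="gaussian"),
        {"bandwidth": candidates},
        cv=n_folds,
    )
    search.fit(samples)
    return search.best_params_["bandwidth"]


# 41 linearly spaced values in [0, 0.5]; the first one (0) is dropped.
candidates = np.linspace(0.0, 0.5, 41)[1:]
samples = sample_ring_of_gaussians(n_samples=1000, stdev=0.05)
best = select_bandwidth(samples, candidates)
print(f"selected bandwidth: {best:.4f}")
```

The selected value depends on the sample size and the random seed, so it will not necessarily reproduce the numbers reported here.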
I found that with a bandwidth of half the standard deviation, the distribution is hardly visible on the KDE plots, because stdev is very small, especially for 8 Gaussians. I tried a few options and a bandwidth of 3 * standard deviation looks good, so we will use it for plotting. For the data log-likelihood calculation the optimal bandwidth will be used.
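To illustrate why the wider bandwidth is easier to see, here is a small sketch that evaluates the KDE on a grid with both bandwidths. The single-mode sample, the grid limits and the helper `density_on_grid` are made up for the illustration and are not taken from the repository.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

stdev = 0.05
rng = np.random.default_rng(0)
# Hypothetical stand-in for one mode of the mixture.
sample = rng.normal(loc=[0.0, 0.0], scale=stdev, size=(1000, 2))


def density_on_grid(points, bandwidth, lim=0.5, resolution=100):
    """Fit a Gaussian KDE and evaluate its density on a square grid."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(points)
    axis = np.linspace(-lim, lim, resolution)
    xs, ys = np.meshgrid(axis, axis)
    grid = np.column_stack([xs.ravel(), ys.ravel()])
    # score_samples returns log-density, so exponentiate for plotting.
    return np.exp(kde.score_samples(grid)).reshape(resolution, resolution)


sharp = density_on_grid(sample, bandwidth=stdev / 2)   # CV-optimal value
smooth = density_on_grid(sample, bandwidth=3 * stdev)  # value used for plots

# Fraction of grid cells whose density exceeds 1% of the peak.
visible_sharp = np.mean(sharp > 0.01 * sharp.max())
visible_smooth = np.mean(smooth > 0.01 * smooth.max())
print(visible_sharp, visible_smooth)
```

The narrow bandwidth gives a taller but much smaller blob, which is what makes it hard to spot on a plot that covers the whole mixture.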
Now the KDE for a generator sample is obtained with the `KernelDensity` class from `sklearn.neighbors`. File `utils.py`, `save_kde`. The `bandwidth` parameter is essential for an accurate distribution estimate. I used the standard deviation of the Gaussians in the mixture and got quite good results. But, apparently, an optimal value should be selected with cross-validation. It took too long and I did not finish it.
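A minimal sketch of this kind of estimate, assuming a Gaussian kernel and using random points as a stand-in for the generator output (this is not the actual `save_kde` code, and the variable names are made up):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

stdev = 0.05  # standard deviation of the Gaussians in the mixture
rng = np.random.default_rng(0)

# Hypothetical stand-ins for a generator sample and held-out real data.
generated = rng.normal(loc=[1.0, 0.0], scale=stdev, size=(500, 2))
real = rng.normal(loc=[1.0, 0.0], scale=stdev, size=(200, 2))

# Fit the KDE on the generator sample, with stdev as the bandwidth.
kde = KernelDensity(kernel="gaussian", bandwidth=stdev).fit(generated)

# score_samples returns the log-density at each query point.
log_density = kde.score_samples(real)
mean_log_likelihood = log_density.mean()
print(f"mean log-likelihood of real data: {mean_log_likelihood:.3f}")
```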