Anisotropic KDEs via PCA

tommyod commented 5 years ago

Currently, the bandwidth in d dimensions is bw * np.eye(d)---the covariance matrix is a multiple of the identity. As a result, the KDE works best if anisotropic data is shifted, rotatated and scaled before sent into the algorithm.

Having a routine for this built into the library would be nice.

@blasern suggested implementing anisotropic KDE. We agree that the following might work:

Shift data to have mean zero. Call this transformation f.
Apply the whitening transformation using the PCA / SVD. Call this g.
Fit a KDE.

Transform back using f^{-1}(g^{-1}(D)), make sure to scale the volume to preserve an integral of unity (determinant of the transformation).

            KDE                       
 Data  -------------> KDE estimate    

    |                  |              
g o f |                  |   f^-1 o g^-1
    |                  |              
    v                  v              
       ------------->                 
Transformed    KDE      KDE estimate     
Data

The above would implement very general KDEs in arbitrary dimensions.

blasern commented 5 years ago

After applying the whitening transformation to the grid points, the grid is no longer axis aligned. I haven't checked linbin_Ndim carefully, but I could imagine implementations where that becomes problematic. Can linbin_Ndim handle non-axis-aligned grid points?

tommyod commented 5 years ago

@blasern : linbin_Ndim takes arbitrary data as input, and returns an axis-aligned grid with densities as output. So the answer is no.

However, I believe that it does not matter, since the whitening transform is performed before a grid is indued. An axis-aligned grid is induced on the rotated space y, when the KDE is computed. The result in the original space x is a non-axis aligned grid, along with a density estimate f(x). From here, there are two options: (a) plot the data if matplotlib can handle plotting densities on a non-axis aligned grid (I have not checked), or (b) re-use the linbin_Ndim routine to obtain an axis-aligned grid in x-space.

Here's a more verbose diagram, which commutes (up to approximations of data, etc):

    (1)         KDE in x-space         (4)                
    Original      ------------>  KDE estimate f(x)        
    data x                       on grid which is non-axis
                                 aligned in x-space       
        |                               ^                 
 Shift  |                               | Unwhiten        
 Whiten |                               | Unshift         
        v                               |                 
    Transformed                  KDE estimate f(y)        
    data y        ------------>  on grid which is         
    (2)         KDE in y-space   axis aligned in y-space  
                                       (3)

(1) We have raw data, and no grid. (2) The data is transformed by affine transformations. (3) The KDE induces an axis-aligned grid in y-space, along with an estimate in y-space. (4) The grid and estimate are transformed back to x-space, where the grid is not axis aligned. (5) Two options: plot directly if possible, OR call linbin_Ndim to bin the estimate f(x) in linear time to an axis-aligned grid in x-space, and then plot.

Thoughts on this?

PS. The diagrams are made using Textik.

blasern commented 5 years ago

So I guess it depends on the grid_points argument to the evaluate method. According to the documentation, this can be very different things. But in general it seems to create an axis-aligned grid from the original data. I definitely would want to be able to choose the grid myself.

If you use the second call to linbin_Ndim to transform a non-axis aligned grid to an axis-aligned grid, it would be good if we had some estimate of the error this induces. Not yet sure how to do this. Also not sure how this may depend on the chosen grid.

From my point of view, the purpose of a density estimator is to get density estimates in points of my choosing (not necessarily an axis aligned grid). Plotting these density estimates (your point 5) is secondary. Therefore I would prefer a solution that does not rely too heavily on the chosen grid.

tommyod commented 5 years ago

Very good points. I don't have all the answers, but some immediate thoughts:

Creating an AnisotropicKDE class where the user may not specify a grid is one possibility. The user could still specify the number of grid points, but the software would choose a grid aligned with the principal axes. Whether the resulting grid is axis-aligned in a final computation could be a user-specified parameter.
I believe research has been done on the error of histogram (constant) binning and linear binning. I don't know if research has been done on error of scaling and rotating data to a new grid - but I would be very surprised if that's not the case.
While NaiveKDE and TreeKDE accept any grid, FFTKDE only works on grids where the data is (1) fully "enclosed" in the grid, and when (2) the grid is axis aligned and equidistant with respect to each dimension. The reason for (2) is because the convolution requires an equidistant grid. The reason for (1) is admittedly a weak implementation. I should fix it - at the cost of some error checking or computation.
Let's call the grid aligned with the principal axes of the data the principal grid, and the final, axis-aligned grid the final grid. Since the whitening transform is performed prior to the specification of the principal grid, we could let the greatest diagonal value / eigenvalue (for the scaling) determine how fine the principal grid should be compared to the final grid. The user specifies a final grid, and based on how the data is stretched to y-space we determine a "good" principal grid.

By the way: it's great to discuss this with you. I see some potential and possibilities which would've eluded me if I was working alone.

blasern commented 5 years ago

I have a feeling that we are doing this wrong. Do we have to rotate and rescale the data? It may be easier to rotate and rescale the kernels. When initializing a kernel we already provide a variance, why not a rotation?

blasern commented 5 years ago

Here's an illustration of what I mean:

#imports
import numpy as np
from KDEpy import kernel_funcs
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
%matplotlib inline

# axis aligned grid
n = 64
p = np.linspace(-3, 3, num=n)
obs_x_dims = np.array(np.meshgrid(p, p)).T.reshape(-1, 2)

# rotation 
theta = np.pi/4
rotation_matrix = np.array([[np.cos(theta), -np.sin(theta)], 
                            [np.sin(theta), np.cos(theta)]])
obs_x_rotated = np.dot(obs_x_dims, rotation_matrix)
# scaling in one direction
obs_x_scaled = obs_x_rotated
obs_x_scaled[:, 0] *= 2

# calculate kernel
z = kernel_funcs.gaussian(obs_x_scaled, norm=2).reshape((n, n))
# plot back on regular meshgrid
plt.contour(*np.meshgrid(p, p), z)

tommyod commented 5 years ago

@blasern. Sorry for a very late reply.

Here's what I am thinking. Starting from randomly generated data.


#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Dec 12 18:45:28 2018

@author: tommy
"""

import numpy as np
from numpy import linalg as la
import matplotlib.pyplot as plt
from KDEpy import FFTKDE

# --------- GENERATE DATA ---------
np.random.seed(123)
num_obs, num_dims = 40, 2
X = np.random.randn(num_obs, num_dims) + np.array([0.5, -0.8])

# Scale and rotate the data to make it more interesting
theta = np.pi/4
rotation = np.array([[np.cos(theta), np.sin(theta)], 
                     [-np.sin(theta), np.cos(theta)]])
scaling = np.diag([3, 1])
X = X @ scaling @ rotation

# --------- COMPUTE THE SVD ---------
plt.figure(figsize=(5, 5))  # Prepare a figure
plt.scatter(X[:, 0], X[:, 1], label='Original data')

# we now perform singular value decomposition of X
# see https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
U, s, Vt = la.svd(X, full_matrices=False)
V = Vt.T

# Rotate the data to principal component space
rotated_X = X @ V @ np.diag(1 / s) * np.sqrt(num_obs)
plt.scatter(rotated_X[:, 0], rotated_X[:, 1], color='red', s=3, label='Rotated')

# Rotate back, check that it works
rotated_back = rotated_X / np.sqrt(num_obs) @ np.diag(s)  @ V.T 
plt.scatter(rotated_back[:, 0], rotated_back[:, 1], color='black',
            s=3, label='Rotated back')

plt.xlim([-5, 5]); plt.ylim([-5, 5]); plt.legend(); plt.show()

# --------- COMPUTE KERNEL DENSITY ESTIMATION ON ROTATED DATA ---------
grid_points = 2**7  # Grid points in each dimension
N, bw =10, 0.2  # Bandwidth in rotated space

# Compute the kernel density estimate
kde = FFTKDE(kernel='gaussian', norm=2, bw=bw)
grid, points = kde.fit(rotated_X).evaluate(grid_points)

# Prepare for plotting
fig, ax = plt.subplots(figsize=(5,5))
x, y = np.unique(grid[:, 0]), np.unique(grid[:, 1])
z = points.reshape(grid_points, grid_points).T

# Plot the kernel density estimate
ax.contour(x, y, z, N, linewidths=0.8, colors='k')
ax.contourf(x, y, z, N, cmap="RdBu_r")
ax.plot(rotated_X[:, 0].T, rotated_X[:, 1].T, 'ok', ms=3)

plt.xlim([-5, 5]); plt.ylim([-5, 5]) ;plt.show()

# --------- ROTATE THE GRID BACK ---------
# After rotation, the grid will not be alix-aligned
grid_rot = grid / np.sqrt(num_obs) @ np.linalg.inv(V @ np.diag(1/s))

# --------- RESAMPLE THE GRID ---------
# We pretend the data is the rotated grid, and the f(x, y) values are weights
# This is a re-sampling of the KDE onto an axis-aligned grid, and is needed
# since matplotlib requires axis-aligned grid for plotting.
# (The problem of plotting on an arbitrary grid is similar to density estimation)
kde = FFTKDE(kernel='gaussian', norm=2, bw=bw)
grid, points = kde.fit(grid_rot, weights=points).evaluate(grid_points)

# Create another figure
fig, ax = plt.subplots(figsize=(5, 5))
x, y = np.unique(grid[:, 0]), np.unique(grid[:, 1])
z = points.reshape(grid_points, grid_points).T

# Plot the kernel density estimate and the original data, for checking correctness
ax.contour(x, y, z, N, linewidths=0.8, colors='k')
ax.contourf(x, y, z, N, cmap="RdBu_r")
ax.plot(X[:, 0].T, X[:, 1].T, 'ok', ms=3)

plt.xlim([-5, 5]); plt.ylim([-5, 5]); plt.show()

blasern commented 5 years ago

I approve of applying KDE on the output of a KDE. But I guess there are still some issues that need to be addressed:

Should you use the same kernel allow different ones for the two steps?
How should the two bandwidths be calculated? Should they be related to one another?
Does this solve the problem of getting an anisotropic KDE? Or do we still have the same problem at the second KDE that is used for rotation?

So far I still have the feeling that rotating and rescaling the kernels may be the easier solution.

tommyod commented 5 years ago

@blasern These are valid issues. I have some thoughts, but it's easier to explain with a blackboard. I suggest we discuss this after Christmas :)

blasern commented 5 years ago

Sounds good. Happy holidays!

Kaiyangshi-Ito commented 1 year ago

@blasern. Sorry for a very late reply.

Here's what I am thinking. Starting from randomly generated data.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Dec 12 18:45:28 2018

@author: tommy
"""

import numpy as np
from numpy import linalg as la
import matplotlib.pyplot as plt
from KDEpy import FFTKDE

# --------- GENERATE DATA ---------
np.random.seed(123)
num_obs, num_dims = 40, 2
X = np.random.randn(num_obs, num_dims) + np.array([0.5, -0.8])

# Scale and rotate the data to make it more interesting
theta = np.pi/4
rotation = np.array([[np.cos(theta), np.sin(theta)], 
                     [-np.sin(theta), np.cos(theta)]])
scaling = np.diag([3, 1])
X = X @ scaling @ rotation

# --------- COMPUTE THE SVD ---------
plt.figure(figsize=(5, 5))  # Prepare a figure
plt.scatter(X[:, 0], X[:, 1], label='Original data')

# we now perform singular value decomposition of X
# see https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca
U, s, Vt = la.svd(X, full_matrices=False)
V = Vt.T

# Rotate the data to principal component space
rotated_X = X @ V @ np.diag(1 / s) * np.sqrt(num_obs)
plt.scatter(rotated_X[:, 0], rotated_X[:, 1], color='red', s=3, label='Rotated')

# Rotate back, check that it works
rotated_back = rotated_X / np.sqrt(num_obs) @ np.diag(s)  @ V.T 
plt.scatter(rotated_back[:, 0], rotated_back[:, 1], color='black',
            s=3, label='Rotated back')

plt.xlim([-5, 5]); plt.ylim([-5, 5]); plt.legend(); plt.show()

# --------- COMPUTE KERNEL DENSITY ESTIMATION ON ROTATED DATA ---------
grid_points = 2**7  # Grid points in each dimension
N, bw =10, 0.2  # Bandwidth in rotated space

# Compute the kernel density estimate
kde = FFTKDE(kernel='gaussian', norm=2, bw=bw)
grid, points = kde.fit(rotated_X).evaluate(grid_points)

# Prepare for plotting
fig, ax = plt.subplots(figsize=(5,5))
x, y = np.unique(grid[:, 0]), np.unique(grid[:, 1])
z = points.reshape(grid_points, grid_points).T

# Plot the kernel density estimate
ax.contour(x, y, z, N, linewidths=0.8, colors='k')
ax.contourf(x, y, z, N, cmap="RdBu_r")
ax.plot(rotated_X[:, 0].T, rotated_X[:, 1].T, 'ok', ms=3)

plt.xlim([-5, 5]); plt.ylim([-5, 5]) ;plt.show()

# --------- ROTATE THE GRID BACK ---------
# After rotation, the grid will not be alix-aligned
grid_rot = grid / np.sqrt(num_obs) @ np.linalg.inv(V @ np.diag(1/s))

# --------- RESAMPLE THE GRID ---------
# We pretend the data is the rotated grid, and the f(x, y) values are weights
# This is a re-sampling of the KDE onto an axis-aligned grid, and is needed
# since matplotlib requires axis-aligned grid for plotting.
# (The problem of plotting on an arbitrary grid is similar to density estimation)
kde = FFTKDE(kernel='gaussian', norm=2, bw=bw)
grid, points = kde.fit(grid_rot, weights=points).evaluate(grid_points)

# Create another figure
fig, ax = plt.subplots(figsize=(5, 5))
x, y = np.unique(grid[:, 0]), np.unique(grid[:, 1])
z = points.reshape(grid_points, grid_points).T

# Plot the kernel density estimate and the original data, for checking correctness
ax.contour(x, y, z, N, linewidths=0.8, colors='k')
ax.contourf(x, y, z, N, cmap="RdBu_r")
ax.plot(X[:, 0].T, X[:, 1].T, 'ok', ms=3)

plt.xlim([-5, 5]); plt.ylim([-5, 5]); plt.show()

Alternatively, one can just evaluate the initial KDE on a non-equidistant grid -- a grid being transformed using the same PCA transformation applied on the data. Then a much more dense grid will help a lot.

The authentic way to do this is, however, to carry out a bandwidth matrix computation, see https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation#Density_estimation_with_a_full_bandwidth_matrix, they mentioned a "data-driven" approach, which might well be slow and not exactly what we want here...

I am more thinking of carrying out the same approach Botev did, but for 2d data -- the proof shouldn't be all that difficult, considering that we are blessed with the fact that all norms are equivalent in finite-dimensional space.

tommyod / KDEpy

Anisotropic KDEs via PCA #6