Sample from arbitrary numpy array

polyaurn commented 4 years ago

I read all your documentation and followed your Jupyter notebook tutorials, but I can't find an API to use when I already have my own data as an input matrix, instead of generating one from an existing random matrix group.

Say I have a numpy array of dimension m x n, meaning that I have m observations, each characterized by an n-dimensional feature vector. Is it possible/easy to sample from this matrix using DPPy?

guilgautier commented 4 years ago

Hi @polyaurn, thanks for taking the time to play with DPPy!

I'm not sure I understand what you mean by:

> Is it possible/easy to sample from this matrix using DPPy?

Do you want to subsample the rows (data points) or the columns (features) of your feature matrix using a specific DPP? Do you want to sample from a finite DPP built from this feature matrix?

Could you write some pseudocode explaining what you're trying to do with your feature matrix?

polyaurn commented 4 years ago

Hi @guilgautier,

Thanks for your prompt response. I want to subsample the rows (data points) of my feature matrix M, i.e. get a small fraction of data points from M that are very diverse.

But also, is it possible to subsample the columns (features) with DPPy? I'd typically use column subset selection with Nystrom methods for this purpose.

Thanks again!

guilgautier commented 4 years ago

All right,

> I want to subsample the rows (data points) i.e. get a small fraction of data points from M that are very diverse.
>
> But also, is it possible to subsample the columns (features) with DPPy? I'd typically use column subset selection with Nystrom methods for this purpose.

import numpy as np
from dppy.finite_dpps import FiniteDPP

# Set a seed for reproducibility
seed = 123
rng = np.random.RandomState(seed)

m, n = 10, 5  # nb_points, nb_features
X = rng.randn(m, n)  # feature matrix
print('feature matrix {}x{}'.format(m, n),
      X,
      sep='\n')

# The following are simple examples:
# 1. to subsample the rows
L_rows = X.dot(X.T)  # m x m
dpp_rows = FiniteDPP(kernel_type='likelihood', L=L_rows)

sub_rows = dpp_rows.sample_exact(random_state=seed)

print('row subsample {}'.format(sub_rows),
      X[sub_rows, :],
      sep='\n')

# 2. to subsample the columns
L_cols = X.T.dot(X)  # n x n
dpp_cols = FiniteDPP(kernel_type='likelihood', L=L_cols)

sub_cols = dpp_cols.sample_exact(random_state=seed)

print('column subsample {}'.format(sub_cols),
      X[:, sub_cols],
      sep='\n')

feature matrix 10x5
[[-1.0856306   0.99734545  0.2829785  -1.50629471 -0.57860025]
 [ 1.65143654 -2.42667924 -0.42891263  1.26593626 -0.8667404 ]
 [-0.67888615 -0.09470897  1.49138963 -0.638902   -0.44398196]
 [-0.43435128  2.20593008  2.18678609  1.0040539   0.3861864 ]
 [ 0.73736858  1.49073203 -0.93583387  1.17582904 -1.25388067]
 [-0.6377515   0.9071052  -1.4286807  -0.14006872 -0.8617549 ]
 [-0.25561937 -2.79858911 -1.7715331  -0.69987723  0.92746243]
 [-0.17363568  0.00284592  0.68822271 -0.87953634  0.28362732]
 [-0.80536652 -1.72766949 -0.39089979  0.57380586  0.33858905]
 [-0.01183049  2.39236527  0.41291216  0.97873601  2.23814334]]
row subsample [3, 8, 4, 0]
[[-0.43435128  2.20593008  2.18678609  1.0040539   0.3861864 ]
 [-0.80536652 -1.72766949 -0.39089979  0.57380586  0.33858905]
 [ 0.73736858  1.49073203 -0.93583387  1.17582904 -1.25388067]
 [-1.0856306   0.99734545  0.2829785  -1.50629471 -0.57860025]]
column subsample [2, 4, 3, 1]
[[ 0.2829785  -0.57860025 -1.50629471  0.99734545]
 [-0.42891263 -0.8667404   1.26593626 -2.42667924]
 [ 1.49138963 -0.44398196 -0.638902   -0.09470897]
 [ 2.18678609  0.3861864   1.0040539   2.20593008]
 [-0.93583387 -1.25388067  1.17582904  1.49073203]
 [-1.4286807  -0.8617549  -0.14006872  0.9071052 ]
 [-1.7715331   0.92746243 -0.69987723 -2.79858911]
 [ 0.68822271  0.28362732 -0.87953634  0.00284592]
 [-0.39089979  0.33858905  0.57380586 -1.72766949]
 [ 0.41291216  2.23814334  0.97873601  2.39236527]]

Note: you can also subsample exactly k rows/columns using a k-DPP.

The main point is to adapt the choice of the DPP kernel to your task. Here I've used the Gram matrix of the features/data points, but you can of course choose a different kernel. In particular, for column subset selection, have a look at the paper by Belhadji, Bardenet, and Chainais (2018), which gives theoretical results for a specific projection DPP adapted to your data.

Hope this helps.
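
One way to see how the kernel choice affects the result: for a DPP defined by a likelihood kernel L, the expected sample size is the sum of λ / (1 + λ) over the eigenvalues λ of L. The NumPy-only sketch below computes this for the two Gram matrices used above; the helper name `expected_size` and the 0.1 rescaling factor are illustrative choices, not part of DPPy.

```python
import numpy as np

rng = np.random.RandomState(123)
m, n = 10, 5  # nb_points, nb_features
X = rng.randn(m, n)  # feature matrix


def expected_size(L):
    """Expected cardinality of a DPP with likelihood kernel L.

    Hypothetical helper: computes sum_i lambda_i / (1 + lambda_i).
    """
    eig_vals = np.linalg.eigvalsh(L)
    # Clip tiny negative eigenvalues caused by round-off
    eig_vals = np.clip(eig_vals, 0.0, None)
    return np.sum(eig_vals / (1.0 + eig_vals))


L_rows = X.dot(X.T)  # m x m, rank at most n
L_cols = X.T.dot(X)  # n x n

size_rows = expected_size(L_rows)
size_cols = expected_size(L_cols)

# Shrinking the kernel lowers the expected sample size
size_rows_scaled = expected_size(0.1 * L_rows)

print('expected nb of rows:   ', size_rows)
print('expected nb of columns:', size_cols)
print('expected nb of rows with 0.1 * L_rows:', size_rows_scaled)
```

Because `X.dot(X.T)` and `X.T.dot(X)` share the same nonzero eigenvalues, both kernels give the same expected size, and neither can return more than min(m, n) = 5 items. Rescaling the kernel, or switching to a k-DPP, is a simple way to control how many rows or columns come back.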

guilgautier commented 4 years ago

Closing this for now; feel free to reopen it, @polyaurn.

guilgautier / DPPy

Sample from arbitrary numpy array #56