sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Add a utility function `get_random_sequence_subset` #2085

Open npatki opened 1 week ago

npatki commented 1 week ago

Problem Description

Subsetting single- and multi-table data is easy with existing functions such as get_random_subset.

But subsetting sequential data is not as easy. Because rows can belong to the same sequence and have an order within it, it's not possible to simply select random rows. For such data, it would be helpful to have a utility function that performs the subsetting.

Expected behavior

Add a function to utils called get_random_sequence_subset for use with sequential data.

Parameters:

from sdv.utils import get_random_sequence_subset

data_subset = get_random_sequence_subset(data, metadata,
  num_sequences=100, 
  max_sequence_length=1000,
  long_sequence_subsampling_method='last_rows')

The function would do the following:

Return the shortened pandas DataFrame with the subsampled data. Ensure that the index of the DataFrame has been reset.

Additional context

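import numpy as np

def get_random_sequence_subset(data, metadata, num_sequences):
  # Look up the column that identifies which sequence each row belongs to
  sequence_key = metadata.to_dict()['sequence_key']
  unique_sequences = data[sequence_key].unique()
  # Sample without replacement so the same sequence is not picked twice
  sequence_subset = np.random.choice(unique_sequences, size=num_sequences, replace=False)
  # Keep every row of the chosen sequences, in their original order
  subsetted_data = data[data[sequence_key].isin(sequence_subset)].reset_index(drop=True)
  return subsetted_data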
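
The snippet above only handles num_sequences. As a rough sketch of how the other two parameters from the proposed call might fit in, here is a standalone version that works on plain pandas objects. Everything in it is an assumption for illustration, not the SDV implementation: the function name random_sequence_subset_sketch, taking sequence_key directly instead of reading it from metadata, the seed argument, the default values, the 'first_rows' option, and the reading of 'last_rows' as "keep the final max_sequence_length rows of each long sequence".

```python
import numpy as np
import pandas as pd


def random_sequence_subset_sketch(data, sequence_key, num_sequences,
                                  max_sequence_length=None,
                                  long_sequence_subsampling_method='last_rows',
                                  seed=None):
    """Hypothetical sketch of sequence-aware subsetting (not SDV's code)."""
    rng = np.random.default_rng(seed)
    unique_sequences = data[sequence_key].unique()

    # Sample whole sequences without replacement; cap at what is available.
    size = min(num_sequences, len(unique_sequences))
    chosen = rng.choice(unique_sequences, size=size, replace=False)
    subset = data[data[sequence_key].isin(chosen)]

    if max_sequence_length is not None:
        grouped = subset.groupby(sequence_key, sort=False)
        if long_sequence_subsampling_method == 'last_rows':
            # Assumed meaning: keep the final rows of each sequence.
            subset = grouped.tail(max_sequence_length)
        elif long_sequence_subsampling_method == 'first_rows':
            # Assumed counterpart option: keep the leading rows.
            subset = grouped.head(max_sequence_length)
        else:
            raise ValueError(
                f'Unknown method: {long_sequence_subsampling_method!r}')

    # head/tail preserve the original row order, so only the index is reset.
    return subset.reset_index(drop=True)
```

Example usage, with a made-up column name:

```python
toy = pd.DataFrame({
    'seq_id': ['a'] * 3 + ['b'] * 5 + ['c'] * 2,
    'value': range(10),
})
small = random_sequence_subset_sketch(
    toy, 'seq_id', num_sequences=2, max_sequence_length=2, seed=0)
```

Sampling whole sequence keys first and trimming afterwards keeps rows of the same sequence together and in order, which is the property that plain row sampling loses.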

amontanez24 commented 4 days ago

@npatki Should this also be in poc?