Add a utility function `get_random_sequence_subset`

Problem Description

Subsetting single- and multi-table data is easy with existing functions such as get_random_subset.

But subsetting sequential data is not as easy. Because different rows can belong together (within the same sequence) and have an order, it's not possible to simply select random rows. For such data, it would be helpful to have a utility function to perform the subsetting.

Expected behavior

Add a function to utils called get_random_sequence_subset to be used by sequential data.

Parameters:

(required) data: A pandas DataFrame with the sequential data
(required) metadata: A SingleTableMetadata object describing the data
(required) num_sequences: The number of sequences to subsample
max_sequence_length: The maximum length each subsampled sequence is allowed to be
- (default) None: Do not enforce any max length, meaning that entire sequences will be sampled
- int: All subsampled sequences must be <= the provided length
long_sequence_subsampling_method: The method to use when a selected sequence is too long
- (default) first_rows: Keep the first n rows of the sequence, where n is the max sequence length
- last_rows: Keep the last n rows of the sequence, where n is the max sequence length
- random: Randomly choose n rows to keep within the sequence. It is important to keep the randomly chosen rows in the same order as they appear in the original data.
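
To make the three long_sequence_subsampling_method options concrete, here is a minimal sketch of how a single sequence might be shortened. The helper name truncate_sequence and its rng argument are hypothetical and are not part of SDV; the sketch assumes the sequence is passed in as a DataFrame whose rows are already in order.

```python
import numpy as np
import pandas as pd


def truncate_sequence(sequence, max_sequence_length, method='first_rows', rng=None):
    """Shorten one sequence (hypothetical helper, not part of SDV)."""
    # None means no limit; short sequences are returned untouched.
    if max_sequence_length is None or len(sequence) <= max_sequence_length:
        return sequence

    if method == 'first_rows':
        return sequence.iloc[:max_sequence_length]

    if method == 'last_rows':
        # Slice from an explicit start position so a length of 0 yields no rows.
        return sequence.iloc[len(sequence) - max_sequence_length:]

    if method == 'random':
        if rng is None:
            rng = np.random.default_rng()
        # Pick row positions without replacement, then sort them so the
        # kept rows stay in the same order as in the original data.
        positions = rng.choice(len(sequence), size=max_sequence_length, replace=False)
        return sequence.iloc[np.sort(positions)]

    raise ValueError(f"Unknown long_sequence_subsampling_method: '{method}'")
```

Sorting the sampled positions is what keeps the randomly chosen rows in their original order.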

from sdv.utils import get_random_sequence_subset

data_subset = get_random_sequence_subset(data, metadata,
  num_sequences=100, 
  max_sequence_length=1000,
  long_sequence_subsampling_method='last_rows')

The function would do the following:

Randomly select sequences according to the num_sequences parameter. (Note that the sequence_key is used to determine which rows belong to which sequence.)
For each selected sequence, ensure that the length is <= max_sequence_length. If a sequence is longer, use the long_sequence_subsampling_method to shorten it.

Return the shortened pandas DataFrame with the subsampled data. Ensure that the index of the DataFrame has been reset.
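
The steps above could be sketched roughly as follows. This is one possible approach under stated assumptions, not the final SDV implementation: it assumes metadata.to_dict() exposes sequence_key (as in the starting-point code), it caps num_sequences at the number of available sequences (behavior the issue does not specify), and it uses pandas groupby operations for the truncation.

```python
import numpy as np
import pandas as pd


def get_random_sequence_subset(data, metadata, num_sequences, max_sequence_length=None,
                               long_sequence_subsampling_method='first_rows'):
    # Without a sequence key, the data is not multi-sequence data.
    sequence_key = metadata.to_dict().get('sequence_key')
    if sequence_key is None:
        raise ValueError('The metadata must contain a sequence_key.')

    method = long_sequence_subsampling_method
    if method not in ('first_rows', 'last_rows', 'random'):
        raise ValueError(f"Unknown long_sequence_subsampling_method: '{method}'")

    # Step 1: pick whole sequences, without replacement.
    unique_sequences = data[sequence_key].unique()
    # Assumption: never request more sequences than exist.
    size = min(num_sequences, len(unique_sequences))
    selected = np.random.choice(unique_sequences, size=size, replace=False)
    subset = data[data[sequence_key].isin(selected)]

    # Step 2: shorten any sequence longer than the limit.
    if max_sequence_length is not None:
        if method == 'first_rows':
            subset = subset.groupby(sequence_key, sort=False).head(max_sequence_length)
        elif method == 'last_rows':
            subset = subset.groupby(sequence_key, sort=False).tail(max_sequence_length)
        else:
            # Give every row a random score, rank the scores within each
            # sequence, and keep the lowest-ranked rows. The boolean mask
            # leaves the kept rows in their original order.
            scores = pd.Series(np.random.random(len(subset)))
            ranks = scores.groupby(subset[sequence_key].to_numpy()).rank(method='first')
            subset = subset[(ranks <= max_sequence_length).to_numpy()]

    # Step 3: return the result with a fresh index.
    return subset.reset_index(drop=True)
```

Both head/tail and the boolean mask preserve the original row order, so no re-sorting is needed before the index is reset.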

Additional context

The metadata must contain a sequence_key; otherwise the data is not multi-sequence data and is not eligible for this type of subsampling. If there is no sequence_key, raise an error.
As a starting point, below is some code we've provided to a user to sample entire sequences. Note that this code does not consider max sequence length at all.

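import numpy as np

def get_random_sequence_subset(data, metadata, num_sequences):
  # The sequence key identifies which rows belong to the same sequence.
  sequence_key = metadata.to_dict()['sequence_key']
  unique_sequences = data[sequence_key].unique()
  # Sample without replacement so that no sequence is picked twice.
  sequence_subset = np.random.choice(unique_sequences, size=num_sequences, replace=False)
  # Keep every row of the selected sequences, in their original order.
  subsetted_data = data[data[sequence_key].isin(sequence_subset)].reset_index(drop=True)
  return subsetted_data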
