sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

In the synthesis of time-series data, keep the `sequence_key` consistent with the original data. #2226

Closed jalr4ever closed 2 weeks ago

jalr4ever commented 2 months ago

Problem Description

I want the sequence_key values in the data simulated by PARSynthesizer to be consistent with the original data. Currently, due to SDV requirements, sequence_key is specified as an ID type, and ID types generate random values, which does not meet my needs.

Expected behavior

from datetime import datetime
from datetime import timedelta

import numpy as np
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.sequential import PARSynthesizer

def mock_data():
    seq_ids = [900001, 900002, 900003, 900004, 900005, 900006, 900007, 900008, 900009]
    start_time = datetime(2023, 10, 28, 4, 15)
    data = []

    for seq_id in seq_ids:
        for i in range(5):
            date_time = start_time + timedelta(minutes=15 * i)
            value = np.random.uniform(1, 100)
            data.append({'seq_id': seq_id, 'datetime': date_time, 'value': value})

    df = pd.DataFrame(data)
    return df

real_data = mock_data()
print(real_data.head(100))

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_data)
metadata.update_column(column_name='seq_id', sdtype='id')
metadata.set_sequence_key('seq_id')
metadata.set_sequence_index('datetime')
start_time = datetime.now()
synthesizer = PARSynthesizer(metadata, verbose=True, epochs=128)
synthesizer.fit(real_data)
end_time = datetime.now()

synthetic_data = synthesizer.sample(num_sequences=9, sequence_length=5)
print(synthetic_data.head(100))

Output:

    seq_id            datetime      value
0   900001 2023-10-28 04:15:00  88.343075
1   900001 2023-10-28 04:30:00  24.453783
2   900001 2023-10-28 04:45:00  66.201311
3   900001 2023-10-28 05:00:00  68.288793
4   900001 2023-10-28 05:15:00  10.555130
5   900002 2023-10-28 04:15:00  80.262662
6   900002 2023-10-28 04:30:00  24.370064
7   900002 2023-10-28 04:45:00  44.250974
8   900002 2023-10-28 05:00:00  64.370600
9   900002 2023-10-28 05:15:00  45.912854
10  900003 2023-10-28 04:15:00  25.695243
11  900003 2023-10-28 04:30:00  36.785977
12  900003 2023-10-28 04:45:00  59.255933
13  900003 2023-10-28 05:00:00  83.631524
14  900003 2023-10-28 05:15:00  34.161183
15  900004 2023-10-28 04:15:00  76.622192
16  900004 2023-10-28 04:30:00  97.635065
17  900004 2023-10-28 04:45:00  50.463373
18  900004 2023-10-28 05:00:00  53.162506
19  900004 2023-10-28 05:15:00  45.024679
20  900005 2023-10-28 04:15:00  55.967175
21  900005 2023-10-28 04:30:00  60.319006
22  900005 2023-10-28 04:45:00   4.969765
23  900005 2023-10-28 05:00:00  86.819081
24  900005 2023-10-28 05:15:00  30.062426
25  900006 2023-10-28 04:15:00  24.383062
26  900006 2023-10-28 04:30:00  73.899661
27  900006 2023-10-28 04:45:00  24.390200
28  900006 2023-10-28 05:00:00  51.452548
29  900006 2023-10-28 05:15:00  15.362983
30  900007 2023-10-28 04:15:00   6.963062
31  900007 2023-10-28 04:30:00  23.488596
32  900007 2023-10-28 04:45:00  71.267673
33  900007 2023-10-28 05:00:00  87.326087
34  900007 2023-10-28 05:15:00  89.441880
35  900008 2023-10-28 04:15:00   5.404834
36  900008 2023-10-28 04:30:00  56.468166
37  900008 2023-10-28 04:45:00  84.345769
38  900008 2023-10-28 05:00:00  62.231551
39  900008 2023-10-28 05:15:00   4.202525
40  900009 2023-10-28 04:15:00  73.981153
41  900009 2023-10-28 04:30:00  96.336560
42  900009 2023-10-28 04:45:00  46.061168
43  900009 2023-10-28 05:00:00  58.442370
44  900009 2023-10-28 05:15:00  74.846215
Loss (-0.589): 100%|██████████| 128/128 [00:00<00:00, 177.78it/s]
100%|██████████| 9/9 [00:00<00:00, 288.40it/s]
       seq_id            datetime      value
0   636572584 2023-10-28 04:15:00  58.651855
1   636572584 2023-10-28 04:30:00  71.247476
2   636572584 2023-10-28 04:45:00  80.382582
3   636572584 2023-10-28 05:00:00  58.768539
4   636572584 2023-10-28 05:15:00   4.202525
5   705351915 2023-10-28 04:15:00  57.333614
6   705351915 2023-10-28 04:30:00  48.996900
7   705351915 2023-10-28 04:45:00  26.126351
8   705351915 2023-10-28 05:00:00  19.670256
9   705351915 2023-10-28 05:15:00  28.732383
10  698301954 2023-10-28 04:15:00  70.868821
11  698301954 2023-10-28 04:30:00  35.946020
12  698301954 2023-10-28 04:45:00  45.832996
13  698301954 2023-10-28 05:00:00  55.259220
14  698301954 2023-10-28 05:15:00  24.919739
15  162314092 2023-10-28 04:15:00  52.038490
16  162314092 2023-10-28 04:30:00  68.241808
17  162314092 2023-10-28 04:45:00  56.938387
18  162314092 2023-10-28 05:00:00  38.176355
19  162314092 2023-10-28 05:15:00  48.828152
20  601353867 2023-10-28 04:15:00  12.429452
21  601353867 2023-10-28 04:30:00  24.435029
22  601353867 2023-10-28 04:45:00  69.493305
23  601353867 2023-10-28 05:00:00  29.973742
24  601353867 2023-10-28 05:15:00  29.952920
25  597864398 2023-10-28 04:15:00  87.737777
26  597864398 2023-10-28 04:30:00  50.875915
27  597864398 2023-10-28 04:45:00  97.635065
28  597864398 2023-10-28 05:00:00  52.373688
29  597864398 2023-10-28 05:15:00  75.213350
30  522040997 2023-10-28 04:15:00  33.702249
31  522040997 2023-10-28 04:30:00  70.472768
32  522040997 2023-10-28 04:45:00  44.026007
33  522040997 2023-10-28 05:00:00  77.789348
34  522040997 2023-10-28 05:15:00  57.519564
35  679899017 2023-10-28 04:15:00  48.479804
36  679899017 2023-10-28 04:30:00  40.817928
37  679899017 2023-10-28 04:45:00  60.656329
38  679899017 2023-10-28 05:00:00  37.039939
39  679899017 2023-10-28 05:15:00  31.768680
40  813610428 2023-10-28 04:15:00  45.273490
41  813610428 2023-10-28 04:30:00  66.579771
42  813610428 2023-10-28 04:45:00  69.789106
43  813610428 2023-10-28 05:00:00  10.598702
44  813610428 2023-10-28 05:15:00  43.095843

This code is a simple demo of PARSynthesizer. It shows that the seq_id values in synthetic_data don't match those in real_data.

Expectation: provide a solution that ensures the ID values of my synthetic_data and real data are consistent, not just in format but completely identical in value.


srinify commented 2 months ago

Hi @jalr4ever I'm curious to learn more -- why is generating the same ID values important for your use case?

One challenge is that you may have 5 sequence_key values in your real data but request 500 sequences from the trained synthesizer, which creates an ambiguous situation for what the remaining 495 sequence_key values should be.

If the rough format of the generated sequence_key values is important to you, you can specify a regular expression string in the column's sdtype metadata: https://docs.sdv.dev/sdv/reference/metadata-spec/sdtypes#id This workflow has some guardrails built in because of the ambiguity that arises when the real data has a small set of sequence_key values and a potentially large set is requested in the synthetic data (e.g. if you define a regex format that only allows 2-digit numeric values but ask for 1000 sequences, SDV will throw an error).
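To make the guardrail concrete, here is a minimal sketch of why a narrow pattern cannot serve a large request. The column entry follows the shape of the `id` sdtype in the SDV metadata spec (`sdtype` plus `regex_format`), but the pattern, the `count_matches` helper, and the brute-force counting are illustrative assumptions, not how SDV performs its own check.

```python
import itertools
import re

# Column metadata in the shape used by SDV's metadata spec for an ID column.
# The pattern is a made-up example, not one taken from this thread.
seq_id_column = {"sdtype": "id", "regex_format": "[0-9]{2}"}


def count_matches(pattern, alphabet, length):
    """Count the fixed-length strings over `alphabet` that fully match `pattern`."""
    compiled = re.compile(pattern)
    candidates = ("".join(chars) for chars in itertools.product(alphabet, repeat=length))
    return sum(1 for candidate in candidates if compiled.fullmatch(candidate))


# Hypothetical helper call: enumerate every 2-character digit string.
capacity = count_matches(seq_id_column["regex_format"], "0123456789", 2)
requested_sequences = 1000

print(f"distinct IDs the pattern can produce: {capacity}")
print(f"enough for {requested_sequences} sequences: {capacity >= requested_sequences}")
```

A two-digit pattern yields at most 100 distinct keys, so a request for 1000 sequences cannot be satisfied with unique IDs.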

npatki commented 2 months ago

Hi @jalr4ever, adding to @srinify's comment: SDV's synthetic data is designed to create brand new entities that are not necessarily analogous to any entity in your real data. So each synthetic sequence represents an entirely new entity -- it does not map to any one, analogous real sequence. If the desire is to have a fully complete, 1-to-1 mapping between a real sequence and synthetic sequence (same sequence ID), then I would suggest anonymizing the real data itself rather than creating synthetic data.

If you could describe your needs, we'd be happy to guide you to a solution. How are you planning to use the synthetic data, and what do the sequences represent?

jalr4ever commented 2 months ago

Hi, @srinify @npatki. Thank you both for your replies. To put it simply, I need to find out what the sequences in my original data correspond to in the synthetic data. My scenario is as follows: I need to provide a report that can show a comparative chart of the distribution similarities between the sequences in the original data and those in the synthetic data. Therefore, I need an "ID" column that allows for a one-to-one correspondence between the original and synthetic data so that I can calculate the distribution of data corresponding to each sequence ID and create plots from it.

Regarding the 5-495 issue mentioned by @srinify : Actually, I am not interested in synthetic data that exceeds the range of the original data sequence; currently, I will perform a groupby() on the ID column of the original data and then count(), passing it to SDV to generate a specified number of values that match my original data. So for the 5-495 issue, I understand that SDV simply does not have corresponding boundary control at this time. There is a design for non-sequential data with enforce_min_max_values=True, but there is no such design as enforce_max_sequence_num=True for sequential data.
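The groupby-and-count step described above might look roughly like the sketch below. It uses plain Python so it runs without extra packages; with pandas the per-sequence counts would be `real_data.groupby('seq_id').size()`. The toy rows are assumptions standing in for the real table.

```python
from collections import Counter

# Toy rows standing in for real_data; only the sequence key matters here.
rows = [
    {"seq_id": seq_id, "value": float(i)}
    for seq_id in (900001, 900002, 900003)
    for i in range(5)
]

# Number of rows observed for each sequence key.
rows_per_sequence = Counter(row["seq_id"] for row in rows)

# How many sequences to request so the synthetic data does not exceed the real data.
num_sequences = len(rows_per_sequence)

# Use a fixed length only when every real sequence has the same length.
distinct_lengths = set(rows_per_sequence.values())
sequence_length = distinct_lengths.pop() if len(distinct_lengths) == 1 else None

print(num_sequences, sequence_length)
```

In the issue's example these two numbers correspond to the `num_sequences` and `sequence_length` arguments passed to `synthesizer.sample(...)`.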

Overall, my requirements can be divided into two aspects: first, support for boundary control like enforce_max_sequence_num=True; second, support for one-to-one mapping of sequence IDs. It would be best to provide control options so that synthetic data corresponds directly to real data. If that's not possible, then please provide a mapping list that maps sequence IDs to original data IDs one-to-one.
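For the second request, a mapping list can be built after sampling, outside of SDV. The sketch below pairs synthetic IDs with real IDs in order of first appearance. This pairing is arbitrary: it only relabels the sequences and does not mean that a synthetic sequence was derived from the real sequence whose ID it receives. The `build_id_mapping` helper is hypothetical and not part of SDV.

```python
def build_id_mapping(real_ids, synthetic_ids):
    """Pair each distinct synthetic ID with a distinct real ID, in order of first appearance."""
    real_unique = list(dict.fromkeys(real_ids))
    synthetic_unique = list(dict.fromkeys(synthetic_ids))
    if len(synthetic_unique) > len(real_unique):
        raise ValueError("more synthetic sequences than real sequence IDs to reuse")
    return dict(zip(synthetic_unique, real_unique))


# Sequence-key columns, shortened from the output shown in the issue.
real_ids = [900001, 900001, 900002, 900002, 900003, 900003]
synthetic_ids = [636572584, 636572584, 705351915, 705351915, 698301954, 698301954]

mapping = build_id_mapping(real_ids, synthetic_ids)

# Relabel the synthetic key column with the reused real IDs.
relabelled = [mapping[synthetic_id] for synthetic_id in synthetic_ids]

print(mapping)
print(relabelled)
```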

npatki commented 1 month ago

Hi @jalr4ever, unfortunately the PARSynthesizer is not designed to ever learn or create an exact 1-to-1 analogous mapping.

To illustrate this, see the example table in our docs page. In this example, each Patient ID is a sequence. The synthetic data is designed to represent brand new patients that do not correspond to any 1 original patient and health-related sequences for each one. It is not designed to recycle the same patients that are already in the real data.

I would love to understand a bit more about your use case. Why is it necessary to have the exact same sequence IDs? What does each sequence ID represent in your data and how are you planning to use the synthetic data after creating it?

If it is a matter of showing a report, we can recommend some different metrics and visualizations that are more attuned to multi-sequence data (where you do not have an analogous 1-to-1 mapping).

jalr4ever commented 1 month ago

Hi @npatki. In fact, we will use this data for machine learning, but how do we assess the reliability of this data? In non-time-series data, there are metrics that can abstract the "Shape" of the data (KSComplement). I would like to print out the "Shape" corresponding to each sequence in temporal data for comparison as well.
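For reference, KSComplement is one minus the two-sample Kolmogorov-Smirnov statistic, i.e. one minus the largest gap between the two empirical CDFs. A minimal standalone sketch of that computation follows; it is not the SDMetrics implementation, and the sample values are made up.

```python
from bisect import bisect_right


def ks_complement(real_values, synthetic_values):
    """Return 1 - D, where D is the two-sample Kolmogorov-Smirnov statistic."""
    real_sorted = sorted(real_values)
    synthetic_sorted = sorted(synthetic_values)
    max_gap = 0.0
    # The largest gap between two empirical CDFs always occurs at an observed value.
    for point in real_sorted + synthetic_sorted:
        real_cdf = bisect_right(real_sorted, point) / len(real_sorted)
        synthetic_cdf = bisect_right(synthetic_sorted, point) / len(synthetic_sorted)
        max_gap = max(max_gap, abs(real_cdf - synthetic_cdf))
    return 1.0 - max_gap


print(ks_complement([1, 2, 3], [1, 2, 3]))        # identical samples
print(ks_complement([1, 2, 3, 4], [3, 4, 5, 6]))  # partially overlapping samples
print(ks_complement([1, 2, 3], [10, 11, 12]))     # disjoint samples
```

A score of 1.0 means the two samples have the same empirical distribution, and 0.0 means they do not overlap at all. Applying it to one real and one synthetic sequence only makes sense if the two sequences are meant to correspond.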

npatki commented 1 month ago

Hi @jalr4ever, just out of curiosity: If you are planning to use the data for machine learning, I assume you have a train/validation/test data setup. Is it the case that your validation/test data always has the same sequence IDs as the real data? What about any new data for which you'd want to make a prediction?

As for metrics and visualization:

  1. Since you already have a machine learning use case in mind, I think the best "metric" here might be to directly measure the ROI. E.g. what is the predictive accuracy before vs. after using synthetic data?
  2. I would recommend looking into our original PARSynthesizer paper. In section 4.2, we describe a framework called MSAS (Multi-Sequence Aggregate Similarity) that aims to answer exactly the question you have. Unfortunately, this metric is not yet available in SDMetrics but we hope to add it soon!
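The general idea behind an aggregate comparison of this kind can be sketched without any 1-to-1 mapping: summarize every sequence with one statistic, then compare the distribution of that statistic across real and synthetic sequences. The code below is a rough illustration under that assumption. The choice of statistic (per-sequence mean), the KS-based comparison, and all function names are assumptions; the paper's exact definition may differ, and none of this is an SDMetrics API.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean


def ks_complement(a, b):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gaps = (
        abs(bisect_right(a_sorted, x) / len(a_sorted) - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )
    return 1.0 - max(gaps)


def per_sequence_statistic(rows, key, column, statistic):
    """Group rows by sequence key and reduce each sequence's column to one number."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[key]].append(row[column])
    return [statistic(values) for values in grouped.values()]


def aggregate_similarity(real_rows, synthetic_rows, key, column, statistic=mean):
    """Compare the distributions of a per-sequence statistic; IDs are never matched."""
    real_stats = per_sequence_statistic(real_rows, key, column, statistic)
    synthetic_stats = per_sequence_statistic(synthetic_rows, key, column, statistic)
    return ks_complement(real_stats, synthetic_stats)


# Toy data: three real and three synthetic sequences with unrelated IDs.
real_rows = [{"seq_id": s, "value": v} for s, vs in
             {1: [10, 20], 2: [30, 40], 3: [50, 60]}.items() for v in vs]
synthetic_rows = [{"seq_id": s, "value": v} for s, vs in
                  {"a": [12, 18], "b": [33, 37], "c": [70, 80]}.items() for v in vs]

score = aggregate_similarity(real_rows, synthetic_rows, key="seq_id", column="value")
print(round(score, 3))
```

Passing `statistic=len` instead of the default would compare sequence lengths rather than means.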

jalr4ever commented 1 month ago

@npatki Hi, thank you for your suggestion. I will take a look at the MSAS metric. Currently, our training is actually focused on individual sequences; we train a prediction model for each sequence and perform test/train data splitting based on the sequence data, which means that the sequence IDs in the data are the same. Therefore, we want to know which original sequence corresponds to the sequences in the synthetic data so that we can understand which original sequence this model represents.

npatki commented 1 month ago

Thank you for your comments @jalr4ever. Very helpful.

In your case, I'm not entirely sure if synthetic data is the right approach, as synthetic data is inherently designed to create brand new sequences belonging to entirely new entities. If the desire is to have only the same sequences, I am thinking perhaps anonymizing or noising the data would be sufficient (rather than synthetic data)?
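As a rough illustration of what anonymizing and noising could mean while keeping a 1-to-1 link between sequences, the sketch below replaces each sequence ID with a stable keyed pseudonym and perturbs the numeric values. It uses only the Python standard library, not SDV. The secret key, noise scale, and helper names are assumptions, and Gaussian noise on its own gives no formal privacy guarantee.

```python
import hashlib
import hmac
import random


def pseudonymize(seq_id, secret_key):
    """Map a sequence ID to a stable pseudonym; the same input always gives the same output."""
    digest = hmac.new(secret_key, str(seq_id).encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def add_noise(value, scale, rng):
    """Perturb a numeric value with zero-mean Gaussian noise."""
    return value + rng.gauss(0.0, scale)


rows = [
    {"seq_id": 900001, "value": 88.343075},
    {"seq_id": 900001, "value": 24.453783},
    {"seq_id": 900002, "value": 80.262662},
]

secret_key = b"replace-with-a-real-secret"  # hypothetical key, kept by the data owner
rng = random.Random(0)  # seeded only to make the sketch reproducible

# Rows that could be shared: pseudonymous IDs and perturbed values.
shared_rows = [
    {"seq_id": pseudonymize(row["seq_id"], secret_key),
     "value": add_noise(row["value"], 1.0, rng)}
    for row in rows
]

# Lookup table the data owner keeps privately to trace pseudonyms back to original IDs.
lookup = {pseudonymize(row["seq_id"], secret_key): row["seq_id"] for row in rows}

print(shared_rows)
```

Because the pseudonym depends only on the ID and the key, every row of a sequence keeps the same label, and the owner of `lookup` can map results back to the original sequences.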

May I ask why you are unable to train/test on the real sequences? Is it a matter of privacy, or do you simply not have long enough sequences for the task?

jalr4ever commented 3 weeks ago

@npatki Yes, our current solution involves anonymization. We implemented this due to privacy concerns when sharing data between departments.

npatki commented 2 weeks ago

Hi @jalr4ever if you're interested in pure anonymization or perturbations of the existing data, there's a chance that the RDT library may help. It allows you to transform the existing data, and has a few features for anonymization.

If your team ever wants to explore creating brand new sequences (e.g. to test out a variety of diverse scenarios, or scale up your data) we'd gladly help you to explore synthetic data solutions with PARSynthesizer.

For now, I'm closing off the issue, but please feel free to reply if there is more to discuss and I can always re-open. (Alternatively, file a new issue for a new topic.) Thanks.