Closed jalr4ever closed 2 weeks ago
Hi @jalr4ever I'm curious to learn more -- why is generating the same ID values important for your use case?
One challenge is that you may have 5 sequence_key values in your real data but request 500 sequences from the trained synthesizer, which creates an ambiguous situation for what the remaining 495 sequence_key values should be.
If the rough format of the generated sequence_key values is important to you, you can specify a regular expression string in your sdtype (https://docs.sdv.dev/sdv/reference/metadata-spec/sdtypes#id). This workflow has some guardrails built in because of the ambiguity that arises when the real data has a small set of sequence_key values and a potentially much larger set is requested in the synthetic data (e.g., if you define a regex format that only allows 2-digit numeric values but ask for 1000 sequences, SDV will throw an error).
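The guardrail described above can be illustrated with a small sketch. The metadata fragment follows the ID sdtype spec linked above (the `regex_format` key); the column name `seq_id`, the pattern, and the helper functions are hypothetical and are not part of SDV. The arithmetic only shows why a 2-digit pattern cannot supply 1000 unique keys.

```python
# Hypothetical metadata fragment for a sequence key column named "seq_id".
# The column name and the pattern are placeholders chosen for illustration.
seq_key_metadata = {"seq_id": {"sdtype": "id", "regex_format": "[0-9]{2}"}}


def digit_pattern_capacity(num_digits):
    """Number of distinct strings matched by the pattern [0-9]{num_digits}."""
    return 10 ** num_digits


def can_supply(num_digits, num_sequences):
    """True if the pattern has enough distinct values for every sequence."""
    return num_sequences <= digit_pattern_capacity(num_digits)


capacity = digit_pattern_capacity(2)   # distinct 2-digit keys available
fits_50 = can_supply(2, 50)            # 50 sequences fit within the pattern
fits_1000 = can_supply(2, 1000)        # 1000 sequences do not fit
```

Widening the pattern (for example to four digits) would raise the capacity to 10000 in this sketch, which is the kind of change the error is prompting for.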
Hi @jalr4ever, adding to @srinify's comment: SDV's synthetic data is designed to create brand new entities that are not necessarily analogous to any entity in your real data. So each synthetic sequence represents an entirely new entity -- it does not map to any one analogous real sequence. If the desire is to have a complete 1-to-1 mapping between a real sequence and a synthetic sequence (same sequence ID), then I would suggest anonymizing the real data itself rather than creating synthetic data.
If you could describe your needs, we'd be happy to guide you to a solution. How are you planning to use the synthetic data, and what do the sequences represent?
Hi, @srinify @npatki. Thank you both for your replies. To put it simply, I need to find out what the sequences in my original data correspond to in the synthetic data. My scenario is as follows: I need to provide a report that can show a comparative chart of the distribution similarities between the sequences in the original data and those in the synthetic data. Therefore, I need an "ID" column that allows for a one-to-one correspondence between the original and synthetic data so that I can calculate the distribution of data corresponding to each sequence ID and create plots from it.
Regarding the 5-495 issue mentioned by @srinify: Actually, I am not interested in synthetic data that exceeds the range of the original data sequences; currently, I perform a groupby() on the ID column of the original data and then count(), passing the result to SDV to generate a specified number of sequences that matches my original data. So for the 5-495 issue, I understand that SDV simply does not have corresponding boundary control at this time. There is a design for non-sequential data with enforce_min_max_values=True, but there is no such design as enforce_max_sequence_num=True for sequential data.
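The counting step described in this comment might look roughly like the sketch below. The table, the column names `seq_id` and `value`, and the variable names are made up for illustration; the synthesizer itself is not constructed here, so the sampling call appears only as a comment.

```python
import pandas as pd

# Toy stand-in for the real sequential data (hypothetical columns).
real_data = pd.DataFrame({
    "seq_id": ["A", "A", "A", "B", "B", "C"],
    "value": [1.0, 1.5, 2.0, 3.0, 3.5, 5.0],
})

# Rows per sequence, as in the groupby() + count() step.
rows_per_sequence = real_data.groupby("seq_id")["value"].count()

# Number of distinct sequences in the real data.
num_real_sequences = real_data["seq_id"].nunique()

# With a fitted PARSynthesizer (assumed, not built here), the same number of
# sequences could then be requested:
# synthetic_data = synthesizer.sample(num_sequences=num_real_sequences)
```

This caps the number of synthetic sequences at the number of real ones, but the generated sequence keys are still new values rather than copies of the real IDs.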
Overall, my requirements can be divided into two aspects: first, support for boundary control like enforce_max_sequence_num=True; second, support for one-to-one mapping of sequence IDs. It would be best to provide control options so that synthetic data corresponds directly to real data. If that's not possible, then please provide a mapping list that maps sequence IDs to original data IDs one-to-one.
Hi @jalr4ever, unfortunately the PARSynthesizer is not designed to ever learn or create an exact 1-to-1 analogous mapping.
To illustrate this, see the example table in our docs page. In this example, each Patient ID is a sequence. The synthetic data is designed to represent brand new patients that do not correspond to any 1 original patient and health-related sequences for each one. It is not designed to recycle the same patients that are already in the real data.
I would love to understand a bit more about your use case. Why do you need the exact same sequence IDs? What does each sequence ID represent in your data, and how are you planning to use the synthetic data after creating it?
If it is a matter of showing a report, we can recommend some different metrics and visualizations that are more attuned to multi-sequence data (where you do not have an analogous 1-to-1 mapping).
Hi @npatki. In fact, we will use this data for machine learning, but how do we assess the reliability of this data? In non-time-series data, there are metrics that can abstract the "Shape" of the data (KSComplement). I would like to print out the "Shape" corresponding to each sequence in temporal data for comparison as well.
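For reference, a KSComplement-style score is one minus the two-sample Kolmogorov-Smirnov statistic, so 1.0 means identical empirical distributions and 0.0 means fully separated ones. The sketch below is a plain-Python approximation written for illustration, not the SDMetrics implementation; the function name and the sample values are assumptions. Because synthetic sequences do not map to specific real ones, which pair of value lists to compare (for example, one real sequence against the pooled synthetic values) is a choice left to the user.

```python
from bisect import bisect_right


def ks_complement(real_values, synthetic_values):
    """Return 1 minus the two-sample KS statistic of two numeric samples."""
    real_sorted = sorted(real_values)
    synth_sorted = sorted(synthetic_values)
    # The empirical CDFs only change at observed points, so checking those
    # points is enough to find the largest gap between them.
    points = sorted(set(real_sorted) | set(synth_sorted))
    max_gap = max(
        abs(
            bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted)
        )
        for x in points
    )
    return 1.0 - max_gap


# Hypothetical values from one real sequence and from synthetic data.
real_sequence = [1.0, 2.0, 3.0, 4.0]
synthetic_pool = [1.0, 2.0, 3.0, 4.0]
score_same = ks_complement(real_sequence, synthetic_pool)
score_disjoint = ks_complement([1.0, 2.0], [3.0, 4.0])
```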
Hi @jalr4ever, just out of curiosity: If you are planning to use the data for machine learning, I assume you have a train/validation/test data setup. Is it the case that your validation/test data always has the same sequence IDs as the real data? What about any new data for which you'd want to make a prediction?
As for metrics and visualization:
@npatki Hi, thank you for your suggestion. I will take a look at the MSAS metric. Currently, our training is actually focused on individual sequences; we train a prediction model for each sequence and perform test/train data splitting based on the sequence data, which means that the sequence IDs in the data are the same. Therefore, we want to know which original sequence corresponds to the sequences in the synthetic data so that we can understand which original sequence this model represents.
Thank you for your comments @jalr4ever. Very helpful.
In your case, I'm not entirely sure synthetic data is the right approach, as synthetic data is inherently designed to create brand new sequences belonging to entirely new entities. If the desire is to have only the same sequences, perhaps anonymizing or noising the data would be sufficient (rather than synthetic data)?
May I ask why you are unable to train/test on the real sequences? Is it a matter of privacy, or do you simply not have long enough sequences for the task?
@npatki Yes, our current solution involves anonymization. We implemented this due to privacy concerns when sharing data between departments.
Hi @jalr4ever if you're interested in pure anonymization or perturbations of the existing data, there's a chance that the RDT library may help. It allows you to transform the existing data, and has a few features for anonymization.
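Independently of any particular library, the one-to-one ID mapping requested earlier in the thread is straightforward when the real sequences themselves are kept. The sketch below uses only the Python standard library and does not reflect the RDT API; the function name, the `seq_` prefix, and the sample rows are hypothetical.

```python
import secrets


def build_id_mapping(real_ids):
    """Map each distinct real ID to a unique random pseudonym."""
    mapping = {}
    used = set()
    for real_id in real_ids:
        if real_id in mapping:
            continue
        pseudonym = "seq_" + secrets.token_hex(4)
        # Regenerate on the (unlikely) chance of a collision.
        while pseudonym in used:
            pseudonym = "seq_" + secrets.token_hex(4)
        used.add(pseudonym)
        mapping[real_id] = pseudonym
    return mapping


# Hypothetical rows of (sequence ID, measurement).
real_rows = [("patient_1", 0.5), ("patient_1", 0.7), ("patient_2", 1.2)]

id_mapping = build_id_mapping(row[0] for row in real_rows)
anonymized_rows = [(id_mapping[seq_id], value) for seq_id, value in real_rows]
```

Keeping `id_mapping` private lets the data owner trace each shared sequence back to its original, while recipients only see the pseudonyms. Note that replacing IDs alone does not protect the other columns.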
If your team ever wants to explore creating brand new sequences (e.g., to test out a variety of diverse scenarios, or to scale up your data), we'd gladly help you explore synthetic data solutions with PARSynthesizer.
For now, I'm closing off the issue, but please feel free to reply if there is more to discuss and I can always re-open. (Alternatively, file a new issue for a new topic.) Thanks.
Problem Description
I want the sequence_key values in the data simulated by PARSynthesizer to be consistent with the original data. Currently, due to SDV requirements, sequence_key is specified as an ID type, and ID types generate random values, which does not meet my needs.
Expected behavior
Output:
This code shows a simple demo of PARSynthesizer, in which seq_id in synthetic_data doesn't match real_data.
Expectation: provide a solution that ensures the ID values of my synthetic_data and real_data are consistent, not just in format but completely identical in value.