sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

InvalidDataError when fitting datetime columns as context columns in PARSynthesizer #2115

Closed ardulat closed 3 months ago

ardulat commented 4 months ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

Error Description

I am encountering an InvalidDataError similar to the one in issue https://github.com/sdv-dev/SDV/issues/1772. I am passing the participant__date_of_birth column as a context column to the PARSynthesizer, and it fails during a call to the .fit method. Here is the full error message:

sdv.errors.InvalidDataError: The provided data does not match the metadata:
Invalid values found for datetime column 'participant__date_of_birth': [-1.1264832e+18, -1.1267424e+18, -1.1268288e+18, '+ 5116 more'].

I tried debugging the .fit method of the PARSynthesizer and found that it calls the .preprocess function, which alters my data. Here is an example:

Input data:

0        1918-01-13
1        1918-01-13
2        1918-01-13
3        1918-01-13
4        1918-01-13
            ...
134358   1930-05-17
134359   1930-05-17
134360   1930-05-17
134361   1930-05-17
134362   1930-05-17
Name: participant__date_of_birth, Length: 134363, dtype: datetime64[ns]

The same data after the call at https://github.com/sdv-dev/SDV/blob/main/sdv/single_table/base.py#L471:

primary_key
1        -1.639958e+18
2        -1.639958e+18
3        -1.639958e+18
4        -1.639958e+18
5        -1.639958e+18
              ...
134359   -1.250554e+18
134360   -1.250554e+18
134361   -1.250554e+18
134362   -1.250554e+18
134363   -1.250554e+18
Name: participant__date_of_birth, Length: 134363, dtype: float64

Can you please suggest a good way to handle datetime columns in the context?

npatki commented 4 months ago

Hi @ardulat, nice to meet you.

There is currently a known issue #1485 specifically for context columns that are datetimes. Fortunately, there is a workaround you can use in the meantime which I've presented in this comment.
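The linked comment is not reproduced in this thread, but based on the conversion described later (integer Unix timestamps before fitting, datetimes again after sampling), a minimal sketch of the workaround looks like this. The column name comes from this thread; the tiny DataFrame is an illustrative stand-in for the real training data:

```python
import pandas as pd

COLUMN_NAME = "participant__date_of_birth"
data = pd.DataFrame({COLUMN_NAME: pd.to_datetime(["1918-01-13", "1930-05-17"])})

# Before fitting: represent the datetime context column as int64 nanoseconds
# since the Unix epoch, so the synthesizer treats it as a numerical column.
data[COLUMN_NAME] = data[COLUMN_NAME].astype("int64")

# ... instantiate PARSynthesizer, fit, and sample here ...
synthetic_data = data.copy()  # stand-in for the sampled output

# After sampling: convert the integers back to nanosecond-precision datetimes.
synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit="ns")
```

The round trip is lossless because datetime64[ns] values are stored internally as int64 nanosecond counts.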

Let me know if my understanding is off or if the workaround does not do the trick. It will always be helpful if you can share your code (where you instantiate and use PARSynthesizer) as well as your metadata. Thanks.

ardulat commented 4 months ago

Hi, @npatki likewise. Thank you for your quick reply!

The workaround from the comment helps: the synthesizer trains and generates without any issues, and the generated distribution is similar to the original data. However, I don't think it's fully correct to train the synthesizer on timestamp data for birthdates, since the result is dates precise to the nanosecond, which is unlikely to occur in real data. I hope this can be solved in future versions. Thank you again!

npatki commented 4 months ago

Hi @ardulat you're welcome. The workaround I have just converts the Unix timestamps back to a datetime at the highest possible precision (nanoseconds). This is done intentionally so that you can round it off later to the nearest second, day, or whatever other precision level you desire.

For example, if you need precision at the second (instead of nanoseconds) you can do this:

synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1s')

Or for a day:

synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1d')

Hope that helps.
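As a quick runnable illustration of the rounding step (the timestamp below is made up; note that on a pandas Series the frequency-based rounding goes through the `.dt` accessor):

```python
import pandas as pd

# A made-up sampled value: integer nanoseconds since the Unix epoch.
sampled = pd.Series([pd.Timestamp("1918-01-13 07:45:12.123456789").value])

as_datetime = pd.to_datetime(sampled, unit="ns")
to_second = as_datetime.dt.round("1s")  # 1918-01-13 07:45:12
to_day = as_datetime.dt.round("1d")     # 1918-01-13 00:00:00 (07:45 rounds down)
```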

npatki commented 3 months ago

Hi @ardulat, I'm going to close this issue out since we have a workaround. I'll defer to the original issue of #1485 for a fix. (FYI at the time of this writing, it looks like this issue is closed, meaning that the fix will be available in the upcoming SDV release.)

ardulat commented 1 month ago

Hi @npatki!

As you suggested, I converted dates in the context column to timestamps as follows:

# Spark code
for col_name in self.context_columns:
    if self.context_df.schema[col_name].dataType == T.DateType():
        self.converted_date_columns.add(col_name)  # Save for further usage
        processed_context_df = processed_context_df.withColumn(
            col_name, F.to_timestamp(F.col(col_name), "yyyy-MM-dd")
        )

# Pandas code
for col_name in self.context_columns:
    if output_df[col_name].dtype == "datetime64[ns]":
        # The column is already datetime64[ns], so convert it directly to
        # int64 nanoseconds since the Unix epoch. ("int" would be platform
        # dependent; "int64" is explicit.)
        output_df[col_name] = output_df[col_name].astype("int64")

Then, I convert the sampled timestamps back to dates:

# Convert timestamp columns in the context back to datetime
for col_name in self.converted_date_columns:
    data[col_name] = pd.to_datetime(data[col_name], unit="ns").dt.date

As a result, I am getting irrelevant dates ranging from the year 1677 to the year 2253:

[Screenshot: distribution of the generated dates, spanning roughly 1677 to 2253]
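As an aside not stated in the thread: the reported extremes (1677 to 2253) sit near the hard limits of pandas' datetime64[ns] dtype, i.e. what int64 nanoseconds since the epoch can represent, which suggests the model is sampling across nearly the full numeric range rather than the range of the training data:

```python
import pandas as pd

# The representable bounds of datetime64[ns] values in pandas.
low, high = pd.Timestamp.min, pd.Timestamp.max
print(low)   # 1677-09-21 00:12:43.145224193
print(high)  # 2262-04-11 23:47:16.854775807
```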

What can I do to produce relevant dates with distribution similar to the training data?

npatki commented 1 month ago

Hi @ardulat, since you first filed this bug, the underlying issue (#1485) has already been resolved. You no longer need to apply a workaround of converting the datetime columns to numerical. I would recommend upgrading your SDV version to the latest and re-trying the synthesis without any workarounds.

If you are continuing to have problems with it, please file a new issue so that we can take a look. Thanks.