Closed ardulat closed 3 months ago
Hi @ardulat nice to meet you.
There is currently a known issue #1485 specifically for context columns that are datetimes. Fortunately, there is a workaround you can use in the meantime which I've presented in this comment.
Let me know if my understanding is off or if the workaround does not do the trick. It will always be helpful if you can share your code (where you instantiate and use PARSynthesizer) as well as your metadata. Thanks.
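For context, the idea behind that workaround (details are in the linked comment) is to represent the datetime context column as integer Unix timestamps before fitting, then convert back after sampling. A minimal sketch of that idea, with an illustrative column name and data:

```python
import pandas as pd

# A context column stored as datetimes (illustrative data).
context = pd.DataFrame(
    {"date_of_birth": pd.to_datetime(["1990-05-01", "1985-12-24"])}
)

# Before fitting: represent the datetimes as int64 Unix timestamps (nanoseconds),
# so the synthesizer can treat the column as plain numerical context.
context["date_of_birth"] = context["date_of_birth"].astype("int64")

# After sampling: convert the integers back to datetimes at nanosecond precision.
context["date_of_birth"] = pd.to_datetime(context["date_of_birth"], unit="ns")
```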
Hi, @npatki likewise. Thank you for your quick reply!
The workaround from the comment helps, and the synthesizer trains and generates without any issues. The generated distribution is similar to the original data. However, I believe it's not entirely correct to train the synthesizer on timestamp data for the birthdates, since it ends up generating dates precise to the nanosecond, which is unlikely to occur in real-life data. I hope this can be solved in future versions. Thank you again!
Hi @ardulat you're welcome. The workaround I have just converts the Unix timestamps back to a datetime at the highest possible precision (nanoseconds). This is done intentionally so that you can round it off later to the nearest second, day, or whatever other precision level you desire.
For example, if you need precision at the second (instead of nanoseconds) you can do this:
synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1s')
Or for a day:
synthetic_data[COLUMN_NAME] = pd.to_datetime(synthetic_data[COLUMN_NAME], unit='ns').dt.round('1D')
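Putting the two lines together, here is a small runnable sketch (the column name and timestamp value are illustrative):

```python
import pandas as pd

# A synthetic output column of Unix timestamps in nanoseconds
# (one illustrative value, some instant in mid-1990).
synthetic_data = pd.DataFrame({"date_of_birth": [644_287_232_123_456_789]})

# Round to the nearest second...
to_second = pd.to_datetime(synthetic_data["date_of_birth"], unit="ns").dt.round("1s")

# ...or to the nearest day.
to_day = pd.to_datetime(synthetic_data["date_of_birth"], unit="ns").dt.round("1D")
```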
Hope that helps.
Hi @ardulat, I'm going to close this issue out since we have a workaround. I'll defer to the original issue of #1485 for a fix. (FYI at the time of this writing, it looks like this issue is closed, meaning that the fix will be available in the upcoming SDV release.)
Hi @npatki!
As you suggested, I converted dates in the context column to timestamps as follows:
# Spark code
for col_name in self.context_columns:
    if self.context_df.schema[col_name].dataType == T.DateType():
        self.converted_date_columns.add(col_name)  # Save for later back-conversion
        processed_context_df = processed_context_df.withColumn(
            col_name, F.to_timestamp(F.col(col_name), "yyyy-MM-dd")
        )
# Pandas code
for col_name in self.context_columns:
    if output_df[col_name].dtype == "datetime64[ns]":
        # Represent the datetimes as int64 Unix timestamps (nanoseconds)
        output_df[col_name] = pd.to_datetime(
            output_df[col_name], format="%Y-%m-%d"
        ).astype("int64")
Then, I convert the sampled timestamps back to dates:
# Convert timestamp columns in the context back to datetime
for col_name in self.converted_date_columns:
    data[col_name] = pd.to_datetime(data[col_name], unit="ns").dt.date
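(Note: the conversion itself is lossless in both directions; a minimal check with illustrative data, added for clarity:)

```python
import pandas as pd

# A small frame with a date-of-birth context column (illustrative data).
df = pd.DataFrame({"date_of_birth": pd.to_datetime(["1990-05-01", "1985-12-24"])})

# Forward: datetime64[ns] -> int64 nanoseconds since the Unix epoch.
as_ints = df["date_of_birth"].astype("int64")

# Backward: int64 nanoseconds -> datetime -> date. The original dates come back.
restored = pd.to_datetime(as_ints, unit="ns").dt.date
```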
As a result, I am getting irrelevant dates in the range from year 1677 to year 2253:
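(Aside, added for illustration: this range lines up almost exactly with the representable bounds of pandas nanosecond timestamps, which suggests the sampled integers are likely spanning most of the int64 range. A quick check:)

```python
import pandas as pd

# datetime64[ns] stores nanoseconds in an int64, so it can only represent
# roughly 584 years centered on the Unix epoch.
earliest = pd.Timestamp.min  # year 1677
latest = pd.Timestamp.max    # year 2262
```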
What can I do to produce relevant dates with distribution similar to the training data?
Hi @ardulat, since you first filed this bug, the underlying issue (#1485) has already been resolved. You no longer need to apply a workaround of converting the datetime columns to numerical. I would recommend upgrading your SDV version to the latest and re-trying the synthesis without any workarounds.
If you are continuing to have problems with it, please file a new issue so that we can take a look. Thanks.
Environment Details
Error Description
I am encountering an InvalidDataError similar to the issue: https://github.com/sdv-dev/SDV/issues/1772. I am passing the participant__date_of_birth column as a context column to the PARSynthesizer, which fails during a call to the .fit function. Here is the full error message:
I tried debugging the .fit method of the PARSynthesizer and found that there is a call to the .preprocess function, which alters my data. Here is an example:
Input data:
The input data after a call to https://github.com/sdv-dev/SDV/blob/main/sdv/single_table/base.py#L471:
Can you please suggest a good way to handle datetime columns in the context?