Open npatki opened 5 days ago
Hi @ardulat, let's use this thread to specifically discuss the null values in your sequence index column (`date`).
I've been trying to reproduce this with the `train.csv` and metadata you provided in this comment, but I'm just not able to: every time I create synthetic data, the `date` column is completely filled out (no nulls are produced).
Interestingly, this was an issue in earlier versions of SDV, but we fixed it starting from SDV version 1.13.0. So just as a sanity check, could you check your SDV version and let us know if it's the latest?
import sdv
print(sdv.__version__)
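If it helps, the comparison against 1.13.0 can be done programmatically. The following is only a minimal sketch: `parse_release` and `is_at_least` are hypothetical helper names (not part of SDV), and the parsing deliberately ignores pre-release suffixes such as `rc1` or `.dev0`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_string):
    """Turn '1.13.0' into (1, 13, 0), keeping only the leading digits of each part."""
    parts = []
    for piece in version_string.split('.'):
        digits = ''
        for char in piece:
            if not char.isdigit():
                break
            digits += char
        if not digits:
            # Stop at the first non-numeric part, e.g. 'dev0'.
            break
        parts.append(int(digits))
    # Pad so that '1.13' and '1.13.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_at_least(installed, minimum):
    """Return True if `installed` is the same as or newer than `minimum`."""
    return parse_release(installed) >= parse_release(minimum)


try:
    installed_sdv = version('sdv')
    print('SDV version:', installed_sdv)
    print('At least 1.13.0:', is_at_least(installed_sdv, '1.13.0'))
except PackageNotFoundError:
    print('SDV is not installed in this environment')
```

Comparing tuples of integers avoids the string-comparison trap where `'1.9.0'` sorts after `'1.13.0'`.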
Sometimes, doing any data manipulation in Python (before fitting the PARSynthesizer) can lead to issues. I'm curious how you are loading your data, and whether you are modifying it in any way before fitting PAR. For example:
import pandas as pd
from sdv.sequential import PARSynthesizer
data = pd.read_csv('train.csv') # is this how you are loading in your data? or are you using another method?
# are you modifying the data in any way before creating PAR and fitting it?
# TODO
synthesizer = PARSynthesizer(metadata, context_columns=['gender', 'date_of_birth'])
synthesizer.fit(data)
...
On a similar note, I'm curious what the storage type (dtype) of the column is when it ultimately goes into `fit`.
print(data['date'].dtype)
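Beyond the dtype, it may also be worth checking whether the column already contains nulls, or values that do not parse as dates, before it reaches `fit`. Here is a rough sketch of such a check. The small DataFrame is a made-up stand-in for `train.csv`, and the `%Y-%m-%d` format is an assumption that should be swapped for whatever format the metadata declares.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the real file has more rows and columns.
data = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-02', None, 'not a date'],
})

# Storage type of the column as loaded.
print(data['date'].dtype)

# Nulls already present before any parsing or fitting.
nulls_before = data['date'].isna().sum()
print('Nulls as loaded:', nulls_before)

# errors='coerce' turns unparseable values into NaT instead of raising,
# so an increase in the null count flags values that are not valid dates.
parsed = pd.to_datetime(data['date'], format='%Y-%m-%d', errors='coerce')
nulls_after = parsed.isna().sum()
print('Nulls after parsing:', nulls_after)
```

If the second count is higher than the first, some values in the column are not matching the expected datetime format.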
Finally, since most of our data manipulation is done in pandas or numpy, would it be possible to share the versions of these libraries that you are using?
import numpy as np
import pandas as pd
print('Numpy version:', np.__version__)
print('Pandas version:', pd.__version__)
Thank you, and hope to get to the bottom of this soon!
Environment Details
Error Description
This bug was first described in #2241 (which contains the metadata and more conversation). A summary is pasted below:
The issue is that the date (not data) column, which is a sequence_index in the metadata, is producing 36% nulls. I've been using the MissingValueSimilarity metric from SDMetrics, and it shows good results for the similarity of null values in other columns. However, for the column representing dates/sequence_index, the model sometimes produces a lot of null values, resulting in a MissingValueSimilarity score below 1.0. In other words, the real data does not have many nulls, but the sampled data has relatively more.
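To make the effect on the score concrete, the sketch below compares null proportions by hand. It assumes the metric is computed as 1 minus the absolute difference between the real and synthetic null proportions. It is a hand-rolled illustration with made-up data, not the SDMetrics implementation, and the function names are hypothetical.

```python
import pandas as pd


def null_fraction(column):
    """Proportion of missing values in a column."""
    return float(column.isna().mean())


def missing_value_similarity(real_column, synthetic_column):
    """Assumed scoring rule: 1 - |real null proportion - synthetic null proportion|."""
    return 1.0 - abs(null_fraction(real_column) - null_fraction(synthetic_column))


# Made-up data: 100 real dates with no nulls.
real_dates = pd.Series(pd.date_range('2024-01-01', periods=100))

# Made-up synthetic column where 36 of the 100 values are missing.
synthetic_dates = real_dates.copy()
synthetic_dates.iloc[:36] = pd.NaT

score = missing_value_similarity(real_dates, synthetic_dates)
print('Real null fraction:', null_fraction(real_dates))
print('Synthetic null fraction:', null_fraction(synthetic_dates))
print('Similarity score:', round(score, 2))
```

Under that assumption, a real column with no nulls and a synthetic column with 36% nulls would score about 0.64 rather than 1.0.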
The `date` column is listed as a `sequence_index` with the following metadata info:
Steps to reproduce
See #2241 for more information.