sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Unexpected null values in `sequence_index` column #2276

Open npatki opened 5 days ago

npatki commented 5 days ago

Environment Details

Error Description

This bug was first described in #2241 (which contains the metadata and more conversation). A summary is pasted below --

The issue is that the date (not data) column, which is the sequence_index in the metadata, is produced with 36% null values. I've been using the MissingValueSimilarity metric from SDMetrics, and it shows good results for the similarity of null values in other columns. However, for the column representing dates/sequence_index, the model sometimes produces a lot of null values, resulting in a MissingValueSimilarity score below 1.0. In other words, the real data does not have many nulls, but the sampled data has relatively more.
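To make the reported score concrete, here is a minimal pure-Python sketch of how a missing-value similarity score of this kind can be computed. It assumes the score is one minus the absolute difference between the null fractions of the real and synthetic columns. The function name and the sample lists are hypothetical stand-ins, not the SDMetrics API, and the real column is assumed to have no nulls purely for illustration.

```python
def missing_value_similarity(real_values, synthetic_values):
    """Sketch: 1 - |null fraction (real) - null fraction (synthetic)|."""

    def null_fraction(values):
        values = list(values)
        if not values:
            raise ValueError("Cannot compute a null fraction for an empty column")
        return sum(value is None for value in values) / len(values)

    return 1.0 - abs(null_fraction(real_values) - null_fraction(synthetic_values))


# Hypothetical columns: the real column has no nulls (an assumption),
# while 36% of the synthetic column is null, as reported above.
real_dates = ["2024-01-01"] * 100
synthetic_dates = ["2024-01-01"] * 64 + [None] * 36

score = missing_value_similarity(real_dates, synthetic_dates)
print(round(score, 2))  # 0.64
```

Under these assumptions, a column with 36% unexpected nulls scores about 0.64, while a column whose null rate matches the real data scores 1.0.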

The date column is listed as a sequence_index with the following metadata info:

    "date": {
      "sdtype": "datetime",
      "datetime_format": "%Y-%m-%d"
    },

Steps to reproduce

See #2241 for more information.

npatki commented 5 days ago

Hi @ardulat, let's use this thread to specifically discuss the null values in your sequence index column (date).

I've been trying to reproduce this with the train.csv and metadata you provided in this comment but I'm just not able to -- every time I create synthetic data, the date column is completely filled out (no nulls are produced).

Interestingly, this was an issue in prior versions of SDV, but we fixed it as of SDV version 1.13.0. So just as a sanity check, could you check your SDV version and let us know if it's the latest?

import sdv
print(sdv.__version__)
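If it helps, the printed version string can be compared against 1.13.0 with a small helper. This is only a sketch: the helper names are hypothetical, it looks only at the leading numeric part of the version string, and it ignores pre-release suffixes, so a string such as `1.13.0rc1` would be treated like `1.13.0`.

```python
import re


def version_tuple(version, width=3):
    """Parse the leading numeric components, e.g. '1.13.0' -> (1, 13, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    parts = [int(part) for part in match.group(0).split(".")]
    # Pad short versions such as '1.13' with zeros so comparisons line up.
    parts += [0] * (width - len(parts))
    return tuple(parts)


def has_sequence_index_fix(installed, fixed_in="1.13.0"):
    """True if the installed version is at or after the version with the fix."""
    return version_tuple(installed) >= version_tuple(fixed_in)


print(has_sequence_index_fix("1.9.2"))   # False
print(has_sequence_index_fix("1.13.0"))  # True
```

In practice, the argument would be the value of `sdv.__version__` printed above.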

Next steps for debugging

Sometimes, manipulating the data in Python before fitting the PARSynthesizer can lead to issues. I'm curious how you are loading your data, and whether you are modifying it in any way before fitting PAR. For example:

import pandas as pd
from sdv.sequential import PARSynthesizer

data = pd.read_csv('train.csv') # is this how you are loading in your data? or are you using another method?

# are you modifying the data in any way before creating PAR and fitting it?
# TODO

synthesizer = PARSynthesizer(metadata, context_columns=['gender', 'date_of_birth'])
synthesizer.fit(data)
...

On a similar note, I'm curious what the storage type (dtype) is for the column that ultimately goes into fit.

print(data['date'].dtype)
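Alongside the dtype, it may be worth ruling out raw values that do not match the `%Y-%m-%d` format declared in the metadata. This is only one hypothesis and has not been confirmed as the cause. The sketch below uses only the standard library, and the function name and the `sample` list are hypothetical stand-ins for the raw strings in the date column.

```python
from datetime import datetime


def nonconforming_dates(values, fmt="%Y-%m-%d"):
    """Return (position, value) pairs that are missing or fail to parse with fmt."""
    bad = []
    for position, value in enumerate(values):
        if value is None or value == "":
            bad.append((position, value))
            continue
        try:
            datetime.strptime(value, fmt)
        except (ValueError, TypeError):
            # ValueError: wrong format or impossible date; TypeError: not a string.
            bad.append((position, value))
    return bad


# Hypothetical raw values, as they might appear in the CSV before any parsing.
sample = ["2024-01-31", "2024/02/01", "", None, "2024-02-30"]
print(nonconforming_dates(sample))
```

An empty result would suggest the raw values already agree with the declared format, which would point the investigation elsewhere.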

Finally, since most of our data manipulation is done in pandas or numpy, would it be possible to share the versions of these libraries that you have installed?

import numpy as np
import pandas as pd

print('Numpy version:', np.__version__)
print('Pandas version:', pd.__version__)

Thank you, and hope to get to the bottom of this soon!