sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Encountering a ValueError, invalid literal for int() #2284

Open npatki opened 2 weeks ago

npatki commented 2 weeks ago

Filing this question on behalf of a user from a private thread.

Error Description

After Preprocess… Learning relationships… Modeling Tables, we are getting this error message -- ValueError: invalid literal for int() with base 10: 'sdv-pii-nv9ci'

npatki commented 2 weeks ago

Root cause of error

The particular error (invalid literal for int()) indicates a mismatch between your metadata and your data.

In particular, it indicates that a column is listed in your metadata as PII or unknown, but is actually stored as an integer in the data. This becomes a problem if the PII concept should generally be a string (e.g. a person's name), because SDV is trying to convert it back to an integer.
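
The message itself can be reproduced in plain Python, without SDV. The sketch below only illustrates the symptom: it assumes that a placeholder string like the one in the error message is being cast to the column's original integer type. It is not SDV's internal code.

```python
# Illustration only: this is not SDV's internal code.
# The placeholder value is copied from the error message in the report.
placeholder = 'sdv-pii-nv9ci'

try:
    # Casting a non-numeric string to int raises the same ValueError.
    int(placeholder)
except ValueError as error:
    message = str(error)

print(message)
```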

Are there any column(s) in your data/metadata that match the description above?
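
One way to look for such columns is to compare each column's metadata entry against how the column is stored. The helper below is a minimal sketch, not part of SDV: `find_suspect_columns` is a hypothetical name, and the `column_metadata` dictionary is an assumed shape (column name mapped to its `sdtype` and an optional `pii` flag) that you would fill in from your own metadata. With pandas, the dtype mapping can be built with `df.dtypes.astype(str).to_dict()`.

```python
def find_suspect_columns(column_metadata, column_dtypes):
    """Return columns marked as PII or unknown that are stored as integers.

    column_metadata: dict of column name -> dict with 'sdtype' and optional 'pii'
                     (assumed shape, for illustration only)
    column_dtypes:   dict of column name -> dtype name as a string, e.g. 'int64'
    """
    suspects = []
    for name, info in column_metadata.items():
        dtype = column_dtypes.get(name, '').lower()
        # Covers 'int64', 'uint8' and pandas nullable integers such as 'Int64'.
        is_integer = dtype.startswith(('int', 'uint'))
        is_pii_or_unknown = info.get('sdtype') == 'unknown' or info.get('pii') is True
        if is_integer and is_pii_or_unknown:
            suspects.append(name)
    return suspects


# Hypothetical example columns, not taken from the reported dataset.
column_metadata = {
    'customer_code': {'sdtype': 'unknown', 'pii': True},
    'ssn': {'sdtype': 'ssn', 'pii': True},
    'age': {'sdtype': 'numerical'},
    'name': {'sdtype': 'name', 'pii': True},
}
column_dtypes = {
    'customer_code': 'int64',
    'ssn': 'Int64',
    'age': 'int64',
    'name': 'object',
}

suspects = find_suspect_columns(column_metadata, column_dtypes)
print(suspects)
```

Any column this flags is a candidate for the mismatch described above; it is worth checking whether its sdtype really reflects what the column contains.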

Additional debugging questions

Despite this root cause, we made a fix for this issue in #2064. Starting from SDV 1.15.0, small mismatches between the data and metadata should no longer matter.

So there are a few additional questions that would be helpful to debug this:

  1. Which version of SDV were you using? If you are not on the latest version of SDV, please upgrade and try running again. You can find the SDV version by running the code below:
    import sdv
    print(sdv.__version__)
  2. If you are able to isolate the column, it would be helpful to share the following info:
    • What is the metadata for that particular column? (sdtype as well as other info)
    • How is the data being stored in the column? (e.g. is it an integer or a float?)
    • What kind of synthesizer are you using? (e.g. GaussianCopula, HMA, etc.)
  3. Any other code snippets of how you are loading in the data and running the SDV code will be useful. (You do not need to share the data/metadata itself.)