[Closed] mjbooo closed this issue 1 year ago
Dear @mjbooo,
Thank you for spotting and reporting the issue.
Certainly, we need to address this. I will work on resolving it when I have time. Alternatively, I'm open to receiving a PR.
Thank you again for reporting the issue!
The code has been updated. GReaT now generates samples with missing values if they exist in the original dataset. To skip missing values, pass drop_nan=True to the sample function:
synthetic_data = model.sample(n_samples=200, drop_nan=True)
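For readers unfamiliar with the flag, drop_nan=True effectively filters out generated rows that contain missing values before the samples are returned. A minimal pandas sketch of that post-processing step (the column names and data here are illustrative, not taken from GReaT's internals):

```python
import numpy as np
import pandas as pd

# Illustrative generated samples; two rows contain a missing value
df = pd.DataFrame({
    "age": [34, np.nan, 51],
    "income": [42000, 38000, np.nan],
})

# What drop_nan=True effectively does: keep only complete rows
complete = df.dropna()
print(len(complete))  # 1 -- only the fully populated row survives
```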
Hello, authors! I wanted to express my gratitude for the excellent work you've done!
I do have a question regarding the generation of missing values. I'm a bit puzzled about how the GReaT model handles the generation of missing values (NaN) in the current implementation.
When I input a DataFrame, GReaT automatically converts it into text like 'column1 is value1, column2 is value2, ...'. If value1 happens to be a null value, the resulting text becomes 'column1 is None, column2 is value2, ...'. Then, it goes into the LLM backbone (GPT2).
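The textual encoding described above can be reproduced with a small, self-contained pandas snippet (the column names and values are made up for illustration; this is not GReaT's actual serialization code):

```python
import pandas as pd

# A row where column1 holds a null value
df = pd.DataFrame({"column1": [None], "column2": ["value2"]})

def row_to_text(row):
    # Serialize a row as 'col is value, col is value, ...'
    return ", ".join(f"{col} is {row[col]}" for col in row.index)

text = df.apply(row_to_text, axis=1).iloc[0]
print(text)  # column1 is None, column2 is value2
```

This is exactly the string that then goes into the LLM backbone, so 'None' is fed to the model as a literal token sequence.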
However, I've noticed that the 'column1 is None' part gets dropped by the code below (# Remove rows with flawed numerical values). When pd.to_numeric is applied with errors='coerce' to the string 'None', the string cannot be parsed, so it is silently converted to NaN (no error is raised). Then, because only non-null values are kept via .notnull(), every row that originally contained a null value is dropped here.
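The filtering behavior can be demonstrated in isolation with plain pandas (a sketch of the mechanism, not the library's exact code):

```python
import pandas as pd

# Parsed samples: one row has the string 'None' in a numeric column
df = pd.DataFrame({
    "column1": ["None", "3.5"],
    "column2": ["value2", "value2"],
})

# errors='coerce' turns unparseable strings into NaN; it does not raise
numeric = pd.to_numeric(df["column1"], errors="coerce")

# Selecting only non-null parsed values drops the 'None' row entirely
filtered = df[numeric.notnull()]
print(len(filtered))  # 1 -- the row with the missing value is gone
```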
I recently experimented with the .fit() and .sample() functions, trying to generate a DataFrame from data with several columns containing missing values (e.g., the 'sick' dataset).
If I've made any mistakes in my understanding, please let me know.
Once again, thank you for your assistance!