sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Conditional sampling with empty conditions #337

Open fealho opened 3 years ago

fealho commented 3 years ago

Problem Description

When conditional sampling is called with an empty set of conditions, an unreadable error is thrown.

Expected behavior

It should either simply return samples or throw an error saying that no conditions were passed.

Additional context

The code below shows an example of this happening:

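import pandas as pd
from sdv.tabular import CTGAN

# Small toy dataset with one numerical and one categorical column.
data = pd.DataFrame({
    "column1": [1.0, 0.5, 2.5] * 10,
    "column2": ["a", "b", "c"] * 10
})

model = CTGAN(epochs=1)
model.fit(data)

# Conditions with the right column but zero rows.
conditions = pd.DataFrame({
    "column2": []
})
sampled = model.sample(conditions=conditions)  # raises the unreadable error

csala commented 3 years ago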

I lean towards just returning a DataFrame with no rows in it rather than throwing an error.
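
As a rough illustration of this option, here is a minimal sketch. `sample_or_empty` and its parameters are hypothetical names made up for this example and are not part of SDV; the lambda is a stand-in for a real model's sampling method.

```python
import pandas as pd


def sample_or_empty(sample_fn, conditions, columns):
    """Hypothetical helper, not part of SDV.

    Returns a zero-row DataFrame with the expected columns when no
    conditions are given; otherwise defers to ``sample_fn``.
    """
    if len(conditions) == 0:
        # No rows were requested, so return an empty frame instead of failing.
        return pd.DataFrame(columns=list(columns))
    return sample_fn(conditions)


# Usage with a placeholder sampler (a real model's sampling call would go here).
empty_conditions = pd.DataFrame({"column2": []})
result = sample_or_empty(
    sample_fn=lambda c: c.assign(column1=0.0),  # placeholder, not a real model
    conditions=empty_conditions,
    columns=["column1", "column2"],
)
print(result.shape)  # (0, 2)
```

The caller gets back a frame with the expected columns and no rows, so downstream code that iterates over or concatenates the result keeps working.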

npatki commented 2 years ago

The new API exposes a sample_remaining_columns method for this use case. If you pass in an empty DataFrame, a ValueError is thrown:

from sdv.demo import load_tabular_demo
from sdv.tabular import GaussianCopula
import pandas as pd

data = load_tabular_demo('student_placements')
model = GaussianCopula()
model.fit(data)

conditions = pd.DataFrame({
    'gender': [],
})
model.sample_remaining_columns(conditions)

Output:

ValueError: No objects to concatenate

An error seems appropriate, since this is not the intended usage, but the message could be more descriptive. We could change it to:

Error: Data is empty. Please input a DataFrame with 1 or more rows.
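
A minimal sketch of what such a check could look like, assuming it runs before any sampling starts. `validate_conditions` is a hypothetical name used only for illustration; this is not SDV's actual implementation, and the choice of `ValueError` is an assumption.

```python
import pandas as pd


def validate_conditions(conditions):
    """Hypothetical pre-sampling check, not SDV's actual implementation."""
    if not isinstance(conditions, pd.DataFrame):
        raise TypeError("Conditions must be a pandas DataFrame.")
    if len(conditions) == 0:
        # Fail early with a readable message instead of letting an
        # internal concatenation error surface to the user.
        raise ValueError(
            "Data is empty. Please input a DataFrame with 1 or more rows."
        )
    return conditions


# Usage: an empty conditions frame now produces the descriptive message.
try:
    validate_conditions(pd.DataFrame({"gender": []}))
except ValueError as error:
    message = str(error)
    print(message)
```

Checking the row count up front means the user sees the cause (no rows) rather than a symptom from deeper in the sampling code.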