sdv-dev / SDV

Synthetic data generation for tabular data

Expose parameters from DeepEcho `PARSynthesizer` in SDV (eg. `data_types`) #1164

Open Mohamed209 opened 1 year ago

Mohamed209 commented 1 year ago

Environment details

Question description

I have a dataset where the real features seem to follow a NegativeBinomial distribution, so, per the paper,


I want to force the training loss for some features to use the NegativeBinomial distribution. However, given the following data type inference:


        for field in self._output_columns:
            dtype = timeseries_data[field].dtype
            kind = dtype.kind
            if kind in ('i', 'f'):
                data_type = 'continuous'
            elif kind in ('O', 'b'):
                data_type = 'categorical'
            else:
                raise ValueError(f'Unsupported dtype {dtype}')

every numeric feature is typed as continuous, so during training:

        for key, props in self._data_map.items():
            if props['type'] in ['continuous', 'timestamp']:
                mu_idx, sigma_idx, missing_idx = props['indices']
                mu = Y_padded[:, :, mu_idx]
                sigma = torch.nn.functional.softplus(Y_padded[:, :, sigma_idx])
                missing = torch.nn.LogSigmoid()(Y_padded[:, :, missing_idx])

                for i in range(batch_size):
                    dist = torch.distributions.normal.Normal(
                        mu[:seq_len[i], i], sigma[:seq_len[i], i])
                    log_likelihood += torch.sum(dist.log_prob(X_padded[-seq_len[i]:, i, mu_idx]))

                    p_true = X_padded[:seq_len[i], i, missing_idx]
                    p_pred = missing[:seq_len[i], i]
                    log_likelihood += torch.sum(p_true * p_pred)
                    log_likelihood += torch.sum((1.0 - p_true) * torch.log(
                        1.0 - torch.exp(p_pred)))

            elif props['type'] in ['count']:
                r_idx, p_idx, missing_idx = props['indices']
                r = torch.nn.functional.softplus(Y_padded[:, :, r_idx]) * props['range']
                p = torch.sigmoid(Y_padded[:, :, p_idx])
                x = X_padded[:, :, r_idx] * props['range']
                missing = torch.nn.LogSigmoid()(Y_padded[:, :, missing_idx])

                for i in range(batch_size):
                    dist = torch.distributions.negative_binomial.NegativeBinomial(
                        r[:seq_len[i], i], p[:seq_len[i], i], validate_args=False)
                    log_likelihood += torch.sum(dist.log_prob(x[:seq_len[i], i]))

                    p_true = X_padded[:seq_len[i], i, missing_idx]
                    p_pred = missing[:seq_len[i], i]
                    log_likelihood += torch.sum(p_true * p_pred)
                    log_likelihood += torch.sum((1.0 - p_true) * torch.log(
                        1.0 - torch.exp(p_pred)))

all of my features are modeled as Gaussian, which is not correct for my case.
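To make the difference between the two branches concrete, here is a minimal pure-Python sketch of the per-observation log-likelihood terms each one sums. It assumes the same parametrization as `torch.distributions.NegativeBinomial` (`total_count` r and success probability p). The function names and the sample counts are made up for illustration and are not part of SDV or DeepEcho.

```python
import math


def negative_binomial_log_pmf(x, r, p):
    """Log-probability of the count x under NegativeBinomial(total_count=r, probs=p)."""
    return (
        math.lgamma(x + r) - math.lgamma(x + 1) - math.lgamma(r)
        + r * math.log(1.0 - p) + x * math.log(p)
    )


def gaussian_log_pdf(x, mu, sigma):
    """Log-density of x under Normal(mu, sigma), the term used for 'continuous' columns."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2.0 * sigma ** 2)


# Hypothetical count observations, for illustration only.
counts = [0, 1, 1, 2, 7]

# Sum of per-step terms, analogous to what each branch adds to log_likelihood.
nb_log_likelihood = sum(negative_binomial_log_pmf(x, r=3.0, p=0.4) for x in counts)
gaussian_log_likelihood = sum(gaussian_log_pdf(x, mu=2.0, sigma=2.0) for x in counts)

print(nb_log_likelihood, gaussian_log_likelihood)
```

The NegativeBinomial term is a probability mass over non-negative integers, while the Gaussian term is a density over all reals, so a column typed as continuous is never scored as a count.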

Mohamed209 commented 1 year ago

It seems I found a workaround: using the PAR models from DeepEcho as a standalone library rather than through SDV. So the question now is: are there any plans to support passing a data_types dict to the PAR models from SDV?
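As a rough sketch of what is being asked for, the dtype-based inference quoted above could accept a user-supplied override dict. The helper below is hypothetical: `resolve_data_types`, its arguments, and the set of allowed type names (taken from the branches in the quoted code) are assumptions for illustration, not SDV or DeepEcho API.

```python
# Type names that appear in the quoted code; assumed here to be the allowed values.
ALLOWED_TYPES = {'continuous', 'categorical', 'count', 'timestamp'}


def resolve_data_types(dtype_kinds, overrides=None):
    """Infer a data type per column from its dtype kind, then apply user overrides.

    dtype_kinds maps column name -> NumPy dtype kind character ('i', 'f', 'O', 'b').
    overrides maps column name -> explicit data type, e.g. {'visits': 'count'}.
    """
    overrides = overrides or {}
    data_types = {}
    for column, kind in dtype_kinds.items():
        if column in overrides:
            # An explicit user choice wins over inference.
            chosen = overrides[column]
            if chosen not in ALLOWED_TYPES:
                raise ValueError(f'Unsupported data type {chosen!r} for column {column!r}')
            data_types[column] = chosen
        elif kind in ('i', 'f'):
            data_types[column] = 'continuous'
        elif kind in ('O', 'b'):
            data_types[column] = 'categorical'
        else:
            raise ValueError(f'Unsupported dtype kind {kind!r} for column {column!r}')
    return data_types


# Hypothetical columns: 'visits' is an integer column the user knows is a count.
print(resolve_data_types({'visits': 'i', 'price': 'f', 'city': 'O'}, {'visits': 'count'}))
```

With no overrides, this reproduces the current behavior, where every integer or float column becomes continuous. With an override, the column would reach the count branch of the loss instead.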

npatki commented 1 year ago

Hi @Mohamed209, glad you found that the DeepEcho library had the settings you needed.

How about we turn this issue into a feature request for supporting all the PAR data_types through the SDV library? While this is not currently on our roadmap, this type of feedback will help us prioritize it in the future.

npatki commented 1 year ago

(Following the previous comment, I've re-titled this and marked it as a feature request)