hitsz-ids / synthetic-data-generator

SDG is a specialized framework designed to generate high-quality structured tabular data.
Apache License 2.0

Bugfix: fix Gaussian copula segmentation fault error #180

Closed MooooCat closed 5 months ago

MooooCat commented 5 months ago

sweep-ai[bot] commented 5 months ago

Sweep: PR Review

Authors of pull request: @MooooCat, @pre-commit-ci[bot]

sdgx/models/statistics/single_table/base.py

Renamed the class SynthesizerModel to StatisitcSynthesizerModel to give it a more descriptive name.

Sweep Found These Issues

  • The class name StatisitcSynthesizerModel contains a typo and should be corrected to StatisticSynthesizerModel.
  • https://github.com/hitsz-ids/synthetic-data-generator/blob/2491498cc7696ae6cb78356a18fc39d7fd9d771a/sdgx%2Fmodels%2Fstatistics%2Fsingle_table%2Fbase.py#L12 [View Diff](https://github.com/hitsz-ids/synthetic-data-generator/pull/180/files#diff-bc21248690a57473ccf8d2a81b17dd9a75174af4eb10085dd10ea6a8ae455fa7R12)
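One way to correct the spelling without breaking code that already imports the misspelled name is to keep it as a deprecated alias. The following is only a sketch: both classes are simplified stand-ins rather than the real sdgx classes, and the `fit` signature is assumed from the review text.

```python
import warnings


class StatisticSynthesizerModel:
    """Stand-in base class with the corrected spelling (not the real sdgx class)."""

    def fit(self, metadata, dataloader):
        raise NotImplementedError


class StatisitcSynthesizerModel(StatisticSynthesizerModel):
    """Misspelled name kept as a deprecated alias for existing subclasses."""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs whenever a class inherits from the misspelled name.
        warnings.warn(
            "StatisitcSynthesizerModel is misspelled; "
            "inherit from StatisticSynthesizerModel instead",
            DeprecationWarning,
            stacklevel=2,
        )
```

Subclasses of the old name keep working and remain instances of the corrected base class, so the alias could be dropped in a later release.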

sdgx/models/statistics/single_table/copula.py

Updated the fit method to use Metadata and DataLoader, changed the base class to StatisitcSynthesizerModel, and modified the column validation process.

Sweep Found These Issues

  • The fit method now requires metadata and dataloader arguments, so callers that omit them or pass incorrectly formatted objects may fail.
  • https://github.com/hitsz-ids/synthetic-data-generator/blob/2491498cc7696ae6cb78356a18fc39d7fd9d771a/sdgx%2Fmodels%2Fstatistics%2Fsingle_table%2Fcopula.py#L133-L172 [View Diff](https://github.com/hitsz-ids/synthetic-data-generator/pull/180/files#diff-8e189898fff802d8811a04155e8183502588ff88a643598c7f5c0c437487478fR133-R172)
  • The fit method now depends on DataLoader and Metadata objects, which could introduce bugs if these objects are not correctly instantiated or if their methods do not behave as expected.
  • https://github.com/hitsz-ids/synthetic-data-generator/blob/2491498cc7696ae6cb78356a18fc39d7fd9d771a/sdgx%2Fmodels%2Fstatistics%2Fsingle_table%2Fcopula.py#L133-L172 [View Diff](https://github.com/hitsz-ids/synthetic-data-generator/pull/180/files#diff-8e189898fff802d8811a04155e8183502588ff88a643598c7f5c0c437487478fR133-R172)
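A defensive check at the top of `fit` is one way to surface these problems early. The sketch below uses hypothetical stand-ins for `Metadata` and `DataLoader`; their fields (`discrete_columns`, `columns`) and the helper `check_fit_inputs` are assumptions for illustration, not the sdgx API.

```python
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Hypothetical stand-in; not the real sdgx Metadata class."""
    discrete_columns: set = field(default_factory=set)


@dataclass
class DataLoader:
    """Hypothetical stand-in; not the real sdgx DataLoader class."""
    columns: list = field(default_factory=list)


def check_fit_inputs(metadata, dataloader):
    """Fail fast with a clear message instead of deep inside model fitting."""
    if not isinstance(metadata, Metadata):
        raise TypeError(f"metadata must be Metadata, got {type(metadata).__name__}")
    if not isinstance(dataloader, DataLoader):
        raise TypeError(f"dataloader must be DataLoader, got {type(dataloader).__name__}")
    # Every column declared discrete must actually exist in the data.
    unknown = set(metadata.discrete_columns) - set(dataloader.columns)
    if unknown:
        raise ValueError(f"discrete columns missing from data: {sorted(unknown)}")
```

Calling `check_fit_inputs(metadata, dataloader)` as the first statement of `fit` would turn a missing or mismatched argument into an immediate, readable error.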

sdgx/synthesizer.py

Added support for StatisitcSynthesizerModel in the Synthesizer class and removed the unpacking of model_fit_kwargs in the fit method.

Sweep Found These Issues

  • The removal of model_fit_kwargs in the fit method call could omit important parameters that influence the model fitting process.
  • https://github.com/hitsz-ids/synthetic-data-generator/blob/2491498cc7696ae6cb78356a18fc39d7fd9d771a/sdgx%2Fsynthesizer.py#L307-L311 [View Diff](https://github.com/hitsz-ids/synthetic-data-generator/pull/180/files#diff-4986c21b4838a07a79f5aa48e24519b2330512c810ed7d32e7e96900c370ec11R307-R311)

Potential Issues

Sweep isn't 100% sure if the following are issues or not but they may be worth taking a look at.

  • The removal of **(model_fit_kwargs or {}) in the fit method could omit important parameters that influence the model fitting process, potentially leading to unexpected behavior or reduced functionality.
  • https://github.com/hitsz-ids/synthetic-data-generator/blob/2491498cc7696ae6cb78356a18fc39d7fd9d771a/sdgx%2Fsynthesizer.py#L311 [View Diff](https://github.com/hitsz-ids/synthetic-data-generator/pull/180/files#diff-4986c21b4838a07a79f5aa48e24519b2330512c810ed7d32e7e96900c370ec11R311)
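The effect of dropping the unpacking can be illustrated with a small sketch. All function names and parameters here (`fit_model`, `epochs`, `batch_size`) are hypothetical and only mimic the pattern described in the review; they are not taken from sdgx/synthesizer.py.

```python
def fit_model(data, epochs=1, batch_size=32):
    """Hypothetical model-side fit; returns the settings it actually received."""
    return {"rows": len(data), "epochs": epochs, "batch_size": batch_size}


def synthesizer_fit(data, model_fit_kwargs=None):
    # `or {}` turns None into an empty dict, so ** unpacking is always valid.
    return fit_model(data, **(model_fit_kwargs or {}))


def synthesizer_fit_without_kwargs(data, model_fit_kwargs=None):
    # Same call with the unpacking removed: caller-supplied settings
    # are accepted but silently ignored.
    return fit_model(data)
```

With the unpacking in place, `synthesizer_fit(rows, {"epochs": 5})` reaches the model with `epochs=5`; without it, the model falls back to its defaults and no error is raised, which is why the change is easy to miss.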

tests/models/test_copula.py

Updated the test_gaussian_copula function to use new metadata and data loader fixtures, and refactored the GaussianCopulaSynthesizer instantiation and fit method calls accordingly.

Potential Issues

Sweep isn't 100% sure if the following are issues or not but they may be worth taking a look at.

  • Assigning discrete_cols to model.discrete_cols looks incorrect: discrete_cols is a generator, so its items are not directly accessible.
  • https://github.com/hitsz-ids/synthetic-data-generator/blob/2491498cc7696ae6cb78356a18fc39d7fd9d771a/tests%2Fmodels%2Ftest_copula.py#L23 [View Diff](https://github.com/hitsz-ids/synthetic-data-generator/pull/180/files#diff-bc8922767c08a1cebdf02e21a31b74911fbca2d26db9053be00ac0581889112dR23)
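If discrete_cols really is a generator, the usual remedy is to materialize it before storing it on the model. The sketch below shows the general Python behavior; the helper `discrete_columns` and the sample column names are made up for illustration and do not come from tests/models/test_copula.py.

```python
def discrete_columns(dtypes):
    """Hypothetical helper: yield names of columns with a non-numeric dtype."""
    for name, dtype in dtypes.items():
        if dtype not in ("int", "float"):
            yield name


dtypes = {"age": "int", "workclass": "str", "education": "str"}

as_generator = discrete_columns(dtypes)
first_pass = list(as_generator)   # consumes the generator
second_pass = list(as_generator)  # already exhausted, so nothing is left

# Materializing once gives a value that can be indexed and reused.
as_list = list(discrete_columns(dtypes))
```

A generator also cannot be indexed (`as_generator[0]` raises `TypeError`), so storing `as_list` on the model is the safer choice when the value is read more than once.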

iokk3732 commented 5 months ago

fix done