sdv-dev / SDGym

Benchmarking synthetic data generation methods.
Other
252 stars 58 forks source link

rfc: replace synthesizer function with class #89

Closed sbrugman closed 1 year ago

sbrugman commented 3 years ago

Description

SDGym's synthesizers all inherit from the Baseline class (or BaseSynthesizer class in previous versions). Users can provide custom synthesizer functions. The convenience inheritance is demonstrated throughout SDGym's code base and has all sort of other benefits. My suggestion would be to make the following changes:

These changes provide consistency between SDGym's native and user-provided synthesizers and clear distinction between fit and sample logic, at nearly no cost:

def synthesizer_function(real_data: dict[str, pandas.DataFrame],
                         metadata: sdv.Metadata) -> real_data: dict[str, pandas.DataFrame]:
    ...
    # do all necessary steps to learn from the real data
    # and produce new synthetic data that resembles it
    ...
    return synthetic_data

will become

from sdgym.synthesizers.base import Baseline

class MySynthesizer(Baseline):
    def fit(self, real_data: dict[str, pandas.DataFrame], metadata: sdv.Metadata) -> None:
        # ...
        # do all necessary steps to learn from the real data
        # ...

    def sample(self, n_samples: int) -> dict[str, pandas.DataFrame]:
        # and produce new synthetic data that resembles it
        return synthetic_data

More interestingly, this structure allows for capturing valuable metrics that are currently out of reach related to fit/sampling time and complexity (time measurements or maybe even this package). SDGym would this way be able to benchmark this aspect of a synthesizer as well, which can be an important decision criterion for which synthesizer is best for a given use case: if the user expects to sample large quantities of data then a longer fitting time would be acceptable at a lower sampling complexity.

The code that needs to be changed for this is minimal, however I wanted to make sure you see value in this point before drafting a PR.

sbrugman commented 3 years ago

@csala Any thoughts on this proposal?

sbrugman commented 2 years ago

@leix28 @katxiao would a PR be welcome?

csala commented 2 years ago

Hi @sbrugman I think that for now the exact change that you are proposing is not within the current SDGym roadmap, but some variation is:

My suggestion would be to make the following changes:

  • All synthesizers should inherit from a synthesizer base class (Baseline)
  • All synthesizers should implement a separate fit and sample method

To add some more context to it, the reason for which the required input is a function instead of a class is that wrapping a class-based synthesizer that follows the fit/sample abstraction within a single function that receives real data, runs fit/sample internally and returns a synthetic clone is far easier than the opposite approach of trying to adapt a functional based synthesizer into this fit/sample abstraction. Also, the current implementation of SDGym already supports class-based synthesizers that inherit from the Baseline class, so making this a hard requirement does not really expand the support, it only narrows it.

On the other hand, it is true that this support for class-based synthesizer is not really documented, so adding it to the docs would be interesting!

More interestingly, this structure allows for capturing valuable metrics that are currently out of reach related to fit/sampling time and complexity (time measurements or maybe even this package). SDGym would this way be able to benchmark this aspect of a synthesizer as well, which can be an important decision criterion for which synthesizer is best for a given use case: if the user expects to sample large quantities of data then a longer fitting time would be acceptable at a lower sampling complexity.

This is another story, and it could actually be interesting too! An option that would be acceptable without sacrificing the functional input, would be to modify the code to capture such metrics only when the provided synthesizer is a Baseline subclass. We could make it so that model_time stays the same and is always reported for all synthesizer, but for Baseline subclasses two new columns, fit_time and sample_time, are added to the output table.

sbrugman commented 2 years ago

Hi Carles, thanks for getting back at this. The clarification on why you would not like to impose the fit/sample abstraction on users is helpful. The backwards-compatibility argument also makes sense. Good to know that we can move forward on by documenting the class-based synthesizers and with the conditional extension of the benchmark with metrics on whether the implementation is Baseline-based or otherwise.

npatki commented 1 year ago

Hello, I'm jumping in here a few years after this initial conversation. Since 2021, we have made significant updates to the usage/API of SDGym as well as its functionality. And I believe that some key features that were discussed here have now been incorporated into the library.

  1. You can now supply a custom synthesizer by supplying 2 separate functions for fit and sample. For more information, see the custom synthesizer user guide. We will automatically use these functions to create a class for you.
  2. The benchmarking script now reports more granular results, such as time (fit time, sample time, evaluation time) and memory usage. See interpreting the results.

So I'm going to mark this issue as (more-or-less) resolved.

Apologies if I've overlooked any of the finer points in the discussion. If there is more to add, I'd recommend filing a new issue and we can start a new discussion based on the vision and capabilities of the newest SDGym library. Thanks!