sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

If no filepath is provided, do not create a file during `sample` #2042

Closed npatki closed 3 months ago

npatki commented 4 months ago

Problem Description

As described in issue #1310, the single-table `sample` method has the following functionality:

Problems:

  1. When you specify that output_filepath=None, SDV creates a temporary file anyway. This is unintuitive for users who assume that None means that no output filepath will be provided. It is also unintuitive that SDV deletes the file after the fact.
  2. In the event of a crash, there is a message asking users to check .sample.csv.temp, which is unexpected. This was meant to give the user some data rather than nothing. But in practice there are always 0 rows in this file, because the default batch_size is kept equal to the sample size.
  3. There is no way to turn off the functionality for saving to a file. Users in both #1310 and #2029 mention that they'd like to turn it off because it does not work well within their filesystem (they are not worried about crashes during sampling).
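
To illustrate problem 2, here is a minimal sketch of why the temporary file ends up empty. It is not SDV's code: `rows_saved_before_crash` and `crash_at_row` are hypothetical names, and the model assumes rows reach the file only after a whole batch has finished.

```python
def rows_saved_before_crash(num_rows, batch_size=None, crash_at_row=None):
    """Toy model (an assumption, not SDV's implementation): rows are flushed
    to the temp file only once a complete batch has been generated."""
    # Mirrors the default described above: batch_size equals the sample size.
    batch_size = num_rows if batch_size is None else batch_size
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    saved = 0
    while saved < num_rows:
        batch_end = min(saved + batch_size, num_rows)
        # A crash inside the current batch loses that whole batch.
        if crash_at_row is not None and saved <= crash_at_row < batch_end:
            return saved
        saved = batch_end
    return saved


# Default batch size: the single batch never completes, so nothing is saved.
print(rows_saved_before_crash(1000, crash_at_row=500))
# Smaller batches: every batch completed before the crash is kept.
print(rows_saved_before_crash(1000, batch_size=100, crash_at_row=550))
```

Under these assumptions, the recovery file is only useful when batch_size is smaller than the number of rows requested.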

Expected behavior

Default: We can make the default None. In the future, we should figure out a better default for batch_size for very large samples. In such a case, it would make sense to have an output filepath supplied as a default.
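
A rough sketch of the proposed default, assuming that None should mean no file is touched at all. The `sample` function below is a hypothetical stand-in written with the standard library only. It is not SDV's implementation, and the `id` and `value` columns are made up for illustration.

```python
import csv
import random


def sample(num_rows, batch_size=None, output_filepath=None):
    """Hypothetical stand-in for a sampler; not SDV's implementation."""
    batch_size = num_rows if batch_size is None else batch_size
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    rows = []
    handle = None
    writer = None
    # Only open a file when the caller explicitly asked for one.
    if output_filepath is not None:
        handle = open(output_filepath, "w", newline="")
        writer = csv.writer(handle)
        writer.writerow(["id", "value"])  # illustrative column names

    try:
        while len(rows) < num_rows:
            start = len(rows)
            count = min(batch_size, num_rows - start)
            batch = [(start + i, random.random()) for i in range(count)]
            rows.extend(batch)
            if writer is not None:
                # Persist each finished batch so a later crash keeps it.
                writer.writerows(batch)
                handle.flush()
    finally:
        if handle is not None:
            handle.close()
    return rows
```

With `output_filepath=None` the function never opens a file, so there is nothing to delete afterwards. Passing a path together with a batch_size smaller than num_rows keeps the crash-recovery benefit for users who want it.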