As described in issue #1310, the single table sample method has the following functionality:
If you specify an output_filepath, the sampling function will write to the filepath after every batch, ensuring that you do not lose previously sampled data if there is a crash
If you do not specify an output_filepath, then the sampling function will still save batched results to .sample.csv.temp (still want to ensure that you have access to some data in the event of a crash). Once the sampling completes, we delete this file.
Problems:
When you specify that output_filepath=None, SDV creates a temporary file anyway. This is unintuitive for users who assume that None means that no output filepath will be provided. It is also unintuitive that SDV deletes the file after-the-fact.
In the event of a crash, there is a message asking users to check .sample.csv.temp, which is unexpected. This was meant to give the user some data rather than nothing. But practice there are always 0 rows in this file, because the default batch_size is kept equal to the sample size.
There is no way to turn off the functionality for saving to a file. Users in both #1310 and #2029 mention that they'd like to turn it off because it does not work well within their filesystem (they are not worried about crashes during sampling).
Expected behavior
If the output_filepath=None, we should no longer create or write to a temporary file. This will align with the user expectations and it will enable more forms of usage. In the event of a crash, update the error message. There will no longer be a file to reference.
If output_filepath is provided, we should create and periodically write to the file as we do today. In the event of a crash, the error message should continue to ask the user to check the file.
Default: We can make the default None. In the future, we should figure out a better default for batch_size for very large samples. In such a case, it would make sense to have a output filepath supplied as a default.
Problem Description
As described in issue #1310, the single table
sample
method has the following functionality:output_filepath
, the sampling function will write to the filepath after every batch, ensuring that you do not lose previously sampled data if there is a crashoutput_filepath
, then the sampling function will still save batched results to.sample.csv.temp
(still want to ensure that you have access to some data in the event of a crash). Once the sampling completes, we delete this file.Problems:
output_filepath=None
, SDV creates a temporary file anyway. This is unintuitive for users who assume thatNone
means that no output filepath will be provided. It is also unintuitive that SDV deletes the file after-the-fact..sample.csv.temp
, which is unexpected. This was meant to give the user some data rather than nothing. But practice there are always 0 rows in this file, because the defaultbatch_size
is kept equal to the sample size.Expected behavior
output_filepath=None
, we should no longer create or write to a temporary file. This will align with the user expectations and it will enable more forms of usage. In the event of a crash, update the error message. There will no longer be a file to reference.output_filepath
is provided, we should create and periodically write to the file as we do today. In the event of a crash, the error message should continue to ask the user to check the file.Default: We can make the default
None
. In the future, we should figure out a better default forbatch_size
for very large samples. In such a case, it would make sense to have a output filepath supplied as a default.