sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License
1.99k stars 203 forks

Optionally defer Dask compute in dataprep.clean #822

Open amanderson opened 2 years ago

amanderson commented 2 years ago

Is your feature request related to a problem? Please describe. I'm happy to see that cleaning methods are implemented with Dask. I've noticed that most, if not all, cleaning methods include

    with ProgressBar(minimum=1, disable=not progress):
        df, stats = dask.compute(df, stats)

before returning a result. In practice, however, it is quite common to chain several cleaning methods and then manipulate the data further downstream. Because each method computes eagerly, Dask cannot optimise the task graph across the whole transformation pipeline: optimisation only happens within a single compute call, not across several.

Describe the solution you'd like When passing a Dask dataframe to a clean method, it would be nice to optionally defer the compute and get a Dask dataframe back. I understand this option would disable the progress bar and report, but these features are really only useful in an interactive notebook session.

Describe alternatives you've considered I've not come across alternatives.

qidanrui commented 2 years ago

Hi @amanderson, thanks for your advice! We are currently considering optimizing the progress bar and removing the report, so your suggestion is very useful to us. I agree that "it'd be quite common to use multiple cleaning methods in tandem with further manipulation of the data downstream". It would also be valuable for us to consider running multiple clean functions in parallel.