sfu-db / dataprep

Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
http://dataprep.ai
MIT License
1.99k stars, 203 forks

Can I define which characters should be treated as missing values? #852

Open Bowen0729 opened 2 years ago

Bowen0729 commented 2 years ago

In some cases, I treat "" or " " as a missing value. Can I define which characters should be treated as missing values?

jinglinpeng commented 2 years ago

Hi @Bowen0729, you can set the na_values parameter when you use pd.read_csv to load the dataframe.
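
For readers following along, here is a minimal sketch of what that suggestion might look like. The sample CSV text and the "unknown" marker are made up for illustration; only pd.read_csv and its na_values parameter come from the reply above.

```python
import io

import pandas as pd

# Hypothetical sample data: one city cell holds a single space,
# another holds "N/A" (already on pandas' default list of NA markers).
csv_text = "name,city\nalice, \nbob,N/A\ncarol,Paris\n"

# na_values adds extra strings to the default NA markers,
# so a lone space is now read as NaN as well.
df = pd.read_csv(io.StringIO(csv_text), na_values=[" ", "unknown"])

# Count missing cells per column to confirm what was picked up.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Note that an empty field ("") is already read as missing by default, so only the whitespace case needs an explicit entry here.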

Bowen0729 commented 2 years ago

Thank you for your reply. Sure I can, but my data comes from multiple sources, such as Parquet, Hive, and so on. I'm wondering whether dataprep could let users configure which values count as missing.

jinglinpeng commented 2 years ago

Thanks for the reply! I now understand the use case. Yeah, that's definitely something useful. I'm considering what the most efficient way to do this would be, since dataprep only processes Dask or pandas dataframes. Are you using df.replace as your current solution?

Bowen0729 commented 2 years ago

We use dataprep in a big data ecosystem. I previously contributed the doc on running dataprep on YARN (https://github.com/sfu-db/dataprep/issues/771), but pandas and Dask don't support some data sources, such as Apache Hudi.

Therefore, building on that, I made dataprep work with Spark dataframes through Ray, which helps integrate dataprep into the big data ecosystem.

This is my use case. I could add a doc for it if necessary.

And I think df.replace could actually solve my problem without modifying dataprep. I will do some tests, thank you.
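
A rough sketch of the df.replace approach mentioned here, assuming the data has already been loaded into a pandas dataframe from whatever source. The sample frame and the list of markers are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame standing in for data loaded from Parquet, Hive, etc.
df = pd.DataFrame(
    {"name": ["alice", "", "carol"], "city": [" ", "Paris", "  "]}
)

# Custom markers to treat as missing (an assumption for this example).
missing_markers = ["", " "]

# Exact-match replacement: only cells equal to a listed marker become NaN.
cleaned = df.replace(missing_markers, np.nan)

# Alternative: a regex that matches empty or whitespace-only strings
# of any length, so the two-space cell is caught too.
cleaned_regex = df.replace(r"^\s*$", np.nan, regex=True)

print(cleaned.isna().sum())
print(cleaned_regex.isna().sum())
```

Because this runs on the dataframe rather than at read time, it does not depend on where the data came from. Dask dataframes expose a similar replace method, so the same idea should carry over, though that is not tested here.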

jinglinpeng commented 2 years ago

@Bowen0729 Thanks for the info. May I know how you made dataprep support Spark dataframes with Ray? Did you modify the internal code?

Bowen0729 commented 2 years ago

No, I didn't modify the internal code. I just used RayDP (Spark on Ray, https://github.com/oap-project/raydp) to read a Spark dataframe and then converted it to a Dask dataframe with the Ray API. It is simple:

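import ray
import raydp
from dataprep.eda import create_report

ray.init()
# Start Spark on Ray; the init_spark arguments (app name, executor settings) are omitted here
spark = raydp.init_spark()
# Load the source data with Spark SQL (query left blank in the original)
spark_df = spark.sql("")
# Spark dataframe -> Ray dataset -> Dask dataframe
ray_df = ray.data.from_spark(spark_df)
dask_df = ray_df.to_dask()
# dataprep accepts the Dask dataframe directly
create_report(dask_df)
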
jinglinpeng commented 2 years ago

I see. Good to know this use case.

Bowen0729 commented 2 years ago

Is it necessary to add the doc for this case?

jinglinpeng commented 2 years ago

Yeah, I think it would be nice to have a use case doc. If you would like to contribute, you can add a notebook named use_case.ipynb in https://github.com/sfu-db/dataprep/tree/develop/docs/source/user_guide/eda, where you can write down this use case :)

datatalking commented 2 years ago

@Bowen0729 let me know if you would like to create that doc together. I've got a Spark dataframe I could test this on, or I can help write functions to support this feature.

Bowen0729 commented 2 years ago

Sure. The commit hasn't been merged yet. Is there anything wrong with it, @jinglinpeng?

jinglinpeng commented 2 years ago

Hi @Bowen0729, thanks for the reminder, I've merged the PR. There are some problems in the doc-build workflow, and we're fixing them.

Bowen0729 commented 2 years ago

Thanks! And what's your plan? Do you have some good ideas about this feature? I'd love to do it together @datatalking

datatalking commented 2 years ago

@Bowen0729 I'm pretty new to the repo and still learning how things work. Is there an existing list, or should we start one in Discussions (https://github.com/sfu-db/dataprep/discussions) or perhaps in Projects (https://github.com/sfu-db/dataprep/projects?type=beta)? We can build on what was already done in the Titanic and house price use cases.