Bowen0729 opened 2 years ago
Hi @Bowen0729 , you can set the na_values parameter when you use pd.read_csv to get the dataframe.
Thank you for your reply. Sure I can, but my data comes from multiple sources, such as Parquet, Hive, and so on. I'm wondering whether it is possible to configure in dataprep itself what the missing values are?
Thanks for the reply! I now understand the use case. Yeah, that's definitely something useful. I'm considering the most efficient way to do this, since dataprep only processes Dask or pandas dataframes. Did you use df.replace as the current solution?
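For reference, df.replace can map source-specific sentinel strings to real NaN values regardless of where the dataframe came from. A minimal sketch (the dataframe and the sentinels "N/A" and "missing" are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe loaded from any source (Parquet, Hive, ...)
# where "N/A" and "missing" are sentinel strings for missing values.
df = pd.DataFrame({"x": ["1", "N/A", "3"], "y": ["missing", "b", "c"]})

# df.replace maps the sentinel strings to real NaN before profiling.
clean = df.replace(["N/A", "missing"], np.nan)
print(int(clean.isna().sum().sum()))  # number of missing cells
```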
We use dataprep in a big-data ecosystem. I previously contributed the doc on running dataprep on YARN (https://github.com/sfu-db/dataprep/issues/771), but pandas and Dask can't support some data sources, such as Apache Hudi.
Therefore, on this basis, I made dataprep support Spark dataframes via Ray, which helps integrate dataprep into the big-data ecosystem.
This is my use case; I could add a doc if it is necessary.
And I think df.replace could actually solve my problem without modifying dataprep. I will do some tests, thank you.
@Bowen0729 Thanks for the info. May I know how you made dataprep support Spark dataframes with Ray? Did you modify the internal code?
No, I didn't modify the internal code. I just used RayDP (Spark on Ray) [https://github.com/oap-project/raydp] to read a Spark dataframe, then transformed the Spark dataframe into a Dask dataframe with the Ray API. It is simple:
import ray
import raydp  # Spark on Ray
from dataprep.eda import create_report

ray.init()
spark = raydp.init_spark()  # cluster-sizing arguments omitted here
spark_df = spark.sql("")    # the actual query is omitted
ray_df = ray.data.from_spark(spark_df)  # Spark DataFrame -> Ray Dataset
dask_df = ray_df.to_dask()              # Ray Dataset -> Dask DataFrame
create_report(dask_df)
I see. Good to know this use case.
Is it necessary to add the doc for this case?
Yeah, I think it would be nice to have a use case doc. If you would like to contribute, you can add a notebook named use_case.ipynb
in https://github.com/sfu-db/dataprep/tree/develop/docs/source/user_guide/eda, where you can write down this use case :)
@Bowen0729 let me know if you would like to create that doc together. I've got a Spark dataframe I could test that out on, or I can help write functions to support this feature.
Sure. The commit hasn't been merged, is there anything wrong? @jinglinpeng
Hi @Bowen0729 , thanks for the reminder, I've merged the PR. There are some problems in the doc-build workflow, and we're fixing them.
Thanks! And what's your plan? Do you have some good ideas about this feature? I'd love to do it together @datatalking
@Bowen0729 I'm pretty new to the repo and still learning how to do stuff. Is there a list, or should we start one in Discussions (https://github.com/sfu-db/dataprep/discussions), or perhaps Projects (https://github.com/sfu-db/dataprep/projects?type=beta)? We can embrace and expand upon what was already done in the Titanic and house-price use cases.
In some cases, I treat "" or " " as a missing value. Can I define which characters should be treated as missing values?
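As a workaround today (not a built-in dataprep option), empty or whitespace-only strings can be converted to NaN with a regex-based df.replace before profiling. A sketch with a made-up dataframe:

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe containing empty and whitespace-only strings.
df = pd.DataFrame({"a": ["x", "", "  "], "b": ["", "y", "z"]})

# regex=True lets replace match any empty or whitespace-only string
# and turn it into a real NaN before passing the frame to dataprep.
clean = df.replace(r"^\s*$", np.nan, regex=True)
print(int(clean.isna().sum().sum()))  # number of missing cells
```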