hi-primus / optimus

:truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
https://hi-optimus.com
Apache License 2.0

Add ability to create Optimus DF from Spark DF #645

Closed: wessankey closed this issue 5 years ago

wessankey commented 5 years ago

As far as I can tell, functionality to create an Optimus DataFrame from an existing Spark DataFrame is not supported. This would be useful when working with data from a source that Optimus doesn't currently support, but for which there is a custom Spark data source API connector.

issue-label-bot[bot] commented 5 years ago

Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.99. Please mark this comment with :thumbsup: or :thumbsdown: to give our bot feedback!


argenisleon commented 5 years ago

Hi @westonsankey,

An "Optimus Dataframe" really does not exist, is just a Monkey patched Spark Dataframe.

If you already have a custom Spark Session you can do something like.

from pyspark.sql import SparkSession
from optimus import Optimus

# Your custom Spark session
spark = SparkSession.builder.appName('your_name_app').getOrCreate()

# DataFrame from an external source (placeholder: replace with your
# custom data source API connector)
df = spark.read.format("your.custom.datasource").load("path/to/data")

op = Optimus(spark)

# Now the DataFrame has all the Optimus functionality
df.table()

Let me know if this helps.

argenisleon commented 5 years ago

Hi @westonsankey,

Does this work for you?

wessankey commented 5 years ago

@argenisleon - I'm a bit confused by the snippet you posted: just by passing the SparkSession into the Optimus constructor, will the functionality be added to DataFrames created using that session?

argenisleon commented 5 years ago

Yes.

When you initialize Optimus with op = Optimus(spark), it dynamically attaches the Optimus functions to the Spark DataFrame class (a monkey patch) and uses the session you created to run the Optimus operations.
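Roughly, the pattern looks like this (a minimal sketch, not Optimus's actual code; the table helper below is a hypothetical stand-in, and spark is the session from the earlier snippet):

from pyspark.sql import DataFrame

def table(self, limit=10):
    # Hypothetical stand-in: show the first `limit` rows
    self.show(limit)

# Attaching the function to the DataFrame class itself...
DataFrame.table = table

# ...makes it available on every DataFrame, old and new
spark.range(5).table()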

What do you want to use from Optimus? Maybe we can improve the docs to make it easy for your use case.

wessankey commented 5 years ago

I was looking into the data cleansing functionality. I'd like to broadly apply some cleansing rules across multiple DataFrames, as sketched below. The documentation for that functionality is fine, but my use case requires a custom data source API implementation, so I couldn't use the built-in loaders for JSON, Parquet, etc.
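Something along these lines, for example (a sketch assuming the Optimus 2.x cols accessors such as cols.trim and cols.lower; the DataFrame names are hypothetical):

# Apply one set of cleansing rules to any number of DataFrames
def cleanse(df):
    # Trim whitespace and lowercase every column
    return df.cols.trim("*").cols.lower("*")

cleaned = [cleanse(df) for df in (orders_df, customers_df)]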

argenisleon commented 5 years ago

@westonsankey maybe this could help https://github.com/sourav-mazumder/Data-Science-Extensions/tree/master/spark-datasource-rest ?
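Reading through a connector like that typically looks as follows (illustrative only; the format name and option below are placeholders, so check the connector's README for the real values):

# Load a DataFrame via a custom data source connector
rest_df = (spark.read
           .format("rest")  # placeholder for the connector's registered name
           .option("url", "https://api.example.com/data")  # hypothetical option
           .load())

# The Optimus-patched methods work on it like any other DataFrame
rest_df.table()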

argenisleon commented 5 years ago

@westonsankey Thinking more deeply about your issue, we could borrow some code from the Optimus Enricher to connect to any API. Can you elaborate (in pseudocode) on how you think it would work?

wessankey commented 5 years ago

@argenisleon - I'm not pulling the data from an API; it's coming from a file.