Issue-Label Bot is automatically applying the label feature_request to this issue, with a confidence of 0.99.
Hi @westonsankey,
An "Optimus Dataframe" really does not exist, is just a Monkey patched Spark Dataframe.
If you already have a custom Spark session, you can do something like this:
from pyspark.sql import SparkSession
from optimus import Optimus

# Your custom Spark session
spark = SparkSession.builder.appName('your_name_app').getOrCreate()
...
# DataFrame from an external source (the format name is a placeholder for
# however your external/custom data source exposes the data)
df = spark.read.format("your.custom.datasource").load("path/to/data")

# Initializing Optimus with your session attaches the Optimus methods
op = Optimus(spark)

# Now the DataFrame has all the Optimus functionality
df.table()
Let me know if this helps.
Hi @westonsankey,
Does this work for you?
@argenisleon - I'm a bit confused by the snippet you posted - just by passing the SparkSession into the Optimus constructor, does that add the functionality to DataFrames created using that session?
Yes.
When you initialize Optimus with op = Optimus(spark), it attaches the Optimus functions dynamically (monkey patching) to the Spark DataFrame and uses your existing session to run the Optimus operations.
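Conceptually it is just something like this (a simplified sketch of the monkey-patching idea, not the actual Optimus code; the table helper here is only for illustration):
from pyspark.sql import DataFrame

def table(self, limit=10):
    # Illustrative helper: show the first rows of the DataFrame
    self.show(limit)

# Attaching the function to the DataFrame class means every DataFrame
# created from your session picks it up as df.table()
DataFrame.table = table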
What do you want to use from Optimus? Maybe we can improve the docs to make it easier for your use case.
I was looking into the data cleansing functionality. I'd like to apply a common set of cleansing rules across multiple DataFrames. The documentation for this functionality is fine, but my use case requires a custom data source API implementation, so I can't use the built-in functions for JSON, Parquet, etc.
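Roughly, this is the shape of what I want to do (just a sketch; the DataFrame names are placeholders, and I'm assuming the df.cols column methods are available once Optimus(spark) has been initialized):
def cleanse(df):
    # Shared cleansing rules applied to every DataFrame
    return df.cols.trim("*").cols.lower("*")

# df_orders and df_customers stand in for DataFrames loaded through the
# custom data source connector
cleaned = [cleanse(df) for df in (df_orders, df_customers)]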
@westonsankey maybe this could help https://github.com/sourav-mazumder/Data-Science-Extensions/tree/master/spark-datasource-rest ?
@westonsankey Thinking more about your issue, we could borrow some code from the Optimus Enricher to connect to any API. Can you elaborate (with pseudocode) on how you think it would work?
@argenisleon - I'm not pulling the data from an API; it's coming from a file.
As far as I can tell, functionality to create an Optimus DataFrame from an existing Spark DataFrame is not supported. This would be useful when working with data from a source that Optimus doesn't currently support, but for which there is a custom Spark data source API connector.
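Concretely, the flow I'd expect is something like this (a sketch; com.example.customsource is a placeholder connector name, and the cleansing call assumes the monkey-patched df.cols API described above):
from pyspark.sql import SparkSession
from optimus import Optimus

spark = SparkSession.builder.appName("custom_source_cleansing").getOrCreate()
op = Optimus(spark)

# DataFrame loaded through a custom Spark data source API connector
df = spark.read.format("com.example.customsource").load("path/to/input")

# Apply Optimus cleansing directly to the existing Spark DataFrame
df.cols.trim("*").table()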