brendalf opened this issue 2 years ago
@brendalf On your specific error, can you try returning a dictionary from your function and constructing the SparkPlan object inside _save from that dictionary?
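A minimal sketch of that suggestion, assuming SparkPlan is a small wrapper around the dataframe plus the runtime save options; the node name, parameter keys, and column name are illustrative, not from the original issue:

```python
from pyspark.sql import DataFrame


def make_report(df: DataFrame, params: dict) -> dict:
    """Hypothetical node: return a plain dict instead of a SparkPlan instance."""
    condition = f"date >= '{params['date_start']}' AND date <= '{params['date_end']}'"
    return {"data": df, "replace_where": condition}


# ...and inside the custom dataset, the wrapper is rebuilt only at save time:
#
#     def _save(self, data: dict) -> None:
#         plan = SparkPlan(**data)
#         # write plan.data to the Delta table using plan.replace_where
```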
Questions:
- Is there a way to provide runtime values to the dataset together with the data? Could I put these values in the context and retrieve them inside the custom dataset?
- I saw a method to load the current Kedro context, but that method has been removed.
Since Kedro tries to abstract data saving/loading from logic, I don't think this is directly supported. Off the top of my head, what you could do is return these runtime values from nodes, either explicitly or by using hooks to pass that extra output.
Hi @deepyaman. It's working now, both when returning a SparkPlan and when returning a dict. I realized that I had a typo in the catalog dataset name. Thanks.
Since Kedro tries to abstract data saving/loading from logic, I don't think this is directly supported. Off the top of my head, what you could do is return these runtime values from nodes, either explicitly or by using hooks to pass that extra output.
Can you provide a short example?
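A short sketch of the "return the runtime values explicitly" idea: the node emits the dataframe and the computed values as two separate outputs, and the extra output becomes an ordinary catalog entry that downstream nodes (or hooks) can consume. All dataset and parameter names here are assumptions:

```python
from typing import Dict, Tuple

from kedro.pipeline import Pipeline, node
from pyspark.sql import DataFrame


def build_report(df: DataFrame, params: Dict) -> Tuple[DataFrame, Dict]:
    # Return the runtime-computed values alongside the dataframe.
    dates = {"date_start": params["date_start"], "date_end": params["date_end"]}
    return df, dates


report_pipeline = Pipeline(
    [
        node(
            build_report,
            inputs=["input_table", "parameters"],
            outputs=["report", "report_dates"],  # two catalog entries
        )
    ]
)
```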
Are you running this with ParallelRunner? That's a common issue here.
No, I'm not.
Although I solved the issue by wrapping the data and the parameters inside a class, I think it would be good to have this feature handled by Kedro in the future. Thanks for the support. Should I close this?
Hi @brendalf I've just realised this is possibly resolved by tweaking the copy_mode of the memory dataset when it is passed into the next node: https://kedro.readthedocs.io/en/latest/_modules/kedro/io/memory_dataset.html
kedro.io.core.DataSetError: Failed while saving data to data set MemoryDataSet().
cannot pickle '_thread.RLock' object
These errors almost always come from serialization. I think we had a similar issue with TensorFlow objects; the quick solution is the copy_mode that Joel mentioned above.
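For reference, the workaround is to declare the intermediate dataset explicitly with copy_mode: assign (in catalog.yml, a MemoryDataSet entry with copy_mode: assign), so the object is handed to the next node as-is instead of being deep-copied/pickled. A minimal Python sketch with an illustrative dataset name:

```python
from kedro.io import DataCatalog, MemoryDataSet

# "assign" passes the object through untouched, so non-picklable objects
# (Spark query plans, TensorFlow models, thread locks) don't get deep-copied.
catalog = DataCatalog({"intermediate_report": MemoryDataSet(copy_mode="assign")})
```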
@noklam do you think we could catch this pickling error and recommend the solution? It's a hard one to debug for users in this situation.
Hi @datajoely. In my case I didn't want to save to a MemoryDataset; that was happening because I had a typo between the data catalog entry and the name I actually wrote as the node's output. I think the problem happened because the memory dataset tried to serialize a Spark dataframe object.
Sorry - MemoryDataSet is used to dynamically pass data between nodes automatically. If you look at the implementation, we automatically do this for native Spark dataframes. So you can do this by explicitly declaring MemoryDataSets in the catalog.
I also think that if you were to subclass our spark.SparkDataSet or spark.DeltaTableDataSet you would benefit from this too.
Do you think it would be nice to have, in the future, a way to send runtime-calculated values as extra parameters to the dataset? For now, I solved it by wrapping the values and the data inside a class that my custom dataset accepts for saving. If not, we can close this issue.
@brendalf Could you provide an example of that?
@brendalf or perhaps: why can't you just return runtime data as inputs to another node? Does it need to be in the DataSet implementation?
My custom dataset needs to receive two things: the Spark dataframe to write and the replace-where query that is calculated at runtime.
Example:
I need to replace the data inside the Delta table for a specific set of dates. I have date_start, date_end, and lookback parameters defined inside parameters.yml, and inside the node I actually load data from date_start - lookback to date_end, so I need to replace the same dates in the output table.
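To make that concrete, a hedged sketch of the date arithmetic and the replace-where condition involved; the column name date and the parameter values are assumptions:

```python
from datetime import date, timedelta

# Illustrative values, as they might appear in parameters.yml.
params = {"date_start": date(2022, 7, 1), "date_end": date(2022, 7, 31), "lookback": 7}

# The node reads from date_start - lookback ...
load_from = params["date_start"] - timedelta(days=params["lookback"])

# ... but only the [date_start, date_end] window must be overwritten in the output.
replace_where = (
    f"date >= '{params['date_start'].isoformat()}' "
    f"AND date <= '{params['date_end'].isoformat()}'"
)

# The write the custom dataset ultimately has to issue (Delta Lake API):
#   df.write.format("delta").mode("overwrite") \
#     .option("replaceWhere", replace_where).save(path)
```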
I thought about three solutions to solve this:
I actually solved the problem with the first approach, but it's problematic, since now when I want to join nodes together, nodes downstream won't receive the lazily evaluated Spark plan anymore, but an instance of this class.
I couldn't find how to implement the second approach. Maybe Kedro could automatically send the context to the dataset as kwargs?
The problem with the third one is that I want to keep using the data catalog.
Hello folks, any news here?
I think this option is the most common amongst the community:
Create a class that accepts the data and the replace-where query, so the node can send everything I need to the custom dataset, which accepts this class instead of just the Spark dataframe.
In Kedro the nodes should be pure Python functions with no knowledge of I/O, so you should never have a context available there.
I can't use the parameters inside the save_args key for the custom dataset because the replace values are also calculated during execution depending on other pipeline parameters, like DATE_START and LOOKBACK.
The question of dynamic datasets like these has come up recently in some user conversations. We haven't started thinking about how to do it yet.
Description
Hello there. I created a custom dataset to handle our Spark Delta Tables. The problem is that the custom dataset needs a replace-where string defining what partition should be overwritten after the data is generated inside the node. Catalog definition:
I can't use the parameters inside the save_args key for the custom dataset because the replace values are also calculated during execution, depending on other pipeline parameters like DATE_START and LOOKBACK. I tried to create a class to be the interface between the nodes and the custom dataset; this class holds the Spark dataframe and the extra values, but Kedro fails when trying to pickle it.
Node return:
Custom Dataset save method:
Error received:
Questions:
Edit 1 - 2022-07-25:
The error above was happening because I typed the wrong dataset name in the node outputs, so Kedro tried to save it as a MemoryDataset. I solved the problem of sending extra parameters by using this SparkPlan wrapper around every save and load in my custom dataset.
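For readers landing here, a minimal sketch of what that final setup could look like, assembled from the thread rather than taken from the author's code; apart from SparkPlan, every name is an assumption:

```python
from dataclasses import dataclass

from kedro.io import AbstractDataSet
from pyspark.sql import DataFrame, SparkSession


@dataclass
class SparkPlan:
    """Wrapper passed between nodes and the dataset: dataframe plus runtime save options."""
    data: DataFrame
    replace_where: str = ""


class DeltaReplaceWhereDataSet(AbstractDataSet):
    """Sketch of a custom dataset that loads and saves SparkPlan objects."""

    def __init__(self, filepath: str):
        self._filepath = filepath

    def _load(self) -> SparkPlan:
        spark = SparkSession.builder.getOrCreate()
        return SparkPlan(data=spark.read.format("delta").load(self._filepath))

    def _save(self, data: SparkPlan) -> None:
        writer = data.data.write.format("delta").mode("overwrite")
        if data.replace_where:
            # Only the partitions matching the runtime condition are replaced.
            writer = writer.option("replaceWhere", data.replace_where)
        writer.save(self._filepath)

    def _describe(self) -> dict:
        return {"filepath": self._filepath}
```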