nodestream-proj / nodestream

A Fast, Declarative, and Extensible ETL Framework for Graph Databases.
https://nodestream-proj.github.io/docs/
Apache License 2.0

Conditional Merges [REQUEST] #97

Open angelosantos4 opened 10 months ago

angelosantos4 commented 10 months ago

Is your feature request related to a problem? Please describe. When we ingest data from different sources, we can run into the issue where one source is old and another is recent, and they carry different values for the same properties. The current implementation builds a merge query that simply overwrites the properties on matched nodes based on ingestion order, so whichever record is ingested last wins. I would like a way to conditionally update properties on a node.

Describe the solution you'd like When I create an interpretation within my pipeline, I would like to declare the following:

merge_condition: latest # Conflict-resolution strategy; 'latest' means the greater value wins. default=None
condition_key: date_created # Node property we compare against.
condition_value: !!python/jmespath date_created # The value pulled from the incoming record.

This would then modify the merge query, which currently performs the following for source nodes:

MERGE (node:$node_type {key: $key})
ON CREATE
    SET node.param = param.value
ON MATCH
    SET node.param = param.value

I would like it to create the following:

MERGE (node:$node_type {key: $key})
ON CREATE
    SET node.param = param.value
ON MATCH
    SET node.condition = CASE WHEN $condition THEN true ELSE false END // We need a variable for the condition in some way
    SET node.param = CASE WHEN node.condition THEN param.value ELSE node.param END
    // Find a way to unset node.condition

Where the condition in our case would be (date_created > node.date_created), i.e. the incoming record's date_created is newer than the value already stored on the node.
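
For illustration, here is a minimal sketch of how the ON MATCH branch could express that condition inline with CASE expressions, which would avoid the temporary node.condition property altogether. The label, property names, and the $key / $param / $date_created parameters are placeholders for this sketch, not what nodestream generates today:

MERGE (node:NodeType {key: $key})
ON CREATE
    SET node.param = $param,
        node.date_created = $date_created
ON MATCH
    // Only overwrite when the incoming record is newer than the stored value.
    // Null handling (e.g. existing nodes without date_created) is omitted for brevity.
    SET node.param = CASE WHEN $date_created > node.date_created THEN $param ELSE node.param END,
        node.date_created = CASE WHEN $date_created > node.date_created THEN $date_created ELSE node.date_created END

Deciding what should win when the stored date_created is missing is exactly the kind of policy a merge_condition option would need to define.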

Describe alternatives you've considered The alternative I can use to ensure the recency of my data is to schedule my pipelines so that the recent data always comes in after the old data. Alternatively, I can add an interpreter to my pipeline that calls the database to fetch the stored value and then conditionally writes back, but this takes too long.
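
Roughly, that second alternative amounts to an extra round trip per record, with the comparison happening in the pipeline rather than in the database (again with placeholder label and property names):

// First query: read back the stored timestamp for the incoming key.
MATCH (node:NodeType {key: $key})
RETURN node.date_created AS stored_date_created

// Second query, issued only if the pipeline decides the incoming record is newer.
MATCH (node:NodeType {key: $key})
SET node.param = $param,
    node.date_created = $date_created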


zprobst commented 9 months ago

I agree that this could be handy. There are question marks around whether this is required in an ETL framework, but I'm definitely willing to take PRs on this.

One major challenge is going to be retaining the abstraction over the underlying graph databases.