pbilling opened 1 year ago
I think a problem now is that the message classes were overfit to the trigger use case and no longer fit the use case of launching jobs. A better approach might be to simplify the message classes and move more of the message-parsing logic into the trigger function.
One of the challenges is that the trigger function is designed to respond to either a node or a relationship, but Neo4j returns results for relationships by returning both the nodes and the relationships, so that the relationships can just reference the nodes.
The current `QueryResponseWriter` methods for sharing results are (a) sharing all the nodes in one message or (b) sharing one relationship per message. So yeah, one thing I could do is increase the complexity of the trigger logic.
The `DatabaseTrigger` class has an `aggregate_results` boolean attribute. If true, `db-query` will return all nodes returned by the query in a single message. If false, it will call the `QueryResponse.generate_separate_entity_jsons()` method.
One issue here is the vagueness of the term `aggregate_results`. One interpretation is that it packages all nodes and relationships into a single message, but that's not the case: it packages all the nodes into a single message but omits any relationships. I know this because the method is called `return_json_with_all_nodes()` (and I looked at the code), so at least I named something decently. The `generate_separate_entity_jsons()` method will return either each relationship (with node information) in a separate message OR each node in a separate message. Again, it only returns one kind of entity (node or relationship), based on the value of the `DatabaseQuery.pattern` attribute (`'node'` or `'relationship'`).
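Pulled together, the current branching looks roughly like this (a sketch from memory; `publish` is a stand-in for whatever actually pushes messages to Pub/Sub):

```python
# Rough sketch of the current db-query result-publishing logic.
# `writer` is a QueryResponseWriter, `query` carries the query/trigger
# config, and `publish` is a stand-in for the Pub/Sub publishing call.
def publish_query_results(writer, query, publish):
    if query.aggregate_results:
        # All nodes in one message; relationships are silently dropped.
        publish(writer.return_json_with_all_nodes())
    else:
        # One message per entity, where DatabaseQuery.pattern
        # ('node' or 'relationship') picks which kind is sent.
        for message in writer.generate_separate_entity_jsons():
            publish(message)
```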
```mermaid
classDiagram
    class QueryResponse{
        +list nodes
        +list relationships
    }
    class QueryResponseWriter{
        +str message_kind
        +str query_name
        +dict result_summary
        +neo4j.Graph graph
        +list nodes
        +list relationships
        +list supported_patterns
        +str job_request
        +return_json_with_all_nodes()
        +return_json_with_all_entities()
        +generate_separate_entity_jsons()
        +_get_result_summary_dict()
        +_get_relationship_dict()
        +_get_node_dict()
    }
```
Instead of writing classes to wrap the `neo4j.Graph` and `neo4j.ResultSummary` outputs, the classes should just be able to translate them to structured text and then reconstitute them.
Use cases:
```mermaid
flowchart TD
    neo4j(neo4j) --> graf[neo4j.Graph]
    neo4j --> summary[neo4j.ResultSummary]
    graf --> writer[QueryResponseWriter]
    summary --> writer
    writer --> json[JSON]
    json --> reader[QueryResponseReader]
    reader --> graf2[neo4j.Graph]
    reader --> summary2[neo4j.ResultSummary]
```
The easy part will be converting Neo4j objects to JSON; the hard part will be converting back to Neo4j objects. The Neo4j Python driver uses "Hydrator" classes to create instances of the Node, Relationship, and Graph classes.
Example of using hydrators to create a node and relationship: https://github.com/neo4j/neo4j-python-driver/blob/5.0/tests/unit/common/codec/hydration/v2/test_graph_hydration.py.
I feel like it's more intuitive to interact with the v4.4 hydrator: https://github.com/neo4j/neo4j-python-driver/blob/4.4/neo4j/graph/__init__.py.
Maybe not though, it just looks more complicated in v5.0: https://github.com/neo4j/neo4j-python-driver/blob/a8a2b04da36662f0e8d3a0a2c3a9af6e7b9d2f2a/src/neo4j/_codec/hydration/v1/hydration_handler.py#L57.
Hydration tests in v5.0: https://github.com/neo4j/neo4j-python-driver/blob/5.0/tests/unit/common/codec/hydration/v2/test_graph_hydration.py.
The v2 `TestGraphHydration` imports the v1 `TestGraphHydration`, which imports `HydrationHandlerTestBase`, which is where the `hydration_scope` pytest fixture comes from.
And this is where you can see where `hydration_scope` comes from: https://github.com/neo4j/neo4j-python-driver/blob/a8a2b04da36662f0e8d3a0a2c3a9af6e7b9d2f2a/tests/unit/common/codec/hydration/_common.py#L45.
The `HydrationScope` definition: https://github.com/neo4j/neo4j-python-driver/blob/5.0/src/neo4j/_codec/hydration/_common.py.
Here is where the graph hydration functions are defined: https://github.com/neo4j/neo4j-python-driver/blob/a8a2b04da36662f0e8d3a0a2c3a9af6e7b9d2f2a/src/neo4j/_codec/hydration/v1/hydration_handler.py#L60.
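For reference, creating a node and relationship with the 4.4-style hydrator looks something like this (a sketch based on my reading of the 4.4 `neo4j/graph/__init__.py`; exact signatures may differ):

```python
from neo4j.graph import Graph

# Build a local Graph and hydrate entities into it.
graph = Graph()
hydrator = Graph.Hydrator(graph)

# hydrate_node(id, labels, properties)
alice = hydrator.hydrate_node(1, {"Person"}, {"name": "Alice"})
bob = hydrator.hydrate_node(2, {"Person"}, {"name": "Bob"})

# hydrate_relationship(id, start_node_id, end_node_id, type, properties)
knows = hydrator.hydrate_relationship(3, 1, 2, "KNOWS", {"since": 2020})

assert len(graph.nodes) == 2
assert len(graph.relationships) == 1
```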
- [x] `translate_graph_to_json()` method, using `GraphHydrator`
- [x] `translate_json_to_graph()` method
I can use the code I wrote for testing graph->json, with hydrators, for implementing `translate_json_to_graph()`.
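A rough sketch of the pair (hypothetical implementations; the JSON shape is my own invention, and the hydrator usage assumes the 4.4-style API shown above):

```python
import json

from neo4j.graph import Graph


def translate_graph_to_json(graph):
    """Serialize a neo4j.Graph to a JSON string."""
    return json.dumps({
        "nodes": [
            {"id": n.id, "labels": list(n.labels), "properties": dict(n)}
            for n in graph.nodes
        ],
        "relationships": [
            {
                "id": r.id,
                "start_node_id": r.start_node.id,
                "end_node_id": r.end_node.id,
                "type": r.type,
                "properties": dict(r),
            }
            for r in graph.relationships
        ],
    })


def translate_json_to_graph(json_str):
    """Reconstitute a neo4j.Graph from the JSON produced above."""
    data = json.loads(json_str)
    graph = Graph()
    hydrator = Graph.Hydrator(graph)
    for n in data["nodes"]:
        hydrator.hydrate_node(n["id"], set(n["labels"]), n["properties"])
    for r in data["relationships"]:
        hydrator.hydrate_relationship(
            r["id"], r["start_node_id"], r["end_node_id"],
            r["type"], r["properties"])
    return graph
```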
Now that I have functions to translate Graph to JSON and back, I still need to refactor the message classes.
There are now only two (2) types of Trellis messages:

- `QueryResponse(sender, seed_id, previous_event_id, query_name, result_summary_json, graph_json, job_request)`
- `QueryRequest(sender, seed_id, previous_event_id, query_name, query_parameters, custom, cypher, write_transaction, aggregate_results, publish_to, returns)`
Each kind of message had its own pair of reader/writer classes because the Trellis message classes were designed to take input directly from the systems providing the data: the Neo4j and Pub/Sub drivers. Two issues with this were the tight coupling to third-party systems and the doubled number of classes (a reader and a writer for each message kind).
A general design principle I am drawing from this is to keep at least one degree of separation between my systems and systems owned by other organizations. Instead of having my classes take their inputs directly, I can write helper functions to translate data from other systems into structures that fit my classes. When I want to expand Trellis to work on other platforms or with other databases, I can still (potentially) use the same classes and just write new helper functions. This should also simplify the Trellis class structure, because instead of having separate reader/writer classes (e.g. `QueryResponseReader` and `QueryResponseWriter`) I can just use the same one consistently (e.g. `QueryResponse`).
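As a hypothetical sketch, the unified class would only accept system-agnostic inputs (attribute names follow the message signature above):

```python
import json


class QueryResponse:
    """One class for both writing and reading query-response messages."""

    def __init__(self, sender, seed_id, previous_event_id, query_name,
                 result_summary_json, graph_json, job_request=None):
        self.sender = sender
        self.seed_id = seed_id
        self.previous_event_id = previous_event_id
        self.query_name = query_name
        self.result_summary_json = result_summary_json
        self.graph_json = graph_json
        self.job_request = job_request

    def to_json(self):
        # Everything is already JSON-serializable; no neo4j objects here.
        return json.dumps(vars(self))

    @classmethod
    def from_json(cls, message_json):
        return cls(**json.loads(message_json))
```

Publishing becomes `response.to_json()` and receiving becomes `QueryResponse.from_json(message)`; neither side touches a neo4j object directly.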
This approach also supports a more intuitive mental model. A thing that kept tripping me up when thinking about Trellis messages is that none of the classes actually represented the message; they just represented the act of writing or reading the message. Which doesn't really fit an "object-oriented" approach in my mind, since the programming objects I was creating represented actions/verbs rather than objects/nouns.
Update: A caveat to this is when using industry standards. Using an existing standard is probably better than inventing a new approach. I thought of this when looking at how Cloud Functions use CloudEvents.
I just argued that, as a design principle, my code should be abstracted away from third-party systems. For instance, not passing neo4j class instances directly as arguments to initialize my class instances; instead, first converting them to a system-agnostic format (JSON) and then loading them. But then, should I still write my functions to interact with neo4j class instances?
Right now, the model I'm envisioning is that Neo4j query results will be translated to JSON for transmission from one serverless function to another. When they are received, they will be reconstituted into neo4j class instances and that's how the function will interact with them.
My thinking with doing it this way is that consistently modelling graph data the same way will make it easier for developers to understand and mentally model. Also, they can just reference the neo4j docs to understand the data structure. A potential drawback is that we are tying our system to another system, even though functions other than `db-query` will not interact with the database, so using the neo4j structures is purely for consistency of form rather than function. But if we don't keep updating our code to be consistent with the latest version of the neo4j driver, it could also cause confusion if the data structures change. We may also find that a different structure would better fit our use cases.
And if we move to a different database, it might be weird and confusing to use the neo4j library to model data coming from a different source. I think for now there's no reason I need to translate JSON graphs back into neo4j structures, so downstream functions could just load the JSON into a dictionary or use the `translate_json_to_graph()` function to translate back to neo4j, if necessary. I liked the idea of completing the circle of graph -> JSON -> graph, but that's mostly just useful for testing and validation. Not sure if there is a further use case.
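So a downstream function could pick whichever view it needs (a hypothetical sketch; the `Ubam` label and helper names are just illustrations):

```python
import json


def handle_query_response(response):
    # Option 1: treat the graph as plain data.
    graph_data = json.loads(response.graph_json)
    ubams = [n for n in graph_data["nodes"] if "Ubam" in n["labels"]]

    # Option 2: reconstitute neo4j objects, only where actually needed.
    # graph = translate_json_to_graph(response.graph_json)
    return ubams
```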
```mermaid
flowchart TD
    neo4j(neo4j) --> graf[neo4j.Graph]
    neo4j --> summary[neo4j.ResultSummary]
    graf -- trellis.translate_graph_to_json --> jsonGraph[json_graph]
    summary -- trellis.translate_summary_to_json --> jsonSummary[json_summary]
    jsonGraph --> qr1[trellis.QueryResponse]
    jsonSummary --> qr1
    qr1 -- Pub/Sub --> qr2[trellis.QueryResponse]
    qr2 -- json.loads --> Dict
    qr2 -- trellis.translate_json_to_graph --> neo4j.Graph
```
Added a placeholder test for `translate_result_summary_to_json()` because I couldn't find much testing information and it's non-essential (https://github.com/neo4j/neo4j-python-driver/blob/4.4/tests/unit/work/test_result.py#LL340C2-L340C2).
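The placeholder is just a skipped test (assuming pytest):

```python
import pytest


@pytest.mark.skip(reason="neo4j.ResultSummary is hard to construct "
                         "outside a live session; non-essential for now")
def test_translate_result_summary_to_json():
    # TODO: mock or hydrate a ResultSummary, translate it to JSON,
    # and check that the summary fields round-trip.
    pass
```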
```mermaid
graph TD
    gcs[Cloud Storage] -- CloudEvent --> create-blob-node
    create-blob-node -- trellis.QueryRequest --> db-query
    db-query -- trellis.QueryResponse --> check-triggers
    check-triggers -- trellis.QueryRequest --> db-query
    db-query -- trellis.QueryResponse --> job-launcher
    job-launcher -- trellis.QueryRequest --> db-query
    job-launcher -- dsubCommand --> api[Life Sciences API]
    api -- dataObject --> gcs
```
In some cases, after running a query, I will want to aggregate all query results and send them to the same place. For instance, to launch GATK variant calling, I want to find all the Ubams for a sample and send them all to the job launcher as a group, because they will all be used as input to the same job.
In other cases, I'll want to split results. In the case where I want to run Vcfstats on all GVCFs, I'll want to run a query to find all GVCFs without Vcfstats data and then split the results to send each GVCF to a separate job-launcher instance to initiate a Vcfstats job.
My initial conceptualization for how to do this was to send the entire result to the `QueryResponseWriter` class and then have it either generate a message with the aggregate result or split the results into a series of messages. However, this required transforming the Graph structure into a dictionary and then splitting either nodes OR relationships. It also slightly changed the graph structure.

In a `neo4j.Graph` instance, each node is represented by an instance of the `neo4j.Node` class with all its defining attributes. The `neo4j.Graph` relationships include the relationship type, properties, and references to the nodes. So if you remove the nodes from the `neo4j.Graph`, the relationships in the graph just contain empty references. With my initial model, when splitting results, you could only deliver nodes or relationships, so the node definitions would be loaded directly into the relationship dict. This way, each relationship becomes an atomic structure, rather than being dependent on referencing other instances. The problem with this is that the data in the dictionary no longer matches the `neo4j.Graph` structure, breaking the loop of Graph -> JSON -> Graph. Because it doesn't fit a Graph structure, you can't use the `neo4j.Graph` class to validate the graph data. To perform validation I would have to write separate test functions and then run them every time, etc. It sucks.
Instead of getting all query results from the Neo4j driver session and then implementing custom logic to do aggregation or splitting, I'll let the Neo4j driver handle it. Depending on whether the `trellis.DatabaseQuery.aggregate_results` attribute is true or false, I'll get the aggregated results as a `neo4j.Graph` or split the results into individual `neo4j.Record` objects.
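In driver terms the branch is small; `neo4j.Result.graph()` collects every node and relationship returned by the query (a sketch; `run_query` is a hypothetical wrapper):

```python
def run_query(driver, cypher, parameters, aggregate_results):
    with driver.session() as session:
        result = session.run(cypher, parameters)
        if aggregate_results:
            # One neo4j.Graph containing all nodes and relationships.
            return result.graph()
        # One neo4j.Record per result row, for per-entity messages.
        return list(result)
```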
The `neo4j.Graph` has node and relationship instances, mirroring the graph. What do record instances look like? Will they represent nodes and relationships? Will they fit the rest of the Trellis model?
In order to write repeatable tests, I want to be able to hydrate records. The `neo4j` test class `Records` includes a method to hydrate records: https://github.com/neo4j/neo4j-python-driver/blob/a8a2b04da36662f0e8d3a0a2c3a9af6e7b9d2f2a/tests/unit/sync/work/test_result.py#L58. It looks like it just uses stock hydration methods, so I'm wondering if I can just pass hydration strings for nodes and relationships and it will turn them into records.
Also, I need to provide fields and records to initialize it. I think fields are metadata fields which are disconnected from a node or relationship. For instance, if I ran a query that looked like `MATCH (f:Fastq) RETURN f.size AS size, f`, it would return `size` as an integer field and `f` as a node. I don't plan on handling fields, so I think I can just provide an empty list.
Example statement for initializing a Records object: `records = Records(["x"], records)`.
The `Records` test class doesn't actually generate `neo4j.Record` objects; it just generates tuples with the record values. It looks like records are generated by the `neo4j.Result` class.
```mermaid
graph TD
    neo4j[(Neo4j database)] -- aggregate --> grf(neo4j.Graph)
    neo4j -- split --> rec1(neo4j.Record)
    grf -- trellis.graphToJson --> grfjson[Graph JSON]
    rec1 -- trellis.recordToJson --> grfjson
    grfjson --> response(trellis.QueryResponse)
    response -- trellis.QueryResponse.convert_to_json --> respjson[Query response JSON]
    respjson --> pubsub{Cloud Pub/Sub}
    pubsub --> respjson2[Query response JSON]
    respjson2 --> response2(trellis.QueryResponse)
    response2 -- jsonToGraph --> grf2(neo4j.Graph)
```
In that model, the graph and query response types are represented multiple times to indicate that instances of those types exist in separate serverless functions, communicating via Cloud Pub/Sub.
If we only consider unique types, we can see a loop between `neo4j.Graph` and Pub/Sub. And if we expanded this diagram to include all Trellis operations, we would see that eventually the downstream results of `neo4j.Graph` objects are fed back into the Neo4j database, completing all three loops.
```mermaid
graph TD
    neo4j[(Neo4j database)] -- aggregate --> grf(neo4j.Graph)
    neo4j -- split --> rec1(neo4j.Record)
    grf -- trellis.graphToJson --> grfjson[Graph JSON]
    rec1 -- trellis.recordToJson --> grfjson
    grfjson -- __init__ --> response(trellis.QueryResponse)
    response -- trellis.QueryResponse.convert_to_json --> respjson[Query response JSON]
    respjson --> pubsub{Cloud Pub/Sub}
    pubsub --> respjson
    respjson -- json.loads --> response
    response -- trellis.QueryResponse.graph --> grfjson
    grfjson -- trellis.jsonToGraph --> grf
```
In these diagrams, everything that begins with a lowercase letter describes a programming type or method.
I need to refactor the message classes because they don't make sense. There is no `QueryResponse` class, only Reader and Writer classes. But the thing I want to interact with is the response.