va-big-data-genomics / trellisdata

Python package with classes and methods used in implementing a Trellis data management system.
MIT License

Refactor message classes to be more intuitive #6

Open pbilling opened 1 year ago

pbilling commented 1 year ago

I need to refactor the message classes because they don't make sense. There is no QueryResponse class, only Reader and Writer classes. But the thing I want to interact with is the response.

pbilling commented 1 year ago

I think the problem is that the message classes were overfit to the trigger use case and now don't fit the use case of launching jobs. A better approach might be to simplify the message classes and put more message-parsing logic in the trigger function.

pbilling commented 1 year ago

One of the challenges is that the trigger function is designed to respond to either a node or a relationship, but Neo4j returns results for relationships by returning both the nodes and the relationships, so that the relationships can just reference the nodes.

The current QueryResponseWriter methods for sharing results are (a) sharing all the nodes or (b) sharing one relationship per message. So, one thing I could do would be to increase the complexity of the trigger logic.

pbilling commented 1 year ago

Understanding the 'aggregate_results' field

The DatabaseTrigger class has an aggregate_results boolean attribute. If true, db-query will return all nodes returned by the query, in a single message. If false, it will call the QueryResponse.generate_separate_entity_jsons() method.

One issue here is the vagueness of the term aggregate_results. One interpretation is that it packages all nodes and relationships into a single message, but that's not the case: it packages all the nodes into a single message but omits any relationships. I know this because the method is called return_json_with_all_nodes() (and I looked at the code), so at least I named something decently. The generate_separate_entity_jsons() method will return either each relationship (with node information) in a separate message OR each node in a separate message. Again, it only returns one kind of entity (node or relationship), based on the value of the DatabaseQuery.pattern attribute ('node' or 'relationship').
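To make the two code paths concrete, here is a minimal sketch of the branching described above. WriterStub is a stand-in, not the real QueryResponseWriter, and in the real classes the pattern value lives on DatabaseQuery rather than being passed as an argument.

```python
import json

class WriterStub:
    """Stand-in that approximates QueryResponseWriter's two publishing modes."""

    def __init__(self, nodes, relationships):
        self.nodes = nodes
        self.relationships = relationships

    def return_json_with_all_nodes(self):
        # aggregate_results == True: one message with every node,
        # relationships omitted
        return json.dumps({"nodes": self.nodes})

    def generate_separate_entity_jsons(self, pattern):
        # aggregate_results == False: one message per entity; only one
        # entity kind is emitted, chosen by the query's pattern value
        entities = self.nodes if pattern == "node" else self.relationships
        for entity in entities:
            yield json.dumps(entity)

def messages_to_publish(writer, aggregate_results, pattern):
    """Sketch of how db-query might branch on aggregate_results."""
    if aggregate_results:
        return [writer.return_json_with_all_nodes()]
    return list(writer.generate_separate_entity_jsons(pattern))
```

With two nodes and one relationship, aggregation yields a single message while splitting on 'node' yields two.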

pbilling commented 1 year ago
```mermaid
classDiagram
    class QueryResponse{
        + nodes (list)
        + relationships (list)
    }
    class QueryResponseWriter{
        +str message_kind
        +str query_name
        +dict result_summary
        +neo4j.Graph graph
        +list nodes
        +list relationships
        +list supported_patterns
        +str job_request
        +return_json_with_all_nodes()
        +RETURN_JSON_WITH_ALL_ENTITIES()
        +generate_separate_entity_jsons()
        +_get_result_summary_dict()
        +_get_relationship_dict()
        +_get_node_dict()
    }
```
pbilling commented 1 year ago

Model "Neo4j"

Instead of writing classes to transform the neo4j.Graph and neo4j.ResultSummary outputs, the classes should just be able to transform them to structured text and then reconstitute them.

Use cases:

```mermaid
flowchart TD
    neo4j(neo4j) --> graf[neo4j.Graph]
    neo4j --> summary[neo4j.ResultSummary]
    graf --> writer[QueryResponseWriter]
    summary --> writer
    writer --> json[JSON]
    json --> reader[QueryResponseReader]
    reader --> graf2[neo4j.Graph]
    reader --> summary2[neo4j.ResultSummary]
```
pbilling commented 1 year ago

Model "neo4j" resources

The easy part will be converting Neo4j objects to JSON; the hard part will be converting back to Neo4j objects. The Neo4j Python driver uses "Hydrator" classes to create instances of the Node, Relationship, and Graph classes.

Example of using hydrators to create a node and relationship: https://github.com/neo4j/neo4j-python-driver/blob/5.0/tests/unit/common/codec/hydration/v2/test_graph_hydration.py.

I feel like it's more intuitive to interact with the v4.4 hydrator: https://github.com/neo4j/neo4j-python-driver/blob/4.4/neo4j/graph/__init__.py.

Maybe not though, it just looks more complicated in v5.0: https://github.com/neo4j/neo4j-python-driver/blob/a8a2b04da36662f0e8d3a0a2c3a9af6e7b9d2f2a/src/neo4j/_codec/hydration/v1/hydration_handler.py#L57.

Hydration tests in v5.0: https://github.com/neo4j/neo4j-python-driver/blob/5.0/tests/unit/common/codec/hydration/v2/test_graph_hydration.py.

The v2 TestGraphHydration imports the v1 TestGraphHydration, which imports HydrationHandlerTestBase, which is where the hydration_scope pytest fixture comes from.

And this is where you can see where hydration_scope comes from: https://github.com/neo4j/neo4j-python-driver/blob/a8a2b04da36662f0e8d3a0a2c3a9af6e7b9d2f2a/tests/unit/common/codec/hydration/_common.py#L45.

The HydrationScope definition: https://github.com/neo4j/neo4j-python-driver/blob/5.0/src/neo4j/_codec/hydration/_common.py.

Here is where the graph hydration functions are defined: https://github.com/neo4j/neo4j-python-driver/blob/a8a2b04da36662f0e8d3a0a2c3a9af6e7b9d2f2a/src/neo4j/_codec/hydration/v1/hydration_handler.py#L60.
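The easy direction can be sketched without touching the hydration internals, using only public neo4j attributes (graph.nodes, graph.relationships, element_id, labels, type, dict-style properties). The Stub* classes here mimic the driver objects so the sketch runs without a database; translate_graph_to_json matches the helper name used elsewhere in this thread, but this body is my approximation, not the real implementation.

```python
import json

class StubNode(dict):
    """Mimics the public neo4j.Node interface used below."""
    def __init__(self, element_id, labels, properties):
        super().__init__(properties)
        self.element_id = element_id
        self.labels = labels

class StubRelationship(dict):
    """Mimics the public neo4j.Relationship interface used below."""
    def __init__(self, element_id, type_, start_node, end_node, properties):
        super().__init__(properties)
        self.element_id = element_id
        self.type = type_
        self.start_node = start_node
        self.end_node = end_node

class StubGraph:
    """Mimics neo4j.Graph's nodes/relationships views."""
    def __init__(self, nodes, relationships):
        self.nodes = nodes
        self.relationships = relationships

def translate_graph_to_json(graph):
    """Serialize a graph to JSON, preserving the neo4j.Graph shape:
    nodes stand alone, relationships reference them by id."""
    return json.dumps({
        "nodes": [
            {"id": n.element_id, "labels": sorted(n.labels), "properties": dict(n)}
            for n in graph.nodes
        ],
        "relationships": [
            {
                "id": r.element_id,
                "type": r.type,
                "start_node": r.start_node.element_id,
                "end_node": r.end_node.element_id,
                "properties": dict(r),
            }
            for r in graph.relationships
        ],
    })
```

Because the JSON mirrors the Graph structure, the hydrators only need to rebuild nodes first and then wire up relationships by id when going the other way.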

pbilling commented 1 year ago

Model "neo4j" roadmap

I can use the code I wrote for testing graph->json, with hydrators, for implementing translate_json_to_graph().

pbilling commented 1 year ago

Now that I have functions to translate Graph to JSON and back, I still need

  1. Function(s) to handle the ResultSummary
  2. To reimagine the model for organizing Trellis messages. Right now query requests are still managed by their own set of classes.

There are now only (2) types of Trellis messages

QueryResponse(sender, seed_id, previous_event_id, query_name, result_summary_json, graph_json, job_request)

QueryRequest(sender, seed_id, previous_event_id, query_name, query_parameters, custom, cypher, write_transaction, aggregate_results, publish_to, returns)
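Sketched as Python dataclasses straight from the attribute lists above; the field types and defaults are my assumptions, not the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class QueryResponse:
    """Carries query results between functions; graph and summary travel as JSON."""
    sender: str
    seed_id: str
    previous_event_id: str
    query_name: str
    result_summary_json: str
    graph_json: str
    job_request: Optional[str] = None

@dataclass
class QueryRequest:
    """Asks db-query to run a named or custom Cypher query."""
    sender: str
    seed_id: str
    previous_event_id: str
    query_name: str
    query_parameters: dict = field(default_factory=dict)
    custom: bool = False
    cypher: Optional[str] = None
    write_transaction: bool = False
    aggregate_results: bool = True
    publish_to: Optional[str] = None
    returns: dict = field(default_factory=dict)
```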

pbilling commented 1 year ago

Use helper functions to manage different input data structures to the same class

Each kind of message had its own pair of reader/writer classes because the Trellis message classes were designed to take input directly from the systems providing the data: the Neo4j and Pub/Sub drivers. Two issues with this were

  1. I needed (2) classes to initialize data from different structures
  2. My classes were directly integrated with data sources from other systems. In order to expand Trellis to support other data providers, such as a different database or a different message broker, I would need to write new classes.

A general design principle I am drawing from this is to keep at least one degree of separation between my systems and systems owned by other organizations. Instead of having my classes take their inputs directly, I can write helper functions to translate data from other systems into structures that fit my classes. When I want to expand Trellis to work on other platforms or with other databases, I can still (potentially) use the same classes and just write new helper functions. This should also simplify the Trellis class structure, because instead of having separate reader/writer classes (e.g. QueryResponseReader and QueryResponseWriter) I can use one class consistently (e.g. QueryResponse).
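A minimal sketch of the pattern: the message class only ever sees plain Python types, and each external system gets its own translation helper. Both helpers and the stripped-down QueryResponse here are illustrative, not the real Trellis API.

```python
import json

class QueryResponse:
    """Backend-agnostic message; only plain Python types cross this boundary."""
    def __init__(self, query_name, graph_json):
        self.query_name = query_name
        self.graph_json = graph_json

def response_from_database(query_name, nodes, relationships):
    """Helper for the database side. It takes already-extracted plain data,
    so swapping Neo4j for another store means swapping helpers, not classes."""
    graph_json = json.dumps({"nodes": nodes, "relationships": relationships})
    return QueryResponse(query_name, graph_json)

def response_from_pubsub(message_data: bytes):
    """Helper for the message-broker side; a different broker would get
    its own helper against the same class."""
    body = json.loads(message_data.decode("utf-8"))
    return QueryResponse(body["query_name"], body["graph_json"])
```

The same QueryResponse instance type comes out of both helpers, which is what lets readers and writers collapse into a single class.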

This approach also supports a more intuitive mental model. A thing that kept tripping me up when thinking about Trellis messages is that none of the classes actually represented the message; they just represented the act of writing or reading the message. That doesn't really fit an "object-oriented" approach in my mind, since the programming objects I was creating represented actions/verbs rather than objects/nouns.

Update: Caveat to this is when using industry standards. Using an existing standard is probably better than inventing a new approach. Thought of this when looking at how cloud functions use CloudEvents.

pbilling commented 1 year ago

Should I use neo4j classes to model query results?

I just argued that, as a design principle, my code should be abstracted away from third-party systems. For instance, not passing neo4j class instances directly as arguments to initialize my class instances; instead, first convert them to a system-agnostic format (JSON) and then load them. But then, should I still write my functions to interact with neo4j class instances?

Right now, the model I'm envisioning is that Neo4j query results will be translated to JSON for transmission from one serverless function to another. When they are received, they will be reconstituted into neo4j class instances and that's how the function will interact with them.

My thinking with doing it this way is that consistently modelling graph data the same way will make it easier for developers to understand and to model mentally. Also, they can just reference the neo4j docs to understand the data structure. A potential drawback is that we are tying our system to another system, even though functions other than db-query will not interact with the database, so using the neo4j structures is purely for consistency of form rather than function. And if we don't keep updating our code to match the latest version of the neo4j driver, it could cause confusion if the data structures change. We may also find that a different structure would better fit our use cases.

And if we move to a different database, it might be weird and confusing to use the neo4j library to model data coming from a different source. I think for now, there's no reason I need to translate JSON graphs back into neo4j structures, so downstream functions could just load the JSON into a dictionary or use the translate_json_to_graph() function to translate back to neo4j, if necessary. I liked the idea of completing the circle of graph -> JSON -> graph, but that's mostly just useful for testing and validation. Not sure if there is a further use case.

```mermaid
flowchart TD
    neo4j(neo4j) --> graf[neo4j.Graph]
    neo4j --> summary[neo4j.ResultSummary]
    graf -- trellis.translate_graph_to_json --> jsonGraph[json_graph]
    summary -- trellis.translate_summary_to_json --> jsonSummary[json_summary]
    jsonGraph --> qr1[trellis.QueryResponse]
    jsonSummary --> qr1[trellis.QueryResponse]
    qr1 -- Pub/Sub --> qr2[trellis.QueryResponse]
    qr2 -- json.loads --> Dict
    qr2 -- trellis.translate_json_to_graph --> neo4j.Graph
```
pbilling commented 1 year ago

Roadmap continued

Added a placeholder test for translate_result_summary_to_json() because I couldn't find much testing information and it's non-essential (https://github.com/neo4j/neo4j-python-driver/blob/4.4/tests/unit/work/test_result.py#LL340C2-L340C2).

pbilling commented 1 year ago

New core architecture diagram

```mermaid
graph TD
    gcs[Cloud Storage] -- CloudEvent --> create-blob-node
    create-blob-node -- trellis.QueryRequest --> db-query
    db-query -- trellis.QueryResponse --> check-triggers
    check-triggers -- trellis.QueryRequest --> db-query
    db-query -- trellis.QueryResponse --> job-launcher
    job-launcher -- trellis.QueryRequest --> db-query
    job-launcher -- dsubCommand --> api[Life Sciences API]
    api -- dataObject --> gcs
```
pbilling commented 1 year ago

Checklist: Integrate new message classes with core functions

pbilling commented 1 year ago

Aggregating or splitting results

Overview

In some cases, after running a query, I will want to aggregate all query results and send them to the same place. For instance, to launch GATK variant calling, I want to find all the Ubams for a sample and send them all to the job launcher as a group, because they will all be used as input to the same job.

In other cases, I'll want to split results. In the case where I want to run Vcfstats on all GVCFs, I'll want to run a query to find all GVCFs without Vcfstats data and then split the results to send each GVCF to a separate job-launcher instance to initiate a Vcfstats job.

Initial method

My initial conceptualization was to send the entire result to the QueryResponseWriter class and then have it either generate a message with the aggregate result or split the results into a series of messages. However, this required transforming the Graph structure into a dictionary and then splitting either nodes OR relationships. It also slightly changed the graph structure.

In a neo4j.Graph instance, each node is represented by an instance of the neo4j.Node class with all its defining attributes. The neo4j.Graph relationships include the relationship type, properties, and references to the nodes. So if you remove the nodes from the neo4j.Graph, the relationships in the graph just contain empty references. With my initial model, when splitting results, you could only deliver nodes or relationships, so the node definitions would be loaded directly into the relationship dict. This way, each relationship becomes an atomic structure rather than depending on references to other instances.

The problem is that the data in the dictionary no longer matches the neo4j.Graph structure, breaking the loop of Graph -> JSON -> Graph. Because it doesn't fit a Graph structure, you can't use the neo4j.Graph class to validate the graph data. To perform validation I would have to write separate test functions and then run them every time, etc. It sucks.
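For illustration, the two shapes look like this (the labels, properties, and ids are invented examples). The first is the flattened, atomic form the initial method produced; the second is the Graph-faithful form that survives the round trip.

```python
# Flattened "atomic" relationship: node definitions are copied into the
# relationship dict, so the data no longer matches the neo4j.Graph structure.
atomic_relationship = {
    "type": "HAS_SEQUENCING_READS",
    "properties": {},
    "start_node": {"labels": ["Sample"], "properties": {"sample": "SHIP123"}},
    "end_node": {"labels": ["Fastq"], "properties": {"size": 1024}},
}

# Graph-faithful form: nodes stand alone and relationships reference them
# by id, mirroring neo4j.Graph and keeping JSON -> Graph reconstitution possible.
graph_faithful = {
    "nodes": [
        {"id": "n0", "labels": ["Sample"], "properties": {"sample": "SHIP123"}},
        {"id": "n1", "labels": ["Fastq"], "properties": {"size": 1024}},
    ],
    "relationships": [
        {"id": "r0", "type": "HAS_SEQUENCING_READS",
         "start_node": "n0", "end_node": "n1", "properties": {}},
    ],
}
```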

New method

Instead of getting all query results from the Neo4j driver session and then implementing custom logic to do aggregation or splitting, I'll let the Neo4j driver handle it. Depending on whether the trellis.DatabaseQuery.aggregate_results attribute is true or false, I'll get the aggregated results as a neo4j.Graph or split results into individual neo4j.Record objects.
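A sketch of that branch. FakeResult stands in for neo4j.Result, which supports both behaviors used here: calling graph() collects everything into one Graph, while iterating yields one Record per row.

```python
class FakeResult:
    """Stand-in for neo4j.Result; mimics only the two behaviors used below."""

    def __init__(self, records, graph):
        self._records = records
        self._graph = graph

    def graph(self):
        # Aggregated: every node/relationship from the query in one Graph
        return self._graph

    def __iter__(self):
        # Split: one Record per result row
        return iter(self._records)

def collect_query_results(result, aggregate_results):
    """Let the driver do the grouping, keyed on aggregate_results."""
    if aggregate_results:
        return [result.graph()]
    return list(result)
```

The aggregated branch yields a single Graph payload; the split branch yields one item per record, each of which can be published as its own message.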

Question: what do records look like?

The neo4j.Graph has node and relationship instances, mirroring the graph. What do record instances look like? Will they represent nodes and relationships? Will they fit the rest of the Trellis model?

pbilling commented 1 year ago

Develop translate_record_to_json()

In order to write repeatable tests, I want to be able to hydrate records. The neo4j test class Records includes a method to hydrate records: https://github.com/neo4j/neo4j-python-driver/blob/a8a2b04da36662f0e8d3a0a2c3a9af6e7b9d2f2a/tests/unit/sync/work/test_result.py#L58. It looks like it just uses stock hydration methods, so I'm wondering if I can just pass hydration strings for nodes and relationships and it will turn them into records.

Also, I need to provide fields and records to initialize. I think fields are metadata fields which are disconnected from a node or relationship. For instance if I ran a query that looked like MATCH (f:Fastq) RETURN f.size AS size, f, it would return size as an integer field and f as a node. I don't plan on handling fields, so I think I can just provide an empty list.

Example statement for initializing a Records object: records = Records(["x"], records).

Update

The Records test class doesn't actually generate neo4j.Record objects, it just generates tuples with the record values. It looks like records are generated by the neo4j.Result class.

pbilling commented 1 year ago

Workflow describing methods and data structure used for grouping database query results

```mermaid
graph TD
    neo4j[(Neo4j database)] -- aggregate --> grf(neo4j.Graph)
    neo4j -- split --> rec1(neo4j.Record)
    grf -- trellis.graphToJson --> grfjson[Graph JSON]
    rec1 -- trellis.recordToJson --> grfjson
    grfjson --> response(trellis.QueryResponse)
    response -- trellis.QueryResponse.convert_to_json --> respjson[Query response JSON]
    respjson --> pubsub{Cloud Pub/Sub}
    pubsub --> respjson2[Query response JSON]
    respjson2 --> response2(trellis.QueryResponse)
    response2 -- jsonToGraph --> grf2(neo4j.Graph)
```

In that model, the graph and query response types are represented multiple times to indicate that instances of those types exist in separate serverless functions, communicating via Cloud Pub/Sub.

If we only consider unique types, we can see a loop between neo4j.Graph and Pub/Sub. And if we expanded this diagram to include all Trellis operations, we could see that, eventually, the downstream results of neo4j.Graph objects are fed back into the Neo4j database, completing all three loops.

```mermaid
graph TD
    neo4j[(Neo4j database)] -- aggregate --> grf(neo4j.Graph)
    neo4j -- split --> rec1(neo4j.Record)
    grf -- trellis.graphToJson --> grfjson[Graph JSON]
    rec1 -- trellis.recordToJson --> grfjson
    grfjson -- __init__ --> response(trellis.QueryResponse)
    response -- trellis.QueryResponse.convert_to_json --> respjson[Query response JSON]
    respjson --> pubsub{Cloud Pub/Sub}
    pubsub --> respjson
    respjson -- json.loads --> response
    response -- trellis.QueryResponse.graph --> grfjson
    grfjson -- trellis.jsonToGraph --> grf
```

Everything that begins with a lowercase letter describes a programming type or method.