veg / hivclustering

Infer molecular transmission networks from pairwise distance files (part of HIV-TRACE)

Schema-Based Node Keying for Attribute Matching #48

Closed stevenweaver closed 1 month ago

stevenweaver commented 1 month ago

Summary:

This PR refactors node key construction to use the fields defined in the keying section of schema.json. Instead of indexing nodes by the entire id, it builds the key from selected fields, which makes matching nodes to attributes more flexible.

Key Changes:

  1. Key Construction from Schema:

    • Added construct_node_key_from_schema to build node keys using fields listed in keying.fields and keying.delimiter from schema.json.
    • Nodes are now indexed by these constructed keys, rather than the full id.
  2. Schema-Defined Keying:

    • Key fields (e.g., document_uid and lab_seq) and a delimiter are now specified in schema.json under the keying section:
      {
        "keying": {
          "fields": ["document_uid", "lab_seq"],
          "delimiter": "~"
        }
      }
  3. Updated Node Matching:

    • Nodes in network_json are matched by the key constructed from the specified fields rather than the full id.
    • Attribute injection remains unchanged, but now relies on schema-based key matching.
  4. Backward Compatibility:

    • Falls back to the existing behavior (e.g., ehars_uid) when the schema specifies no keying fields.
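
The keying behavior described above can be illustrated with a minimal sketch. It is not the PR's actual code: build_node_key is a hypothetical stand-in for construct_node_key_from_schema, and the record layout (plain dicts carrying the key fields), the "attributes" slot on each node, the "~" default delimiter, and the sample values are all assumptions made for illustration.

```python
import json

def build_node_key(record, schema, fallback_field="ehars_uid"):
    """Join the schema's keying fields into a single key string.

    Hypothetical helper: assumes `record` is a dict that carries every
    field named in schema["keying"]["fields"].
    """
    keying = schema.get("keying") or {}
    fields = keying.get("fields") or []
    if not fields:
        # No keying fields in the schema: fall back to a single field
        # (ehars_uid here, mirroring the PR's backward-compatibility note).
        return str(record[fallback_field])
    delimiter = keying.get("delimiter", "~")  # "~" default is an assumption
    return delimiter.join(str(record[field]) for field in fields)

# The keying section from the PR description.
schema = json.loads(
    '{"keying": {"fields": ["document_uid", "lab_seq"], "delimiter": "~"}}'
)

# Made-up sample data; the real network_json layout may differ.
nodes = [
    {"id": "D1~S1~extra", "document_uid": "D1", "lab_seq": "S1", "ehars_uid": "E1"}
]
attribute_rows = [
    {"document_uid": "D1", "lab_seq": "S1", "ehars_uid": "E1", "risk": "unknown"}
]

# Index nodes by the constructed key rather than by the full id.
index = {build_node_key(node, schema): node for node in nodes}

# Attribute injection: look each row up through the same key.
for row in attribute_rows:
    node = index.get(build_node_key(row, schema))
    if node is not None:
        node.setdefault("attributes", {}).update(row)
```

In this sketch the node with id "D1~S1~extra" is indexed under "D1~S1", so an attribute row that carries only document_uid and lab_seq still matches it. Passing a schema with no keying section makes the helper key on ehars_uid instead.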

Why:

Testing: