va-big-data-genomics / trellisdata

Python package with classes and methods used in implementing a Trellis data management system.
MIT License
0 stars 1 forks source link

JobRequest: Figure out how to handle different patterns: unconnected nodes vs related nodes. #5

Open pbilling opened 1 year ago

pbilling commented 1 year ago

The JobLauncher() should support two types of inputs: a pair of related nodes (e.g. fastq-to-ubam where two related Fastqs are used as input) and a set of unconnected nodes (e.g. 'gatk-5-dollar` where a set of unrelated Ubams are used as input.

They aren't unrelated though, they are related, but the parent node of the Ubams is not relevant to the job.

Also, they aren't directly related to a single parent node, they are ~4 relationships removed from a common parent.

A broader design question is: should relationship pattern be presented to the job launcher even if both ends of the relationship are not directly relevant for the job? My initial thought was "no" because I didn't see an obvious functional justification. But I only really looked at (2) use cases: fastq-to-ubam and gatk-5-dollar. Maybe I should expand my investigation.

pbilling commented 1 year ago

The gatk-5-dollar one is a weird one, because it's not a traditional dsub job, dsub just creates the Cromwell manager node which reads a separate set of config files to determine inputs/outputs etc.

So, for instance, the Ubam paths are specified as a list in the config file. Right now this config file is generated by the launch-gatk-5-dollar function, but in the new model, it would be generated by the dsub script, so I would have to figure out how to pass the paths of each node, as a list, to a dsub variable.

Sigh.

I'd have to support like list comprehension in my template? That sounds heinous.

pbilling commented 1 year ago

Roadmap

pbilling commented 1 year ago

Sources prototype

sources:
  -
    name: fastq_mate_pair
    graph_status: relationship
    pattern: 
      start: Fastq
      rel: HAS_MATE_PAIR
      end: Fastq
    cardinality: one-to-one
  -
    name: ubams
    graph_status: independent
    pattern:
      start: Ubam
    cardinality: many
pbilling commented 1 year ago

Prototype source YAML seems fine. Created in trellisdata/tests/sample_inputs/job-sources.yaml.

pbilling commented 1 year ago

Observed cardinality cases: one, one-to-one, many. Because I only have (8) cases, I think I'll just include all of them for testing.

pbilling commented 1 year ago

Implementation notes

pbilling commented 1 year ago

I need to see what it looks like to try to return a path of nodes connected across multiple relationships to see if I could support the use case of (:Sample)-[*]->(:Ubam).

UPDATE: If you return a path the Graph result will include every relationship in that path. Don't think it makes sense for this use case, better option would just be to return Sample and Ubam nodes independently. No need to take into account their relationship for function logic anyway. The relevance for returning the relationship for (:Fastq)-[:HAS_MATE_PAIR]->(:Fastq) was that the order mattered (though I'm actually not sure that's the case).

pbilling commented 1 year ago

For the JobRequest() class I was thinking of doing most of the operations in the init(), but I think to enable unit testing I will move them to a separate function.

pbilling commented 1 year ago

On designing interacting classes

Another thing I was thinking about was to pass the task, trellis_config, and query_response as argument to initialize JobRequest instances.

job_request = JobRequest(task, trellis_config, query_response)

The issue I have with that is that the Trellis configuration and query response aren't really attributes of the job request, they just include information that is useful to running the JobRequest() methods. My thinking being "the more information you give the JobRequest class the less logic has to be applied to programming the code that is creating JobRequest instances".

The other thing I was thinking about is that if I am a developer trying to use the trellisdata package to implement my data management system, the more tightly integrated those classes are, the easier it is to implement them, because they are all designed to work together. So, if the JobRequest class takes a QueryResponse instance as input, because both of these are trellisdata classes, they should always be harmonized and compatible.

But, that doesn't mean the QueryResponse has to be provided to the init function, but it does mean the whole QueryResponse instance should be provided, as opposed to a slice of it (e.g. query_response.nodes). If only a slice is provided and then the structure of the QueryResponse class changes so that the slice is no longer valid, the developer who implemented it will have to change their code. But if the whole instance is provided, it won't matter how the internal class structure changes because all the other classes that are designed to interact with it will also be updated to harmonize with the new structure.

pbilling commented 1 year ago
pbilling commented 1 year ago

I should put a sample QueryResponse in a separate input file so I can import it into different tests and only have to update it one place if the structure changes.