nodestream-proj / nodestream

A Declarative framework for Building, Maintaining, and Analyzing Graph Data
https://nodestream-proj.github.io/docs/
Apache License 2.0

Redesign Nodestream Project Handling #359

Open zprobst opened 1 month ago

zprobst commented 1 month ago

Nodestream projects are currently complex to maintain and limiting. Duplication and complexity in the project model make it hard to develop new features. The core parts of the project model were never deliberately designed; they grew reactively as the project evolved. For 1.0, we want a project model that leaves the door open to new features down the road, increases maintainability, and removes some limitations that users face. This issue catalogs the problems as well as a proposed solution.

Current Issues

Reusing Plugin Pipelines

Currently, plugins work by adding a scope to the project. This means that a plugin's pipelines cannot be used more than once with a specified configuration. See #240

Plugin Development Difficulties

Currently, if you are developing a plugin, you have to do some gymnastics to run your pipelines inside of a project, which makes plugins quite difficult to develop.

Using the Same Pipeline More Than Once

Similar to plugins, you cannot provide a base configuration for a pipeline and use the same definition more than once.

Orchestrating Pipelines in a Particular Order

There is no way to run a "group" of pipelines in a particular order without specifying a specific series of pipelines in a run command.

Current Class Inventory

Let's examine the internal object model of the project:

PipelineConfiguration

Contains targets, annotations, and other data that pertains to how pipelines are treated and initialized.

PluginConfiguration

Largely duplicative of a scope definition. Used to "merge" configuration with the scope defined in the project.

PipelineScope

Defines a group of pipelines and shared configuration of those pipelines.

PipelineDefinition

Stores the file path to the definition as well as configuration specific to that definition.

Project

Contains all scopes and plugins.

Proposed Solution

The proposed solution would change an expanded nodestream.yaml file to look like this:

plugins:
  - type: plugin
    name: nodestream_plugin_sbom # as an example

pipelines:
  - name: foo
    path: foo/bar/baz.yaml
    config: # optional; same as today.
      foo: bar
    annotations: # optional; same as today.
      foo: bar

scopes:
  - name: crons
    pipelines:
      - !pipeline self/foo # refer to the prototype pipeline described above.
      - path: !pipeline self/foo # as written, same as above, but can specify overriding config, annotations, etc.
      - foo/bar/baz.yaml # can inline a pipeline, same as before.
      - path: foo/bar/baz.yaml # same as above.
      - !pipeline nodestream_plugin_sbom/default/github # import a pipeline from a plugin.

Since you can still define a pipeline in a child scope, there are no breaking changes for simple projects that define pipelines and scopes on their own. This does introduce a change for projects that use plugins, but it is minimal.
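
To make the four entry forms in the scopes list concrete, here is a rough Python sketch of how each one might be resolved into a pipeline. None of this is existing nodestream code: PipelineRef (standing in for the !pipeline tag), this simplified PipelinePrototype, resolve_entry, and the rule that an inline path is named after its file stem are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, replace
from pathlib import PurePosixPath
from typing import Any, Dict, Union


@dataclass(frozen=True)
class PipelineRef:
    """Hypothetical stand-in for a `!pipeline <target>` YAML tag."""
    target: str  # e.g. "self/foo"


@dataclass(frozen=True)
class PipelinePrototype:
    """Hypothetical, simplified prototype: a path plus default settings."""
    name: str
    path: str
    config: Dict[str, Any] = field(default_factory=dict)
    annotations: Dict[str, Any] = field(default_factory=dict)


Entry = Union[str, PipelineRef, Dict[str, Any]]


def resolve_entry(entry: Entry, prototypes: Dict[str, PipelinePrototype]) -> PipelinePrototype:
    """Resolve one item of a scope's `pipelines` list."""
    # The bare forms (`- !pipeline self/foo`, `- foo/bar/baz.yaml`) are
    # shorthand for the mapping forms with only `path` set.
    if not isinstance(entry, dict):
        entry = {"path": entry}

    source = entry["path"]
    if isinstance(source, PipelineRef):
        # Reference: start from the shared prototype.
        base = prototypes[source.target]
    else:
        # Inline path: assumed here to be named after the file stem.
        base = PipelinePrototype(name=PurePosixPath(source).stem, path=source)

    # Scope-level settings override the prototype's defaults key by key.
    return replace(
        base,
        config={**base.config, **entry.get("config", {})},
        annotations={**base.annotations, **entry.get("annotations", {})},
    )


prototypes = {
    "self/foo": PipelinePrototype("foo", "foo/bar/baz.yaml", config={"foo": "bar"}),
}
same = resolve_entry(PipelineRef("self/foo"), prototypes)
overridden = resolve_entry(
    {"path": PipelineRef("self/foo"), "config": {"foo": "other"}}, prototypes
)
inline = resolve_entry("foo/bar/baz.yaml", prototypes)
```

In this sketch the same prototype can appear in any number of scopes, each with its own overrides, which is the reuse that the current scope-per-plugin model prevents.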

PipelineConfiguration

This class would stay essentially unchanged compared to the current class.

PipelineDefinition -> PipelinePrototype

This issue proposes renaming PipelineDefinition to PipelinePrototype. It would contain the following data:

Scope and Project

Essentially, the Project class is duplicative of the Scope class. With the changes to the internal data model, a specific project class is not required. The data model essentially becomes a graph of scopes, and thus the Project class is just a Scope node in the graph. Therefore, changing the APIs to merge the best of both worlds should be all we need.
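
As a rough illustration of "the project is just a Scope node", the sketch below models scopes as a tree (a simplification of the graph described above) whose root plays the role of the project. The Scope fields, the walk method, and the slash-separated qualified names are assumptions for illustration, not the current or planned nodestream API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Scope:
    """Hypothetical scope node: named pipelines plus nested child scopes."""
    name: str
    pipelines: List[str] = field(default_factory=list)
    children: List["Scope"] = field(default_factory=list)

    def walk(self, prefix: str = "") -> Iterator[str]:
        """Yield a qualified name for every pipeline, in declaration order."""
        qualified = f"{prefix}/{self.name}" if prefix else self.name
        for pipeline in self.pipelines:
            yield f"{qualified}/{pipeline}"
        for child in self.children:
            yield from child.walk(qualified)


# The "project" is simply the root scope; no separate Project class is needed.
project = Scope(
    name="self",
    pipelines=["foo"],
    children=[Scope(name="crons", pipelines=["foo"])],
)
names = list(project.walk())
```

Here names is ["self/foo", "self/crons/foo"]: the top-level pipeline and its use inside the crons scope get distinct qualified names even though both are called foo.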

angelosantos4 commented 1 month ago
scopes:
  - name: crons
    pipelines:
      - !pipeline self/foo # refer to the prototype pipeline described above. 
      - path: !pipeline self/foo # as written, same as above. But can specify overriding config, annotations, etc.
      - baz/bar/foo.yaml # Path as done before.

I'm guessing this is just to illustrate how we can alias the pipelines, but I would imagine each one needs a unique identifier for the nodestream run {identifier} command. Would that be self/foo, self/crons/self/foo, or self/crons/foo?