marklogic-community / marklogic-state-conductor

An event-based state-machine engine for manipulating MarkLogic database documents.
Other
1 stars 4 forks source link
marklogic orchestrator

MarkLogic State Conductor

The MarkLogic State Conductor (MLSC) allows a developer or architect to define state machines which govern how data moves through a set of MarkLogic Data Hub Steps, and optionally through other custom processing actions. MLSC state machines are defined using a subset of Amazon States Language (ASL). Actions to run a DHF Step or Flow are included, and any other state actions can be defined using server-side modules.

The State Conductor can be used to perform an arbitrary number of actions, in any order, and with branching or other logic based on document content, or context that is passed from state to state. Actions could include: invoking a Data Hub flow, transforming a document, applying metadata, manipulating or querying side-car documents, or invoking a non-DHF process. On premise, these actions can include calling out to another process via HTTP or posting to an event queue.

The State Conductor requires a Driver to process documents and move them through the installed state machines' states. The State Conductor supports a Data Services driver, and a CoRB2 driver.


The State Conductor allows a division of labor among different personas, where business analysts or architects analyze and define the overall flow of data through the system, developers convert that to a state machine configuration, and MarkLogic experts define the DHF Steps or other processes to perform on each state transition. Conversely, architects can define state machines which are then consumed and discussed by less-technical experts with business knowledge.

MLSC uses a variant of Amazon States Language as inspiration for the state machine definition files, so it is familiar to some AWS users and is flexible in the same ways AWS States Language is flexible.

In addition to defining a flexible set of states and transitions via state machines, MLSC ensures that the state of every record is tracked and managed via state-oriented metadata in “execution documents.”

Should you use MLSC for your project? See Applicability


  1. Quick Start Guide
  2. Installation
  3. Usage
  4. State Machine Definitions
  5. State Machine Scope
  6. State Machine Actions
  7. Execution Documents
  8. Provenance
  9. Services
  10. Executions Service
  11. State Machines Service
  12. Status Service
  13. Drivers
  14. Applicability
  15. Roadmap

Installation

Prerequisites:

  1. MarkLogic 10.0-6+
  2. ml-gradle 3.14.0+

The State Conductor is distributed as an mlBundle for ml-gradle projects. To add the State Conductor to your project, add the following dependency to your ml-gradle project:

dependencies {
  mlBundle "com.marklogic:marklogic-state-conductor:1.2.2"                  // to use a published version
  mlBundle files("${projectDir}/lib/marklogic-state-conductor-1.2.2.jar")   // to include locally in your project
}

Usage

Any documents created or modified having the state-conductor-item collection will trigger processing by the State Conductor. They will be evaluated against the context of all installed State Machine Definitions. For each matching State Machine Definition an Execution document will be created corresponding to the matching state machine and triggering document. A property will be added to the triggering document's metadata indicating the Execution file's id:

<state-conductor-execution stateMachine-name="state-machine-name" execution-id="ec89d520-e7ec-4b6b-ba63-7ea3a85eff02" date="2019-11-08T17:34:28.529Z" />

NOTE: Document modifications during, or after the competion of a State Machine will not cause that document to be reprocessed by that same state machine. To manually run a State Machine on a document that it has already been processed by requires manual invokation of the Jobs Service.

State Machine Definitions

State Machine definition files define the states that documents will transition through. States can perform actions (utilizing SJS modules in MarkLogic), performing branching logic, or terminate processing. State Machine definition files are json formatted documents within the application's content database; they should have the "state-conductor-state-machine" collection, and have the ".asl.json" file extension.

Example State Machine Definition File:

{
  "Comment": "sets some property values",
  "mlDomain": {
    "context": [
      {
        "scope": "directory",
        "value": "/test/"
      },
      {
        "scope": "collection",
        "value": "test"
      }
    ]
  },
  "StartAt": "set-prop1",
  "States": {
    "set-prop1": {
      "Type": "Task",
      "Comment": "initial state of the flow",
      "Resource": "/state-conductor/actions/common/examples/set-prop1.sjs",
      "Parameters": {
        "foo": "bar"
      },
      "Next": "set-prop2"
    },
    "set-prop2": {
      "Type": "Task",
      "End": true,
      "Comment": "updates a property",
      "Resource": "/state-conductor/actions/common/examples/set-prop2.sjs"
    }
  }
}

State Machine Scope

State Machine Definition files must define a context within an mlDomain property under the definition file's root. This context defines one or more scopes for which matching documents will have this state machine automatically applied.

Example:

"mlDomain": {
  "context": [
    {
      "scope": "collection",
      "value": "my-collection"
    },
    {
      "scope": "directory",
      "value": "/my/directory/"
    },
    {
      "scope": "query",
      "value": "{\"andQuery\":{\"queries\":[{\"collectionQuery\":{\"uris\":[\"test\"]}}, {\"elementValueQuery\":{\"element\":[\"name\"], \"text\":[\"John Doe\"], \"options\":[\"lang=en\"]}}]}}"
    }
  ]
}

Valid scopes are collection, directory, and query. For scopes of type query, the value must be a string containing the JSON serialization of a cts query.

State Machine Actions

State machine States of the type "Task" can define actions to perform on in-process documents. These actions take the form of Server-Side Javascript modules referenced by the "Resource" property. Action modules can perform custom activities such as updating the in-process document, performing a query, invoking an external service, etc. Action modules should export a "performAction" function with the following signature:

'use strict';

function performAction(uri, parameters = {}, context = {}) {
  // do something
}

exports.performAction = performAction;

Where uri is the document being processed by the flow; parameters is a json object configured via this State's "Parameters" property; and context contains the current in-process State Machine's context. Any data returned by the performAction function will be stored as the in-process state machine's new context object.

Execution Documents

For every document processed by a State Conductor state machine there is a corresponding Execution document. Execution documents are stored in the state-conductor-executions database, in the /stateConductorExecution/ folder. These documents track the in-process document, and state machine status; they also store the state machine's context and provenance information.

Provenance

Every time a document starts, stops, or transitions from one state to another within a state machine, the Provenance information stored in the Execution document is updated.


Services

The State Conductor includes MarkLogic REST service extensions for managing Flow files and State Conductor Jobs.

Executions Service

Start one or more State Conductor Executions:

PUT /v1/resources/state-conductor-executions?rs:uris=</my/documents/uri>&rs:name=<state-machine-name>

Get the execution id for the given document and state machine:

GET /v1/resources/state-conductor-executions?rs:uri=</my/documents/uri>&rs:name=<state-machine-name>

State Machine Service

List the installed State Conductor state machines:

GET /v1/resources/state-conductor-state-machines?rs:name=<state-machine-name>

Install a State Conductor State Machine definition:

PUT /v1/resources/state-conductor-state-machines?rs:name=<state-machine-name>

Remove an installed State Conductor State Machine definition:

DELETE /v1/resources/state-conductor-state-machines?rs:name=<state-machine-name>

Status Service

List the status of the given State Conductor State Machine:

GET /v1/resources/state-conductor-status?rs:name=<state-machine-name>&rs:startDate=<xs.dateTime>&rs:endDate=<xs.dateTime>

New (optional) temporal parameters startDate and endDate in v0.3.0.


Drivers

The State Conductor utilizes a "Driver" to process documents; moving them through the installed state machines' states in the prescribed order.

It is simple to “drive” MLSC:

  1. Get a set of execution document URIs that represent data not yet in a final state using the getExecutions data service.
  2. For each of these documents, make a request to the processExecution data service, which takes one URI and advances that execution to the next State per the DHF Step or other process.
  3. Repeat this forever.

For convenience, two drivers are included: one using corb, and one written in Java that executes the above data services. The responsiblity of the Driver in MLSC is only to determine which state machines to run and with how many threads. Which Steps are run in what order, when to retry, and other logic is the province of the state machine definition itself.

Note that MLSC does not use the DHF built-in libraries as drivers to run DHF Steps. Those libraries execute each Step’s sourceQuery, and MLSC does not use sourceQuery configurations to determine what steps to run. It uses state machines, defined declaratively as JSON files. This is a rather different paradigm, which offloads much of the logic from the callers, and is why the “drivers” for MLSC are extremely simple.

For more information see Drivers.


Applicability

When to use the MarkLogic State Conductor?

Comparison and Usage Guidance vs Other Tools


Roadmap

See Enhancements