The MarkLogic State Conductor (MLSC) allows a developer or architect to define state machines which govern how data moves through a set of MarkLogic Data Hub Steps, and optionally through other custom processing actions. MLSC state machines are defined using a subset of Amazon States Language (ASL). Actions to run a DHF Step or Flow are included, and any other state actions can be defined using server-side modules.
The State Conductor can be used to perform an arbitrary number of actions, in any order, with branching or other logic based on document content or on context that is passed from state to state. Actions could include: invoking a Data Hub flow, transforming a document, applying metadata, manipulating or querying side-car documents, or invoking a non-DHF process. In on-premises deployments, these actions can also include calling out to another process via HTTP or posting to an event queue.
The State Conductor requires a Driver to process documents and move them through the installed state machines' states. The State Conductor supports a Data Services driver, and a CoRB2 driver.
The State Conductor allows a division of labor among different personas, where business analysts or architects analyze and define the overall flow of data through the system, developers convert that to a state machine configuration, and MarkLogic experts define the DHF Steps or other processes to perform on each state transition. Conversely, architects can define state machines which are then consumed and discussed by less-technical experts with business knowledge.
MLSC uses a variant of Amazon States Language as inspiration for the state machine definition files, so it is familiar to some AWS users and is flexible in the same ways Amazon States Language is flexible.
In addition to defining a flexible set of states and transitions via state machines, MLSC ensures that the state of every record is tracked and managed via state-oriented metadata in “execution documents.”
Should you use MLSC for your project? See Applicability
Prerequisites:
- MarkLogic 10.0-6+
- ml-gradle 3.14.0+
The State Conductor is distributed as an mlBundle for ml-gradle projects. To add the State Conductor to your project, add the following dependency to your ml-gradle project:
dependencies {
    mlBundle "com.marklogic:marklogic-state-conductor:1.2.2" // to use a published version
    mlBundle files("${projectDir}/lib/marklogic-state-conductor-1.2.2.jar") // to include locally in your project
}
Any document created or modified with the state-conductor-item collection will trigger processing by the State Conductor. It will be evaluated against the context of all installed State Machine Definitions. For each matching State Machine Definition, an Execution document will be created corresponding to the matching state machine and the triggering document. A property will be added to the triggering document's metadata indicating the Execution document's id:
<state-conductor-execution stateMachine-name="state-machine-name" execution-id="ec89d520-e7ec-4b6b-ba63-7ea3a85eff02" date="2019-11-08T17:34:28.529Z" />
NOTE: Document modifications during, or after the completion of, a State Machine will not cause that document to be reprocessed by that same state machine. To rerun a State Machine on a document it has already processed, manually invoke the Jobs Service.
State Machine definition files define the states that documents will transition through. States can perform actions (utilizing SJS modules in MarkLogic), perform branching logic, or terminate processing. State Machine definition files are JSON-formatted documents within the application's content database; they should have the "state-conductor-state-machine" collection and the ".asl.json" file extension.
Example State Machine Definition File:
{
  "Comment": "sets some property values",
  "mlDomain": {
    "context": [
      {
        "scope": "directory",
        "value": "/test/"
      },
      {
        "scope": "collection",
        "value": "test"
      }
    ]
  },
  "StartAt": "set-prop1",
  "States": {
    "set-prop1": {
      "Type": "Task",
      "Comment": "initial state of the flow",
      "Resource": "/state-conductor/actions/common/examples/set-prop1.sjs",
      "Parameters": {
        "foo": "bar"
      },
      "Next": "set-prop2"
    },
    "set-prop2": {
      "Type": "Task",
      "End": true,
      "Comment": "updates a property",
      "Resource": "/state-conductor/actions/common/examples/set-prop2.sjs"
    }
  }
}
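Because each state refers to the next one by name, a typo in "StartAt" or "Next" is easy to introduce. The following is a minimal sketch of a sanity check that could be run before installing a definition. The checkDefinition helper is hypothetical and not part of MLSC; it only checks the properties shown in the example above, and the example object is a shortened copy of that definition.

```javascript
'use strict';

// Hypothetical helper (not part of MLSC): returns a list of structural
// problems found in a parsed state machine definition.
function checkDefinition(definition) {
  const problems = [];
  const states = definition.States || {};

  // "StartAt" must name one of the defined states.
  if (!(definition.StartAt in states)) {
    problems.push(`StartAt refers to unknown state "${definition.StartAt}"`);
  }

  for (const [name, state] of Object.entries(states)) {
    // Every "Next" must name one of the defined states.
    if (state.Next !== undefined && !(state.Next in states)) {
      problems.push(`State "${name}" has unknown Next "${state.Next}"`);
    }
    // Task states reference their action module through "Resource".
    if (state.Type === 'Task' && !state.Resource) {
      problems.push(`Task state "${name}" has no Resource`);
    }
  }
  return problems;
}

// Shortened copy of the example definition above.
const example = {
  StartAt: 'set-prop1',
  States: {
    'set-prop1': {
      Type: 'Task',
      Resource: '/state-conductor/actions/common/examples/set-prop1.sjs',
      Next: 'set-prop2'
    },
    'set-prop2': {
      Type: 'Task',
      End: true,
      Resource: '/state-conductor/actions/common/examples/set-prop2.sjs'
    }
  }
};

console.log(checkDefinition(example)); // an empty list means no problems were found
```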
State Machine Definition files must define a context within an mlDomain property under the definition file's root. This context defines one or more scopes for which matching documents will have this state machine automatically applied.
Example:
"mlDomain": {
  "context": [
    {
      "scope": "collection",
      "value": "my-collection"
    },
    {
      "scope": "directory",
      "value": "/my/directory/"
    },
    {
      "scope": "query",
      "value": "{\"andQuery\":{\"queries\":[{\"collectionQuery\":{\"uris\":[\"test\"]}}, {\"elementValueQuery\":{\"element\":[\"name\"], \"text\":[\"John Doe\"], \"options\":[\"lang=en\"]}}]}}"
    }
  ]
}
Valid scopes are collection, directory, and query. For scopes of type query, the value must be a string containing the JSON serialization of a cts query.
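Writing the escaped query string by hand is error-prone. As a sketch, the string can be produced by building the serialized query as a plain object and passing it to JSON.stringify. The object below mirrors the query from the example above, and the variable names are illustrative only.

```javascript
'use strict';

// JSON form of the cts query used in the example above.
const query = {
  andQuery: {
    queries: [
      { collectionQuery: { uris: ['test'] } },
      {
        elementValueQuery: {
          element: ['name'],
          text: ['John Doe'],
          options: ['lang=en']
        }
      }
    ]
  }
};

// The scope's "value" must be a string, so the query object is serialized.
const queryScope = {
  scope: 'query',
  value: JSON.stringify(query)
};

// Printing the scope shows the escaped form expected in the definition file.
console.log(JSON.stringify(queryScope, null, 2));
```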
State machine States of the type "Task" can define actions to perform on in-process documents. These actions take the form of Server-Side JavaScript modules referenced by the "Resource" property. Action modules can perform custom activities such as updating the in-process document, performing a query, invoking an external service, etc. Action modules should export a "performAction" function with the following signature:
'use strict';
function performAction(uri, parameters = {}, context = {}) {
  // do something
}
exports.performAction = performAction;
Here, uri is the URI of the document being processed by the state machine; parameters is a JSON object configured via this State's "Parameters" property; and context contains the current in-process State Machine's context. Any data returned by the performAction function will be stored as the in-process state machine's new context object.
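As an illustration of how the returned value becomes the next context, here is a minimal sketch of an action module. The visits, lastUri, and lastParameters fields are hypothetical names chosen for this example. A real action would typically also read or update the document with MarkLogic APIs, which is omitted here.

```javascript
'use strict';

// Hypothetical action: records which document was seen and how many
// times this action has run for the current execution.
function performAction(uri, parameters = {}, context = {}) {
  const visits = (context.visits || 0) + 1;

  // Return a new object rather than mutating the incoming context;
  // whatever is returned is stored as the new context.
  return Object.assign({}, context, {
    lastUri: uri,
    lastParameters: parameters,
    visits
  });
}

exports.performAction = performAction;
```

With the "Parameters" from the example definition, calling performAction('/test/doc1.json', { foo: 'bar' }, {}) returns a context whose visits field is 1 and whose lastParameters field holds the configured parameters.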
For every document processed by a State Conductor state machine there is a corresponding Execution document. Execution documents are stored in the state-conductor-executions database, in the /stateConductorExecution/ folder. These documents track the in-process document and state machine status; they also store the state machine's context and provenance information.
Every time a document starts, stops, or transitions from one state to another within a state machine, the Provenance information stored in the Execution document is updated.
The State Conductor includes MarkLogic REST service extensions for managing State Machine definitions and State Conductor Executions.
Start one or more State Conductor Executions:
PUT /v1/resources/state-conductor-executions?rs:uris=</my/documents/uri>&rs:name=<state-machine-name>
Get the execution id for the given document and state machine:
GET /v1/resources/state-conductor-executions?rs:uri=</my/documents/uri>&rs:name=<state-machine-name>
List the installed State Conductor state machines:
GET /v1/resources/state-conductor-state-machines?rs:name=<state-machine-name>
Install a State Conductor State Machine definition:
PUT /v1/resources/state-conductor-state-machines?rs:name=<state-machine-name>
Remove an installed State Conductor State Machine definition:
DELETE /v1/resources/state-conductor-state-machines?rs:name=<state-machine-name>
List the status of the given State Conductor State Machine:
GET /v1/resources/state-conductor-status?rs:name=<state-machine-name>&rs:startDate=<xs.dateTime>&rs:endDate=<xs.dateTime>
The optional temporal parameters startDate and endDate are new in v0.3.0.
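The status request above can be assembled programmatically. The sketch below only builds the URL; the base URL (host and port) and the statusUrl helper are assumptions for illustration, and authentication and the HTTP call itself are left to whatever client you use.

```javascript
'use strict';

// Hypothetical base URL; substitute the host and port of your own
// REST app server.
const BASE_URL = 'http://localhost:8010';

// Builds the status URL for one state machine. startDate and endDate
// are optional dateTime strings and are appended only when supplied.
function statusUrl(name, startDate, endDate) {
  let url = `${BASE_URL}/v1/resources/state-conductor-status` +
    `?rs:name=${encodeURIComponent(name)}`;
  if (startDate) {
    url += `&rs:startDate=${encodeURIComponent(startDate)}`;
  }
  if (endDate) {
    url += `&rs:endDate=${encodeURIComponent(endDate)}`;
  }
  return url;
}

console.log(statusUrl('my-state-machine', '2019-11-01T00:00:00Z'));
```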
The State Conductor utilizes a "Driver" to process documents, moving them through the installed state machines' states in the prescribed order.
It is simple to “drive” MLSC:
- Call the getExecutions data service to retrieve the URIs of the executions to process.
- For each URI, call the processExecution data service, which takes one URI and advances that execution to the next State per the DHF Step or other process.
For convenience, two drivers are included: one using CoRB2, and one written in Java that executes the above data services. The responsibility of the Driver in MLSC is only to determine which state machines to run and with how many threads. Which Steps are run in what order, when to retry, and other logic is the province of the state machine definition itself.
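The loop around the getExecutions and processExecution data services can be sketched as follows. This is an illustration of the driver's shape only, not the included CoRB2 or Java driver. The two services are passed in as plain functions, and their signatures here (a batch size in, a list of URIs out; one URI in, a result out) are assumptions made for the example. It runs synchronously and in a single thread for simplicity.

```javascript
'use strict';

// One pass of a hypothetical driver: fetch a batch of execution URIs,
// then advance each execution by one state.
function runDriverPass(getExecutions, processExecution, batchSize = 10) {
  const uris = getExecutions(batchSize);
  const results = [];

  for (const uri of uris) {
    try {
      results.push({ uri, ok: true, result: processExecution(uri) });
    } catch (err) {
      // The driver only records the failure; retry behavior belongs to
      // the state machine definition, not to the driver.
      results.push({ uri, ok: false, error: String(err) });
    }
  }
  return results;
}

// Stubs standing in for the real data services.
const pending = [
  '/stateConductorExecution/a.json',
  '/stateConductorExecution/b.json'
];
const results = runDriverPass(
  (limit) => pending.slice(0, limit),
  (uri) => `processed ${uri}`
);

console.log(results);
```

A real driver would repeat such passes and spread the processExecution calls across threads, which is the part the driver is responsible for deciding.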
Note that MLSC does not use the DHF built-in libraries as drivers to run DHF Steps. Those libraries execute each Step’s sourceQuery, and MLSC does not use sourceQuery configurations to determine what steps to run. It uses state machines, defined declaratively as JSON files. This is a rather different paradigm, which offloads much of the logic from the callers, and is why the “drivers” for MLSC are extremely simple.
For more information see Drivers.
When to use the MarkLogic State Conductor?
See Enhancements