marklogic / nifi

Mirror of Apache NiFi to support ongoing MarkLogic integration efforts
https://marklogic.github.io/nifi/
Apache License 2.0

trans:name dynamic property not working with PutMarklogic #195

Closed rishabh208gupta closed 1 year ago

rishabh208gupta commented 1 year ago

trans:name, which can be added as a dynamic property to pass a parameter to the transform module in the PutMarkLogic processor, is not being evaluated properly. It is sent as an empty string if we specify an EL expression such as ${mscNumber}; if a hardcoded string is used, however, it is sent across properly.

rjrudin commented 1 year ago

Hi @rishabh208gupta - the expression should work if "mscNumber" is defined in the NiFi variable registry. It will not be evaluated against FlowFile attributes, per the documentation (available via "View Usage" on PutMarkLogic):

[image: screenshot of the PutMarkLogic usage documentation]

The reason for this is that the transform object is built once when PutMarkLogic is started, and thus it can only use the variables in the NiFi registry. It's not possible to then modify it for each FlowFile.
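To illustrate the behavior described above, here is a toy model in Python. It is not NiFi's or PutMarkLogic's actual code: `evaluate` and `ToyPutProcessor` are hypothetical names, and the expression handling only covers the plain `${name}` form. It shows how a parameter that is resolved once at startup against the variable registry comes out empty when the name only exists as a FlowFile attribute.

```python
import re

# Matches the simple ${name} form only; real NiFi Expression Language is far richer.
_EL_REFERENCE = re.compile(r"\$\{(\w+)\}")


def evaluate(expression, scope):
    """Replace each ${name} with scope[name], or with "" when the name is missing."""
    return _EL_REFERENCE.sub(lambda m: str(scope.get(m.group(1), "")), expression)


class ToyPutProcessor:
    """Hypothetical stand-in for a processor that builds its transform once."""

    def __init__(self, trans_properties, variable_registry):
        # Transform parameters are resolved a single time, when the processor starts,
        # so only the variable registry is available as a scope.
        self.transform_params = {
            name: evaluate(value, variable_registry)
            for name, value in trans_properties.items()
        }

    def write(self, flowfile_attributes):
        # FlowFile attributes are never consulted for transform parameters here.
        return dict(self.transform_params)
```

In this sketch, a processor configured with `${mscNumber}` and an empty registry sends an empty string for every FlowFile, regardless of what the FlowFile's own `mscNumber` attribute holds.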

Do you have a need for values to be sent to the transform for each FlowFile?

rishabh208gupta commented 1 year ago

Hi @rjrudin, sorry, my bad, yeah I get it now. My requirement is to send a value from FlowFile attributes, so a different value for each FlowFile. Is there anything that can be done for this? Thanks.

rjrudin commented 1 year ago

Could you add those attributes to the body of the FlowFile instead? For example, if the FlowFiles you're sending to PutMarkLogic contain JSON which is then written as a JSON document to MarkLogic, you could use an ExecuteScript processor to add the attribute values as new keys to your JSON document. Your REST transform would then have access to that data and could do whatever it needs to with it.
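As a rough sketch of that idea, the core of such a script could look like the Python function below. The function name `merge_attributes` and the `keys` argument are hypothetical, and the NiFi-specific wiring (reading the FlowFile content and attributes from the session and writing the result back) is deliberately left out, so this only shows the merge step and assumes the FlowFile body is a JSON object.

```python
import json


def merge_attributes(body_text, attributes, keys):
    """Copy the selected FlowFile attributes into a JSON object as new top-level keys."""
    doc = json.loads(body_text)
    if not isinstance(doc, dict):
        raise ValueError("expected a JSON object at the top level")
    for key in keys:
        # Skip attributes that this FlowFile does not carry.
        if key in attributes:
            doc[key] = attributes[key]
    return json.dumps(doc)
```

Passing an explicit list of keys keeps unrelated attributes (such as `filename` or `uuid`) out of the stored document unless they are asked for.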

rishabh208gupta commented 1 year ago

The FlowFile is a binary file; it could be .mp4, .png, .pdf, etc.

rjrudin commented 1 year ago

Got it - and are you looking to include metadata about the binary file that either gets persisted as document properties or metadata or as a separate document?

rishabh208gupta commented 1 year ago

The requirement is to pass a parameter, which is different for each file, to the transform module. In the transform module we use that param in a separate document that we create and ingest. It's like a log of the incoming FlowFile's metadata, and this metadata document needs to have the param as one of its fields.

rjrudin commented 1 year ago

Could you use a meta: or property: attribute to stash the document-specific value as a metadata key or a document property to achieve the same effect? It at least allows you to get that value to MarkLogic. I don't think the REST transform gives you access to the metadata for a document, but a technique I've used in the past is a pre-commit trigger, which will have access to all parts of the URI. At that point, you could write a little bit of trigger code to fetch the document metadata keys or properties from the URI (the binary you're writing) and use them to insert another document.

rishabh208gupta commented 1 year ago

Hi @rjrudin, we wouldn't want the performance implications and the additional complexity of including a pre-commit trigger. Is there any reason why trans: was made only to read variables from the variable registry and not from expression language? Wouldn't it be better if it was made to read from both?

rjrudin commented 1 year ago

The issue is that PutMarkLogic uses MarkLogic's WriteBatcher component for writing batches of documents in multiple threads, and WriteBatcher requires a ServerTransform to be configured on it before it begins. That transform is then applied to all documents in all batches. It's a good fit for when the same transform (with the same parameters) is intended to be applied against all documents, but that doesn't meet your requirements.

I am wondering if it might be simpler to just use NiFi's InvokeHTTP processor to access MarkLogic's /v1/documents endpoint directly. The advantage with that is you can specify a transform and transform parameters specific to the document you're ingesting. The downside of not having the multi-threaded batch support of PutMarkLogic can likely be mitigated by configuring InvokeHTTP itself to run with multiple threads.
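For reference, the /v1/documents endpoint takes the transform name in a `transform` query parameter and each transform parameter as `trans:{name}`. The Python sketch below only shows what such a per-document URL could look like; the helper name `build_documents_url`, the host and port, the transform name `logTransform` and the document URI are all made-up placeholders. In NiFi itself the same URL shape would presumably be written in the processor's URL property with expressions such as `${mscNumber}` in place of the literal values.

```python
from urllib.parse import urlencode


def build_documents_url(base_url, uri, transform, transform_params):
    """Build a /v1/documents URL carrying a transform and its per-document parameters."""
    query = [("uri", uri), ("transform", transform)]
    # Each transform parameter is sent as trans:{name}={value}.
    query += [("trans:" + name, value) for name, value in transform_params.items()]
    # Keep "/" and ":" readable; everything else is percent-encoded as usual.
    return base_url.rstrip("/") + "/v1/documents?" + urlencode(query, safe="/:")


# Placeholder values for illustration only.
url = build_documents_url(
    "http://localhost:8000", "/media/clip.mp4", "logTransform", {"mscNumber": "42"}
)
print(url)
```

Because the URL is built per request, each document can carry its own `trans:` values, which is the part PutMarkLogic cannot do.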

rjrudin commented 1 year ago

Another approach you can take here, in addition to using InvokeHTTP as mentioned above, is to use either CallRestExtensionMarkLogic or ExecuteScriptMarkLogic to insert a binary document per FlowFile, with FlowFile-specific parameters either being sent to the REST extension or incorporated in the script.

Unfortunately, because PutMarkLogic requires a single transform with a single set of transform parameters to be used, it's not possible to achieve your requirements with PutMarkLogic. As a result, I am going to close this ticket, but please reply if you run into any issues with one of the three recommended approaches here.