marklogic / nifi

Mirror of Apache NiFi to support ongoing MarkLogic integration efforts
https://marklogic.github.io/nifi/
Apache License 2.0

Feature Suggestion - additional info from metadata #133

Closed DavidEnnis-CleverLlamas closed 2 years ago

DavidEnnis-CleverLlamas commented 2 years ago

In the use of QueryMarkLogic, you can set the option to return metadata with or without the content.

Under the hood, MarkLogic returns the entire payload (metadata-values, collections, permissions, quality, properties).

However, the implementation seems to toss out some of the metadata and only sets

I need additional information from the rapi:metadata payload. For the first use-case, I need collections. It would be a shame to have to make a second call for information that is already provided.

I was thinking of one of the following:

  1. Add the missing items as attributes (permission:, collection, quality) just like meta: and property: (formats to be considered) AND/OR
  2. Add the entire rapi:metadata fragment as an attribute so that at least it is available.

Willing to work on this if there is value in it going back into the main project. -David Ennis

rjrudin commented 2 years ago

Thanks @17llamas - approach 1 above seems like a simple and logical default thing to do. We'll get this into the next release.

rjrudin commented 2 years ago

@17llamas Let me know how this sounds for exposing collections, permissions, and document quality:

  1. Collections will be added as a "collections" attribute with all collections joined in a comma-delimited string
  2. For each unique role in the set of permissions, a "permission:(role-name)" attribute will be added with the list of capabilities for that role joined in a comma-delimited string - e.g. "permission:my-role" = "read,update"
  3. The document quality will be added to a "quality" attribute

We are considering adding an "ml-" prefix to each of these, though we initially won't touch the "meta:" and "property:" prefixes. That would help ensure uniqueness for these FlowFile attributes so that they don't collide with existing attributes.
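
The proposal above can be sketched in a few lines. This is only an illustration of the attribute layout being discussed, written in plain Python rather than against the NiFi or MarkLogic APIs; the function name `metadata_to_attributes`, its parameters, and the `(role, capability)` pair representation are hypothetical and chosen for the example.

```python
def metadata_to_attributes(collections, permissions, quality, prefix=""):
    """Flatten document metadata into string-valued attributes.

    Hypothetical helper: `permissions` is assumed to be a list of
    (role, capability) pairs, and `prefix` models the optional "ml-" prefix.
    """
    attributes = {}

    # 1. All collections joined in one comma-delimited string.
    attributes[prefix + "collections"] = ",".join(collections)

    # 2. One attribute per unique role, listing that role's capabilities.
    capabilities_by_role = {}
    for role, capability in permissions:
        capabilities = capabilities_by_role.setdefault(role, [])
        if capability not in capabilities:
            capabilities.append(capability)
    for role, capabilities in capabilities_by_role.items():
        attributes[prefix + "permission:" + role] = ",".join(capabilities)

    # 3. Document quality as a string, since attribute values are strings.
    attributes[prefix + "quality"] = str(quality)
    return attributes


example = metadata_to_attributes(
    ["test1", "test2"],
    [("my-role", "read"), ("my-role", "update")],
    12,
)
```

With these inputs, `example` holds `collections`, `permission:my-role` and `quality` keys; passing `prefix="ml-"` would yield `ml-collections`, `ml-permission:my-role` and `ml-quality` instead.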

rjrudin commented 2 years ago

Some logging (via the LogAttribute processor) showing all the metadata for some test documents:

-------------------QUERY RESULT-------------------
FlowFile Attribute Map Content
Key: 'filename'
        Value: '/PutMarkLogicTest/20.xml'
Key: 'marklogic-collections'
        Value: 'QueryMarkLogicTest-2,QueryMarkLogicTest,test1'
Key: 'marklogic-permissions'
        Value: 'rest-writer,update,rest-reader,read,rest-reader,execute'
Key: 'marklogic-quality'
        Value: '12'
Key: 'meta:meta1'
        Value: 'hello1'
Key: 'meta:meta2'
        Value: 'hello2'
Key: 'meta:my-uri'
        Value: '/PutMarkLogicTest/20.xml'
Key: 'path'
        Value: './'
Key: 'property:{org:example}hello'
        Value: 'world'
Key: 'uuid'
        Value: '35eb577d-f996-4773-a16a-9c25c67666ac'
-------------------QUERY RESULT-------------------
<?xml version="1.0" encoding="UTF-8"?>
<root><sample>xmlcontent</sample><dateTime xmlns="namespace-test">2000-01-01T00:00:00.000000</dateTime></root>
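
For anyone consuming these attributes downstream, here is a minimal sketch of splitting the `marklogic-permissions` value back into roles and capabilities. It assumes, based only on the sample log above, that the value alternates role and capability; that format is an inference and not a documented contract, and `parse_permissions` is a hypothetical helper name.

```python
def parse_permissions(value):
    """Group a flat 'role,capability,role,capability' string by role.

    Assumption: tokens alternate role, capability, as in the logged sample.
    """
    tokens = [token for token in value.split(",") if token]
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating role,capability tokens")

    capabilities_by_role = {}
    # Pair up even-indexed tokens (roles) with odd-indexed tokens (capabilities).
    for role, capability in zip(tokens[0::2], tokens[1::2]):
        capabilities_by_role.setdefault(role, []).append(capability)
    return capabilities_by_role


parsed = parse_permissions(
    "rest-writer,update,rest-reader,read,rest-reader,execute"
)
```

For the logged value, `parsed` maps `rest-writer` to `["update"]` and `rest-reader` to `["read", "execute"]`.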

DavidEnnis-CleverLlamas commented 2 years ago

Hi Rob.

Sorry for the late reply. This looks great.

Also, one separate question: is there a purpose behind the design choice not to pass along the attributes that came in from upstream? It appears to stem from creating a session rather than cloning one.

Regards, David Ennis

On Tue, 23 Aug 2022, 21:34 Rob Rudin, @.***> wrote:

@17llamas https://github.com/17llamas Let me know how this sounds for exposing collections, permissions, and document quality:

  1. Collections will be added as a "collections" attribute with all collections joined in a comma-delimited string
  2. For each unique role in the set of permissions, a "permission:(role-name)" attribute will be added with the list of capabilities for that role joined in a comma-delimited string - e.g. "permission:my-role" = "read,update"
  3. The document quality will be added to a "quality" attribute
  4. The entire metadata fragment will be added as a "document-metadata" attribute

We are considering adding an "ml-" prefix to each of these, though we initially won't touch the "meta:" and "property:" prefixes. That would help ensure uniqueness for these FlowFile attributes so that they don't collide with existing attributes.


rjrudin commented 2 years ago

@17llamas That's a good question - there are some areas between processors where behavior differs when it seems like it should be the same. For example, I would think that any processor that retrieves one to many items from ML would follow the same original/results pattern, where each FlowFile sent to "results" is a clone of the original FlowFile sent to "original".
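
As a rough model of the original/results pattern described here, the sketch below treats FlowFile attributes as plain dictionaries. It is not NiFi code; `build_result_attributes` is a hypothetical name, and letting document metadata win on key collisions is an assumption made for the example.

```python
def build_result_attributes(original_attributes, document_attributes):
    """Model a 'results' FlowFile as a clone of the original plus metadata.

    Assumption: on a key collision the document's value replaces the
    upstream value. The original mapping is left unmodified.
    """
    result = dict(original_attributes)  # start from a copy of the upstream attributes
    result.update(document_attributes)  # then layer the per-document metadata on top
    return result


original = {"batch.id": "42", "filename": "trigger.txt"}
document = {"filename": "/PutMarkLogicTest/20.xml", "marklogic-quality": "12"}
result = build_result_attributes(original, document)
```

Here `result` keeps the upstream `batch.id`, takes `filename` from the document, and gains `marklogic-quality`, while `original` stays as it was for routing to "original".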

I am going to look into this further for 1.16.3.2 to firm up consistency between the processors. Going to get 1.16.3.1 out on Monday to address an SSL bug in RunFlowMarkLogic and then will get a plan together for 1.16.3.2.

DavidEnnis-CleverLlamas commented 2 years ago

Hi Rob

A few notes:

Good that you will look at standardizing the Controllers a bit. I have gone through each line-by-line and it looks like they are created at different times by different people - and in some cases, for certain specific use-cases. This is clear when you look at the rows endpoint where very few of the options of the API are available to configure (so in my case, I use the eval endpoint and run the optic query from there).

Regarding upstream flow attributes not being passed along, as is the case with QueryML, I have opened a separate item for that since it has its own defined problem statement.

rjrudin commented 2 years ago

Will be addressing the properties issue in the next release.