marklogic / nifi

Mirror of Apache NiFi to support ongoing MarkLogic integration efforts
https://marklogic.github.io/nifi/
Apache License 2.0

Content Variable doesn't encode flow content properly in ExecuteScriptMarkLogic #205

Closed grtjn closed 7 months ago

grtjn commented 1 year ago

We ingest data using ExecuteScriptMarkLogic because we like to perform custom checks and add ingest metadata while inserting into MarkLogic. We pass the flow file contents through to the XQuery code using the Content Variable property. However, we noticed that when our flow files contain diacritics, as French, German, and Polish names and addresses often do, they end up garbled in MarkLogic. After checking thoroughly, we concluded that ExecuteScriptMarkLogic is not ensuring the content gets sent as UTF-8, which is probably what MarkLogic expects.
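For illustration only, the kind of garbling described above can be reproduced in plain Java by decoding UTF-8 bytes with a different charset. This is a minimal sketch, not the processor's actual code: the class name `MojibakeDemo`, its methods, and the choice of ISO-8859-1 as the "wrong" charset are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of NiFi or the MarkLogic processors.
final class MojibakeDemo {

    // Encode as UTF-8, then decode with the wrong charset (ISO-8859-1 is an assumption).
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encode and decode with the same charset; the text survives unchanged.
    static String roundTrip(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file itself free of encoding issues.
        String name = "Zo\u00eb M\u00fcller";
        System.out.println("round trip : " + roundTrip(name));
        System.out.println("misdecoded : " + misdecode(name));
    }
}
```

Each accented character occupies two bytes in UTF-8, so a single-byte decoder turns it into two unrelated characters, which matches the symptom of names and addresses arriving garbled.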

We are using the MarkLogic NiFi processors v1.16.3.2.

grtjn commented 1 year ago

As a workaround, we base64-encode the flow file contents before passing them into the ExecuteScriptMarkLogic processor, and decode them inside the XQuery code that the processor executes.
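The reason this workaround helps can be sketched in Java: a Base64 payload contains only ASCII characters, which common charsets encode identically, so it passes through a mismatched decode step unharmed. The class `Base64Workaround` and its methods are hypothetical names for illustration; the actual flow does the encoding in NiFi and the decoding in XQuery, and the mismatched charset used here (ISO-8859-1) is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical demo class; the real workaround encodes in NiFi and decodes in XQuery.
final class Base64Workaround {

    // Encode the UTF-8 bytes of the content as an ASCII-only Base64 string.
    static String encode(String content) {
        return Base64.getEncoder().encodeToString(content.getBytes(StandardCharsets.UTF_8));
    }

    // Decode the Base64 string and interpret the resulting bytes as UTF-8.
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    // Simulate a transport step that decodes with the wrong charset (assumed ISO-8859-1).
    static String throughWrongCharset(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String city = "Gda\u0144sk";
        String wire = throughWrongCharset(encode(city));
        System.out.println("on the wire: " + wire);
        System.out.println("decoded    : " + decode(wire));
    }
}
```

The cost is a payload roughly one third larger plus an extra decode step in the script, which is why it is a workaround rather than a fix.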

rjrudin commented 1 year ago

Can you check a couple of things:

  1. Does a LogAttribute processor log the contents of the flowfile correctly - i.e. the correct diacritics appear?
  2. Can you ensure that the JVM running NiFi has the default encoding set to UTF-8? More info on that at https://www.geeksforgeeks.org/how-to-get-and-set-default-character-encoding-or-charset-in-java/ .
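For the second check, a small standalone Java program run with the same JVM and options as NiFi can report the default encoding. This is a sketch under that assumption; `DefaultCharsetCheck` is a hypothetical class name, while `Charset.defaultCharset()` and the `file.encoding` system property are standard Java.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; run it with the same JVM and options that NiFi uses.
final class DefaultCharsetCheck {

    // True when the JVM's default charset is UTF-8.
    static boolean defaultIsUtf8() {
        return StandardCharsets.UTF_8.equals(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());
        System.out.println("UTF-8 default   = " + defaultIsUtf8());
    }
}
```

On JDK 18 and later the default charset is UTF-8 regardless of the platform locale; on older JVMs it depends on the operating system settings unless the JVM is started with `-Dfile.encoding=UTF-8`.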

We had a report similar to this recently, and it turned out that the content was already mangled before the MarkLogic processor received the data.

mitchshepherd commented 9 months ago

@grtjn, any word on the previous questions? We'd love to ensure this has been properly addressed.

grtjn commented 9 months ago

Missed the earlier comment, sorry. Let me get back to you about this.

rjrudin commented 7 months ago

@grtjn Going to close this for bookkeeping purposes, but please continue the conversation here if you have results from the questions above - I'm specifically wondering what LogAttribute printed out.