marklogic-community / marklogic-nifi-incubator

A collaboration space for processors, recipes, templates, etc. NOTE: improvements made to the connector in this project have been incorporated into the MarkLogic NiFi repository (https://github.com/marklogic/nifi).
Apache License 2.0

Prevent XDMP-CONFLICTINGUPDATES in PutMarkLogic #25

Closed dmcassel closed 4 years ago

dmcassel commented 5 years ago

When PutMarkLogic's batch size is greater than one, there is the potential for XDMP-CONFLICTINGUPDATES errors if more than one flow file has the same URI (I'm working with a client for whom this happens a lot). For large batches, it can take a while for the batch to fail, but it would be pretty simple to detect the duplicate URIs and address them, which would allow for faster processing and lower load on MarkLogic.

I'd like to propose a property DuplicateURIHandling, with the following values:

fsnow commented 5 years ago

There's one more case, where you want both duplicate documents to be ingested because they represent a stream of history (bitemporal?). In that case, you want the current batch to be sent, then a new one opened to include the latest document. In any case, we should prevent XDMP-CONFLICTINGUPDATES. I would also like to prevent the situation where the same URI is inserted in different batches at the same time, leading to deadlocks. Better if these changes could be made in DMSDK for a wider benefit, but otherwise we could fix it in PutMarkLogic.
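The "send the current batch, then open a new one" behavior described above can be sketched in plain Java. This is only an illustration under assumptions: `UriBatchSplitter` and `split` are hypothetical names, not part of PutMarkLogic or DMSDK, and the sketch works on bare URI strings rather than flow files.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of NiFi, PutMarkLogic, or DMSDK.
final class UriBatchSplitter {

    private UriBatchSplitter() {}

    /**
     * Splits URIs (in arrival order) into batches of at most maxBatchSize.
     * The current batch is closed early whenever the next URI is already in it,
     * so no single batch ever contains the same URI twice.
     */
    static List<List<String>> split(List<String> uris, int maxBatchSize) {
        if (maxBatchSize < 1) {
            throw new IllegalArgumentException("maxBatchSize must be at least 1");
        }
        List<List<String>> batches = new ArrayList<>();
        List<String> current = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String uri : uris) {
            // Close the batch when it is full or when this URI would be a duplicate.
            if (current.size() == maxBatchSize || seen.contains(uri)) {
                batches.add(current);
                current = new ArrayList<>();
                seen.clear();
            }
            current.add(uri);
            seen.add(uri);
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }
}
```

With the input `/a, /b, /a, /c` and a batch size of 10, this yields two batches, `[/a, /b]` and `[/a, /c]`. It only removes duplicates within a batch; if the resulting batches were written concurrently, the same URI could still be updated by two batches at once, which is the deadlock concern raised above.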

sjiang99 commented 5 years ago

From @rjrudin: It looks like this can be accomplished in the NiFi processors by implementing a BatchListener, which would look at all the URIs to be written in a batch and then determine what to do. Because this construct exists already, I don’t think there’s really a concept of “fixing” this in DMSDK; there’s just a concept of providing an OOTB BatchListener to let the client choose what to do.

For the problem of duplicate URIs in multiple batches – I’d consider that out of scope for now. Just focus on duplicate URIs in the same batch.

I like the ideas that Dave presented (it is a breath of fresh air for me to read a ticket that identifies both a problem and a possible solution, as opposed to just a vaguely worded, non-reproducible problem). They should be abstracted away from NiFi though. The choices are similar though – when 2 or more of the same URI are detected in a batch, the choices could be:

- Do nothing and let the server handle it, which may mean throwing an error
- Eagerly throw an error
  - Note this is a little tricky because there could be multiple duplicate URIs – should the error capture all of them, or just the first one that’s found? If multiple, how to report all of them in the error message?
- Use the first URI found
- Use the last URI found
- Don’t write the batch, and log a message
  - Same caveat as above for reporting multiple duplicate URIs
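As a rough sketch of how those choices might look when kept separate from NiFi, here is a plain-Java helper. Everything in it is an assumption for illustration: the class `DuplicateUriHandling`, the `Strategy` names, and the `resolve` and `findDuplicates` methods are hypothetical and are not the values proposed for the DuplicateURIHandling property, nor any DMSDK API. Wiring it into a listener is not shown. For the error and skip cases it reports all duplicate URIs rather than only the first, which is one possible answer to the question raised above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper; none of these names come from NiFi or DMSDK.
final class DuplicateUriHandling {

    enum Strategy { IGNORE, FAIL, USE_FIRST, USE_LAST, SKIP_BATCH }

    private DuplicateUriHandling() {}

    /** Returns every URI that occurs more than once, in order of first repeat. */
    static <T> Set<String> findDuplicates(List<T> batch, Function<T, String> uriOf) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (T item : batch) {
            String uri = uriOf.apply(item);
            if (!seen.add(uri)) {
                duplicates.add(uri);
            }
        }
        return duplicates;
    }

    /** Returns the items that should actually be written for this batch. */
    static <T> List<T> resolve(List<T> batch, Function<T, String> uriOf, Strategy strategy) {
        if (strategy == Strategy.IGNORE) {
            return new ArrayList<>(batch); // let the server handle any conflict
        }
        Set<String> duplicates = findDuplicates(batch, uriOf);
        if (duplicates.isEmpty()) {
            return new ArrayList<>(batch);
        }
        switch (strategy) {
            case FAIL:
                // Reports all duplicate URIs, not just the first one found.
                throw new IllegalStateException("Duplicate URIs in batch: " + duplicates);
            case SKIP_BATCH:
                System.err.println("Skipping batch; duplicate URIs: " + duplicates);
                return new ArrayList<>();
            case USE_FIRST:
            case USE_LAST:
                // One entry per URI, ordered by where each URI first appeared.
                Map<String, T> byUri = new LinkedHashMap<>();
                for (T item : batch) {
                    String uri = uriOf.apply(item);
                    if (strategy == Strategy.USE_LAST) {
                        byUri.put(uri, item);
                    } else {
                        byUri.putIfAbsent(uri, item);
                    }
                }
                return new ArrayList<>(byUri.values());
            default:
                throw new IllegalArgumentException("Unhandled strategy: " + strategy);
        }
    }
}
```

The item type is left generic so the same logic could apply to flow files or to whatever a batch holds, given a function that extracts the URI.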

dmcassel commented 5 years ago

@sjiang99 @rjrudin is this issue likely to get addressed in the near future? It's an on-going problem for one of my clients; if a fix isn't coming down the pipe soon, I'll see about taking this on.

I'm assuming that in order to use a 1.9.x release of these NARs, I need to get my NiFi up to 1.9.x.

sjiang99 commented 4 years ago

@dmcassel @garyvidal I encountered some unit test errors after the PR was merged into ml-develop. The failures occurred in tests that were not changed: ExecuteScriptMarkLogicTest, ExtensionCallMarkLogicTest, PutMarkLogicRecordTest, QueryMarkLogicTest and SSLConnectionTest. I see that AbstractMarkLogicProcessorTest was changed as part of the PR. Have you guys encountered this issue? We are working on a new release now. The exception is: java.lang.IllegalStateException: Controller service DefaultMarkLogicDatabaseClientService[id=databaseClientService] cannot be modified because it is not disabled