marklogic / java-client-api

Java client for the MarkLogic enterprise NoSQL database
https://docs.marklogic.com/guide/java
Apache License 2.0

QueryBatcher fails when using path range query #1283

Closed by rjrudin 3 years ago

rjrudin commented 3 years ago

So we can address your issue, please include the following:

Version of MarkLogic Java Client API

5.3.2

Version of MarkLogic Server

10.0-5

Java version

Java 8 and 11

OS and version

N/A

Input: Some code to illustrate the problem, preferably in a state that can be independently reproduced on our end

Below is a sample program that reproduces the bug. I have a path range index set up correctly on "/root/nst:dateTime", with "nst" declared as a path namespace in the database, and I have 6 documents that match the query (this is all from a marklogic-nifi test). Using queryManager.search, I get back the expected 6 documents. Using QueryBatcher, I get an error because the namespace prefix is not recognized.

package org.apache.nifi.marklogic.processor;

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.query.QueryManager;
import com.marklogic.client.query.StructuredQueryBuilder;
import com.marklogic.client.query.StructuredQueryDefinition;
import com.marklogic.client.util.EditableNamespaceContext;

import java.util.Arrays;

public class RangeIndexBug {

    public static void main(String[] args) {
        DatabaseClient client = DatabaseClientFactory.newClient("localhost", 8006, new DatabaseClientFactory.DigestAuthContext("admin", "admin"));
        QueryManager queryManager = client.newQueryManager();

        StructuredQueryBuilder queryBuilder = queryManager.newStructuredQueryBuilder();
        EditableNamespaceContext namespaceContext = new EditableNamespaceContext();
        namespaceContext.put("nst", "namespace-test");
        queryBuilder.setNamespaces(namespaceContext);

        StructuredQueryDefinition queryDef = queryBuilder.range(
                queryBuilder.pathIndex("/root/nst:dateTime"),
                "xs:dateTime", StructuredQueryBuilder.Operator.GT, "1999-01-01T00:00:00"
        );

        // Try a regular search
        String results = queryManager.search(queryDef, new StringHandle()).get();
        System.out.println("Search results: " + results);

        // Try a QueryBatcher
        DataMovementManager dmm = client.newDataMovementManager();
        QueryBatcher qb = dmm.newQueryBatcher(queryDef)
                .onUrisReady(batch -> System.out.println("Items: " + Arrays.asList(batch.getItems())))
                .onQueryFailure(failure -> System.out.println("Failure: " + failure.getMessage()));
        dmm.startJob(qb);
        qb.awaitCompletion();
        dmm.stopJob(qb);
    }
}

Actual output: What did you observe? What errors did you see? Can you attach the logs? (Java logs, MarkLogic logs)

Here's the output of the queryManager.search (just a snippet to verify I get data back):

<search:response snippet-format="snippet" total="6" start="1" page-length="10" xmlns:search="http://marklogic.com/appservices/search">
  <search:result index="1" uri="/PutMarkLogicTest/5.xml" path="fn:doc(&quot;/PutMarkLogicTest/5.xml&quot;)" score="0" confidence="0" fitness="0" href="/v1/documents?uri=%2FPutMarkLogicTest%2F5.xml" mimetype="application/xml" format="xml">
    <search:snippet>
      <search:match path="fn:doc(&quot;/PutMarkLogicTest/5.xml&quot;)/root/*:dateTime"><search:highlight>2000-01-01T00:00:00.000000</search:highlight></search:match>
    </search:snippet>
  </search:result>

And here's the error I got from using QueryBatcher:

[main] INFO com.marklogic.client.datamovement.impl.QueryBatcherImpl - (withForestConfig) Using forests on [localhost] hosts for "test-marklogic-nifi-content"
[main] WARN com.marklogic.client.datamovement.impl.QueryBatcherImpl - threadCount not set--defaulting to number of forests (1)
[main] INFO com.marklogic.client.datamovement.impl.QueryBatcherImpl - Starting job batchSize=1000, threadCount=1, onUrisReady listeners=2, failure listeners=4
Failure: com.marklogic.client.FailedRequestException: Local message: failed to apply resource at internal/uris: Internal Server Error. Server Message: XDMP-UNBPRFX: (err:XPST0081) Prefix nst has no namespace binding . See the MarkLogic server error log for further detail.

Expected output: What specifically did you expect to happen?

I expected QueryBatcher to find the same 6 documents

Alternatives: What else have you tried, actual/expected?

No workaround that I can find.

ehennum commented 3 years ago

Good catch. Here's a guess as to what's going on.

Starting in 10.0-5, the Java API converts the query to a cts.query once during initialization instead of on every request.

In the com.marklogic.client.datamovement.impl.QueryBatcherImpl#QueryBatcherImpl() constructor on line 99, the cts.query serialization is captured.

Somewhere, the conversion to the cts.query (possibly within the REST API internal endpoint) loses the namespace binding.

ehennum commented 3 years ago

Based on investigation...

Initialization converts the Search API representation of a path range query to the cts representation, which is serialized to JSON before returning to the client.

A cts.pathRangeQuery() doesn't take namespace declarations, so it serializes to JSON without namespace declarations.

By contrast, a cts.pathReference() does take namespace declarations, which are serialized to JSON.

A cts.rangeQuery() takes a cts.pathReference(), so one way to fix the issue would be to modify the conversion from the Search API representation to the cts representation in this case. That approach, however, would risk introducing a backward incompatibility on a stable component.

Another way to solve the problem would be to serialize to XML if namespaces are used and to JSON otherwise. That approach, however, would add complexity to both the interface and implementation of the REST API.

An expedient solution is to keep using the original query when it is a structured query builder query with namespaces or a raw query in XML format. The optimization will be skipped for such queries.

ehennum commented 3 years ago

The fix also skips the optimization and uses the original query if the query refers to persisted options.

If the functional tests have a path range index with a namespace, a good functional test would use a query batcher to get some results.
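
For readers following the thread, the rule described in the comments above (skip the precomputed cts.query when the query is a structured query builder query with namespaces, a raw query in XML format, or a query that refers to persisted options) could be sketched roughly as below. This is only an illustration in plain Java. QueryOptimizationSketch, QueryKind, QuerySketch, and canPrecomputeCtsQuery are hypothetical names, are not part of the Java Client API, and do not reflect how the actual fix is written.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch: none of these types exist in the MarkLogic Java Client API.
class QueryOptimizationSketch {

    // How the caller supplied the query (simplified for illustration).
    enum QueryKind { STRUCTURED_BUILDER, RAW_XML, RAW_JSON }

    // Minimal stand-in for a query definition.
    static final class QuerySketch {
        final QueryKind kind;
        final Map<String, String> namespaces; // prefix -> namespace URI
        final String optionsName;             // persisted options name, or null

        QuerySketch(QueryKind kind, Map<String, String> namespaces, String optionsName) {
            this.kind = kind;
            this.namespaces = namespaces;
            this.optionsName = optionsName;
        }
    }

    // Returns true only when none of the three conditions from the thread applies,
    // i.e. when reusing a cts.query serialized once at initialization looks safe.
    static boolean canPrecomputeCtsQuery(QuerySketch q) {
        // Queries that refer to persisted options fall back to the original query.
        if (q.optionsName != null && !q.optionsName.isEmpty()) {
            return false;
        }
        // Raw XML queries may carry namespace declarations that the JSON form drops.
        if (q.kind == QueryKind.RAW_XML) {
            return false;
        }
        // Builder queries with namespace bindings would lose those bindings.
        if (q.kind == QueryKind.STRUCTURED_BUILDER && !q.namespaces.isEmpty()) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Mirrors the query from the report: a builder query with the "nst" binding.
        QuerySketch fromIssue = new QuerySketch(
                QueryKind.STRUCTURED_BUILDER,
                Collections.singletonMap("nst", "namespace-test"),
                null);
        // The same kind of query without any namespace bindings.
        QuerySketch noNamespaces = new QuerySketch(
                QueryKind.STRUCTURED_BUILDER,
                Collections.<String, String>emptyMap(),
                null);

        System.out.println("Issue query can be precomputed: " + canPrecomputeCtsQuery(fromIssue));
        System.out.println("Namespace-free query can be precomputed: " + canPrecomputeCtsQuery(noNamespaces));
    }
}
```

Under these assumptions, the query from the report takes the fallback path because it carries the "nst" binding, while a builder query with no namespaces and no persisted options stays eligible for the optimization.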