marklogic / java-client-api

Java client for the MarkLogic enterprise NoSQL database
https://docs.marklogic.com/guide/java
Apache License 2.0

XCC performance compared to DMSDK performance during bulk ingest #779

Closed rjrudin closed 4 years ago

rjrudin commented 7 years ago

We have a Spring Batch-based data migration program that reads data from Oracle RDB and writes it to MarkLogic. We have a 3-node ML cluster that's now on 9.0-1.1, was on 8.0-6.3 before.

We've been using XCC and a custom "pooling" approach where we create an XCC ContentSource for each host. Each batch of documents to be written to ML is then handed off to a ContentSource in round robin fashion. A new Session is then created from the ContentSource, and then session.insertContent is called with an array of Content objects. Very simple, nothing fancy. We use a ContentCreateOptions object for each document, but all we do is set the format to XML, set the collections and permissions, and set the repair level to FULL.
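The round-robin handoff described above can be sketched in plain Java (a hypothetical `RoundRobinPool` class, not the actual migration code; the real version cycles through per-host XCC `ContentSource` objects the same way):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Round-robin selector: each call to next() returns the following item,
// wrapping around. AtomicInteger keeps it safe when batches are handed
// off from multiple writer threads.
class RoundRobinPool<T> {
    private final List<T> items;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinPool(List<T> items) {
        this.items = items;
    }

    T next() {
        // floorMod keeps the index non-negative even if the counter overflows
        return items.get(Math.floorMod(counter.getAndIncrement(), items.size()));
    }
}
```

In the real code, each item would be a ContentSource (one per host), and each batch would open a fresh Session from whichever source next() returns.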

We now have DMSDK support in the migration tool, alongside the XCC support (which one is used is determined by a command line argument). Our DMSDK code is very simple too, basically this:

    databaseClient = DatabaseClientFactory.newClient(host, port, user, password, DatabaseClientFactory.Authentication.BASIC);
    dataMovementManager = databaseClient.newDataMovementManager();
    writeBatcher = dataMovementManager.newWriteBatcher().withBatchSize(batchSize).withThreadCount(threadCount);
    jobTicket = dataMovementManager.startJob(writeBatcher);

When the migration program gets a batch of documents to write, it then just calls this for each document in the batch:

    writeBatcher.add(doc.getUri(), doc.getMetadata(), doc.getContent());

Once all the batches are written, we have the following cleanup code:

    writeBatcher.flushAndWait();
    dataMovementManager.stopJob(jobTicket);
    dataMovementManager.release();
    databaseClient.release();

Functionally, everything works fine, but we're consistently getting better results with our custom XCC approach. Details on the migration that we're testing with:

  1. The migration inserts 2,948,131 documents
  2. Each document is small - just 7 elements, with each element value having less than 100 characters

Details on our 3-node cluster - each node has the same specs: 48GB memory; 24 X5660 CPUs at 2.80GHz; and 500GB disk storage, with plenty of space (we're testing against an empty database). The migration program is being run on a separate machine, with 145GB memory; 24 X5675 CPUs at 3.07GHz; and 500GB storage.

Here are the 4 test runs I did, all with a thread count of 64:

| Library | Batch Size | Total Time (ms) | % of DMSDK |
| ------- | ---------- | --------------- | ---------- |
| XCC     | 100        | 339,899         | 76%        |
| XCC     | 200        | 343,193         | 77%        |
| DMSDK   | 100        | 453,655         | 102%       |
| DMSDK   | 200        | 444,564         | 100%       |
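The "% of DMSDK" column appears to be computed against the DMSDK batch-size-200 run (444,564 ms) as the 100% baseline; that inference (mine, not stated in the table) checks out arithmetically:

```java
// Recompute the "% of DMSDK" column. The baseline (444,564 ms) is the
// DMSDK batch-size-200 run; treating it as 100% reproduces every row.
class PercentOfBaseline {
    static final long BASELINE_MS = 444_564;

    static long percent(long timeMs) {
        return Math.round(100.0 * timeMs / BASELINE_MS);
    }
}
```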

I'm going to do a few more runs, but these are consistent with all the other runs I've done.

I've included an export of Monitoring History during the time frame where I did the above 4 runs (they were done in that order too). Interesting notes about what's in there:

  1. CPU is significantly higher during the DMSDK runs - around 50% compared to 25%
  2. There are lots of ETC hits during the DMSDK runs, but none during the XCC runs. This confuses me.
  3. There are lots more list cache hits during the DMSDK runs, which I figure is for the same reason as the ETC hits (there are plenty of duplicate values across the 2 million plus documents).

xcc-vs-dmsdk-overview-20170718-120732.xlsx

chiamarc commented 7 years ago

Just a quick note that aggregate average CPU for DMSDK is around 40% whereas for XCC it's around 21%... still, the execution time numbers speak for themselves and I wonder why DMSDK is doing roughly twice the amount of work. The ETC usage is suspicious as well.

sammefford commented 7 years ago

@rjrudin, are you able to share code?

rjrudin commented 7 years ago

Here's most of the code, just not in compilable form - that would take a bit more work. I can always Webex-demo this too.

Code for creating a ContentSource

https://github.com/marklogic-community/ml-javaclient-util/blob/master/src/main/java/com/marklogic/xcc/template/XccTemplate.java . This is a Spring-style Template/Callback setup so that the Callback implementation doesn't have to worry about opening/closing a Session.

One XccTemplate is created per host in the ML cluster.
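The Template/Callback shape can be illustrated without any XCC dependency (a hypothetical sketch, not the XccTemplate source; in the real template the resource is an XCC Session obtained from the host's ContentSource):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Spring-style template/callback: the template owns the resource lifecycle
// (create, hand to callback, close in finally), so the callback only
// contains the actual work, e.g. session.insertContent(...).
class Template {
    // Stand-in for an XCC Session (hypothetical, for the demo only).
    static class FakeSession {
        boolean closed;
        void close() { closed = true; }
    }

    interface Callback<T> {
        T execute(FakeSession session);
    }

    final AtomicInteger sessionsOpened = new AtomicInteger();
    FakeSession lastSession;

    <T> T execute(Callback<T> callback) {
        FakeSession session = new FakeSession(); // newSession() in the real template
        sessionsOpened.incrementAndGet();
        lastSession = session;
        try {
            return callback.execute(session);
        } finally {
            session.close(); // guaranteed cleanup, even if the callback throws
        }
    }
}
```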

Code for inserting data

This is the XccCallback implementation that's passed into the XccTemplate. The Document object is just a simple class with a URI, content, metadata, etc - kind of like a DocumentWriteOperation.

import java.io.ByteArrayInputStream;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.marklogic.client.helper.LoggingObject;
import com.marklogic.xcc.Content;
import com.marklogic.xcc.Session;
import com.marklogic.xcc.exceptions.RequestException;
import com.marklogic.xcc.template.XccCallback;

public class InsertDocumentsCallback extends LoggingObject implements XccCallback<Void> {

    private final static Logger PROGRESS_LOGGER = LoggerFactory.getLogger("PROGRESS");

    private List<Document> list;

    private static boolean infoEnabled = false;
    private static boolean debugEnabled = false;

    public static void initialize() {
        infoEnabled = PROGRESS_LOGGER.isInfoEnabled();
        debugEnabled = PROGRESS_LOGGER.isDebugEnabled();
    }

    public InsertDocumentsCallback(List<Document> list) {
        this.list = list;
    }

    @Override
    public Void execute(Session session) throws RequestException {
        Set<Content> docs = new HashSet<>();
        for (Document doc : list) {
            docs.add(doc.toContent());
        }
        int size = docs.size();
        if (infoEnabled) {
            PROGRESS_LOGGER.info("Inserting " + size + " documents");
        }
        try {
            session.insertContent(docs.toArray(new Content[] {}));
            if (debugEnabled) {
                PROGRESS_LOGGER.debug("Finished inserting " + size + " documents");
            }
        } catch (Exception e) {
            logger.error("Unable to insert batch of documents; cause: " + e.getMessage()
                    + "; logging each document that could not be inserted");
        }
        return null;
    }
}

Code for initializing DMSDK:

databaseClient = DatabaseClientFactory.newClient(host, port, user, password, DatabaseClientFactory.Authentication.BASIC);
dataMovementManager = databaseClient.newDataMovementManager();
writeBatcher = dataMovementManager.newWriteBatcher().withBatchSize(batchSize).withThreadCount(threadCount);
jobTicket = dataMovementManager.startJob(writeBatcher);

The code for adding documents is simple - no transform, just a URI, metadata, and content.

Code for cleaning up DMSDK:

writeBatcher.flushAndWait();
dataMovementManager.stopJob(jobTicket);
dataMovementManager.release();
databaseClient.release();
sammefford commented 7 years ago

I must be doing something wrong. My XCC implementation is taking 2.5-4 minutes--most of it in the session.insertContent method--whereas my DMSDK implementation is taking 25-27 seconds. Here's the code:

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.marklogic.client.helper.LoggingObject;
import com.marklogic.xcc.Content;
import com.marklogic.xcc.ContentCreateOptions;
import com.marklogic.xcc.ContentFactory;
import com.marklogic.xcc.ContentSource;
import com.marklogic.xcc.ContentSourceFactory;
import com.marklogic.xcc.Session;
import com.marklogic.xcc.exceptions.RequestException;
import com.marklogic.xcc.template.XccCallback;

public class Issue779Xcc {
  private final static Logger logger = LoggerFactory.getLogger(Issue779Xcc.class);

  public static void main(String[] args) throws Exception {
    new Issue779Xcc().run();
  }

  public void run() throws Exception {
    ContentSource source = ContentSourceFactory.newContentSource("engrlab-129-192.engrlab.marklogic.com", 8012, "admin", "admin");
    Session[] sessions = new Session[3];
    sessions[0] = source.newSession();
    source = ContentSourceFactory.newContentSource("engrlab-129-023.engrlab.marklogic.com", 8012, "admin", "admin");
    sessions[1] = source.newSession();
    source = ContentSourceFactory.newContentSource("engrlab-128-167.engrlab.marklogic.com", 8012, "admin", "admin");
    sessions[2] = source.newSession();

    ExecutorService executor = Executors.newFixedThreadPool(3);
    ContentCreateOptions meta = ContentCreateOptions.newTextInstance();
    meta.setCollections(new String[] {"Issue779Xcc"});
    AtomicLong timeSpent = new AtomicLong(0);
    for ( int i=0; i < 100; i++ ) {
      final List<Content> list = new ArrayList<>();
      for ( int j=0; j < 100; j++ ) {
        list.add(ContentFactory.newContent(i + "_" + j + ".txt", "abc", meta));
      }
      final Session session = sessions[i % 3];
      executor.execute(() -> new InsertDocumentsCallback(list, timeSpent).execute(session));
    }
    executor.shutdown();
    executor.awaitTermination(2, TimeUnit.MINUTES);
    executor.shutdownNow();
    executor.awaitTermination(2, TimeUnit.MINUTES);
    logger.debug("Time Spent just inserting: {}", timeSpent.get());
  }

  public class InsertDocumentsCallback extends LoggingObject implements XccCallback<Object> {
    private List<Content> list;
    private AtomicLong timeSpent;

    public InsertDocumentsCallback(List<Content> list, AtomicLong timeSpent) {
      this.list = list;
      this.timeSpent = timeSpent;
    }

    @Override
    public Object execute(Session session) {
      int size = list.size();
      logger.info("Inserting {} documents", size);
      try {
        long start = System.currentTimeMillis();
        session.insertContent(list.toArray(new Content[] {}));
        long duration = System.currentTimeMillis() - start;
        logger.debug("Finished inserting {} documents in {} millis", size, duration);
        timeSpent.getAndAdd(duration);
      } catch (Exception e) {
        logger.error("Unable to insert batch of documents; cause: " + e.getMessage()
            + "; logging each document that could not be inserted");
      }
      return null;
    }
  }
}
Issue779DMSDK.java:

package dataMovementTests;

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.BasicAuthContext;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.JobTicket;
import com.marklogic.client.datamovement.WriteBatcher;

public class Issue779DMSDK {
  public static void main(String[] args) {
    DatabaseClient databaseClient = DatabaseClientFactory.newClient("engrlab-129-192.engrlab.marklogic.com", 8012, new BasicAuthContext("admin", "admin"));
    DataMovementManager dataMovementManager = databaseClient.newDataMovementManager();
    WriteBatcher writeBatcher = dataMovementManager.newWriteBatcher()
      .withBatchSize(100)
      .withThreadCount(3);
    DocumentMetadataHandle meta = new DocumentMetadataHandle().withCollections("Issue779DMSDK");
    JobTicket jobTicket = dataMovementManager.startJob(writeBatcher);
    for ( int i=0; i < 10000; i++ ) {
      writeBatcher.addAs(i + ".txt", meta, "abc");
    }
    writeBatcher.flushAndWait();
    dataMovementManager.stopJob(jobTicket);
    dataMovementManager.release();
    databaseClient.release();
  }
}
sammefford commented 7 years ago

Here's an excerpt from the logging when I run Issue779Xcc:

11:50:11.762 [pool-1-thread-1] INFO  Issue779Xcc$InsertDocumentsCallback - Inserting 100 documents
11:50:18.594 [pool-1-thread-3] DEBUG Issue779Xcc$InsertDocumentsCallback - Finished inserting 100 documents in 7076 millis
11:50:18.817 [pool-1-thread-2] DEBUG Issue779Xcc$InsertDocumentsCallback - Finished inserting 100 documents in 9927 millis
11:50:19.082 [pool-1-thread-1] DEBUG Issue779Xcc$InsertDocumentsCallback - Finished inserting 100 documents in 7320 millis
11:50:19.083 [Issue779Xcc.main()] DEBUG XccBulkWrite - Time Spent just inserting: 378993
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:26 min
sammefford commented 7 years ago

Ok, it turns out the 4-5 minutes was the actual time for Issue779Xcc to run from my machine over VPN to the ML cluster. The only reason I saw anywhere near 2 mins was that I cut it short with these two lines:

executor.awaitTermination(2, TimeUnit.MINUTES);
executor.shutdownNow();

But when I run the same code from a machine on a local network to the ML cluster, the time drops to 7.4s-8.6s. The time for Issue779DMSDK also drops, but only to 9.3s-11.2s. So I'm now seeing something similar to what you are: XCC is about 28% faster than DMSDK when the network is fast.

rjrudin commented 7 years ago

I'm actually pooling ContentSource objects too - one per host - and a Session is used to insert N documents and then discarded. I haven't looked into the internals of XCC to understand the tradeoffs of keeping a Session open for a long time. But it doesn't appear to matter since you got similar results.

erinmiller commented 7 years ago

Hi Rob and Sam-- Can you share the versions of DMSDK and XCC that you are running? Thanks.

sammefford commented 7 years ago

The DMSDK version is Java Client API 4.0.1. The XCC version is 9.0.2.

sammefford commented 7 years ago

Here are some interesting things I've observed through my testing. I don't have any conclusions yet, just observations:

For both Issue779Xcc and Issue779DMSDK, I changed the # of docs written to 100,000 and the number of client threads to 64.

Baseline (avg of 4 runs) for Issue779Xcc: 11.1555s
Baseline (avg of 4 runs) for Issue779DMSDK: 13.0805s (17% slower than Xcc)

When I wipe out all REST logic in Modules/MarkLogic/rest-api/endpoints/document-item-update.xqy and just leave xdmp:set-response-code(200,"Bulk Change Written") then my DMSDK process takes an avg of 4.3274s, so that's the amount of time for the Java Client to generate and send the docs and for the REST API to spin up the rewriter and XQuery and respond.

When I only add back in the equivalent of document-insert 100 times (once per doc in a batch of 100), using the following code, I get an avg of 6.25875s. So that appears to be about the amount of time to just write the docs. One caveat: since these docs have a random number in their URIs, there's less time spent locking existing URIs. Perhaps I should change the tests to use random URIs just to compare...

xquery version "1.0-ml";

let $properties := <collection>Issue779DMSDK</collection>
let $options :=
  <options xmlns="xdmp:document-insert">
    <collections>{ $properties }</collections>
  </options>
let $random := xdmp:random(1000000)
let $document := text {"abc"}
for $i in (1 to 100)
  let $uri := $random || "_" || $i || ".txt"
return (
  xdmp:set-response-code(200,"Bulk Change Written"),
  xdmp:document-insert($uri,$document,$options)
)

When I add in the xdmp:document-set-properties in the code below, I get an avg of 10.9s.

xquery version "1.0-ml";

let $properties := <collection>Issue779DMSDK</collection>
let $options :=
  <options xmlns="xdmp:document-insert">
    <collections>{ $properties }</collections>
  </options>
let $random := xdmp:random(1000000)
let $document := text {"abc"}
for $i in (1 to 100)
  let $uri := $random || "_" || $i || ".txt"
return (
  xdmp:set-response-code(200,"Bulk Change Written"),
  xdmp:document-insert($uri,$document,$options),
  xdmp:document-set-properties($uri,$properties)
)

When I did the same without the random it was very slow (54.7275s).

let $properties := <collection>Issue779DMSDK</collection>
let $options :=
  <options xmlns="xdmp:document-insert">
    <collections>{ $properties }</collections>
  </options>
let $document := text {"abc"}
for $i in (1 to 100)
  let $uri := $i || ".txt"
return (
  xdmp:set-response-code(200,"Bulk Change Written"),
  xdmp:document-insert($uri,$document,$options),
  xdmp:document-set-properties($uri,$properties)
)

When I put everything back to normal but just delete the statement xdmp:document-insert($uri,$document,$options) from Modules/MarkLogic/rest-api/models/document-model-update.xqy, I get an avg of 6.77375s. When I also delete the statement xdmp:document-set-properties($uri,$properties), I get an avg of 4.9157s.
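Putting those averages together gives a rough cost decomposition of the 13.08s DMSDK baseline (my arithmetic on the add-back experiments above; each figure came from a separate run, so the pieces are approximate):

```java
// Back-of-envelope decomposition (seconds) using the averages reported in
// this thread. Differences between successive experiments approximate the
// cost of each step the REST endpoint performs.
class CostBreakdown {
    static final double FULL_ENDPOINT = 13.0805;  // unmodified REST endpoint
    static final double EMPTY_REST    = 4.3274;   // stripped to xdmp:set-response-code
    static final double INSERT_ONLY   = 6.25875;  // + 100x xdmp:document-insert
    static final double WITH_PROPS    = 10.9;     // + xdmp:document-set-properties

    static double insertCost()     { return INSERT_ONLY - EMPTY_REST; }   // ~1.93s
    static double propertiesCost() { return WITH_PROPS - INSERT_ONLY; }   // ~4.64s
    static double otherRestCost()  { return FULL_ENDPOINT - WITH_PROPS; } // ~2.18s
}
```

On these numbers, the fixed Java/HTTP/rewriter overhead (~4.3s) and the properties write (~4.6s) are the two largest components.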

sammefford commented 7 years ago

Side note: when I set Xcc threads to 32, the avg rises to 12.73575s; when I set Xcc threads to 16, it rises to 17.87825s. For DMSDK, changing the thread count to 16 didn't seem to affect the timings.

sammefford commented 7 years ago

I tried with the develop branch (with OkHttp instead of Jersey) and I got an avg of 12.74675s (14% slower than XCC, but that's better than the 17% slower we got with Jersey). When I once again strip the REST impl down to do nothing (so I can measure just the Java side and rewriter/Xquery init side) I got an avg of 3.237s (33% faster than the 4.3274s time we got with Jersey in the 4.0.1 version).

erinmiller commented 7 years ago

@sammefford, this is all great info. I am already drawing some conclusions myself, but I’d like to repro your results on the perf environment and do some profiling. I’m most interested/surprised by the properties insert taking so much time.

Is the code that you’re running now exactly the same as the code in issue 779? If so, I’ll start playing with it as well.

sammefford commented 7 years ago

Yes. I made one small tweak to the code due to #793. But the code here includes that fix.

rjrudin commented 7 years ago

We're on version 8.0.4 of XCC, and when I tried DMSDK, it was 4.0.1. I'm going to bump up our XCC version, as we're on ML 9.0-1.1 (haven't had a chance to go to 9.0-2 yet).

sammefford commented 7 years ago

FWIW, I did much of my testing with version 8.0.5 of XCC, and the avg was 11.1s, so roughly equivalent.

sammefford commented 7 years ago

One big thing I'm noticing in my testing is that things go into the database much faster the first time. This is probably expected behavior, since the first time there's no existing document that needs to be locked. But as I mentioned above, all my tests would probably run faster if I added a random number to the URIs so everything is going in for the first time. I just didn't see that as the most important thing to test.

For reference here are some clusters of tests showing the first run as much faster. The first two are XCC and the last is DMSDK.

7.325 10.687 11.764 10.914

8.368 11.714 11.527 11.923

8.384 12.198 15.562 13.008 12.972
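Averaging each cluster's later runs against its first run (my arithmetic on the numbers above, using a hypothetical FirstRunEffect helper) puts the first run roughly a third faster than the steady state:

```java
// For each cluster of runs, compare the first run to the average of the
// remaining runs; a ratio below 1.0 means the first run was faster.
class FirstRunEffect {
    static double firstVsRest(double[] runs) {
        double sum = 0;
        for (int i = 1; i < runs.length; i++) {
            sum += runs[i];
        }
        double restAvg = sum / (runs.length - 1);
        return runs[0] / restAvg;
    }
}
```

For the three clusters above, the ratios come out around 0.66, 0.71, and 0.62, i.e. the first run is roughly 29-38% faster.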

erinmiller commented 7 years ago

Are you inserting over existing URIs? Essentially a document update? If so, it’s not locking that’s the issue, it’s the fact that the existing fragment must be marked for deletion and the new fragment inserted. The deleted fragments are cleaned up via merge, which means merge has more work to do.

sammefford commented 7 years ago

Yes, we're inserting over existing uris, essentially a document update. I figure that's a common scenario for data movement applications.

Thanks for the clarification that it's marking for deletion not locking that's the issue. Am I correct to assume that merge should be largely independent of the timings we're seeing?

erinmiller commented 7 years ago

Well, merge may or may not be independent. If you’re doing big merge writes, you may see response times dip if you have an IO bottleneck, so more merges can definitely impact performance on IO constrained systems.

sammefford commented 7 years ago

In order to build my classes above, I used the following maven pom.xml:

<?xml version="1.0" encoding="utf-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.marklogic</groupId>
  <artifactId>javatests</artifactId>
  <packaging>jar</packaging>
  <version>0</version>
  <name>.</name>
  <description>.</description>
  <repositories>
    <repository>
      <id>jcenter</id>
      <url>http://jcenter.bintray.com</url>
    </repository>
  </repositories>
  <build>
    <sourceDirectory>.</sourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.5.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
          <includes>
            <include>Issue779Xcc.java</include>
            <include>Issue779DMSDK.java</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <dependencies>
    <dependency>
      <groupId>com.marklogic</groupId>
      <artifactId>marklogic-client-api</artifactId>
      <version>5.0-SNAPSHOT</version>
      <!--
      <version>4.0.1</version>
      -->
    </dependency>
    <dependency>
      <groupId>com.marklogic</groupId>
      <artifactId>ml-javaclient-util</artifactId>
      <version>2.14.0</version>
    </dependency>
    <dependency>
      <groupId>com.marklogic</groupId>
      <artifactId>marklogic-xcc</artifactId>
      <version>9.0.2</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
    </dependency>
  </dependencies>
</project>

And I ran the tests with the following commands:

mvn compile exec:java -Dexec.cleanupDaemonThreads=false -Dexec.mainClass="Issue779DMSDK"
mvn compile exec:java -Dexec.cleanupDaemonThreads=false -Dexec.mainClass="Issue779Xcc"
sammefford commented 7 years ago

In order to use <version>5.0-SNAPSHOT</version> of marklogic-client-api, I had to clone the develop branch and run this command:

mvn -Dmaven.test.skip -Dmaven.javadoc.skip install
erinmiller commented 7 years ago

Ok, I just ran this on our performance hardware and I'm not able to reproduce--in fact, on the perf hardware, the DMSDK is faster (3073 seconds for DMSDK versus 4356 for XCC). So I suspect that there's something fundamentally different about how the DMSDK uses resources that exposes a bottleneck differently on different hardware.

Sam, can I take a look at the environment where you're able to reproduce?

sammefford commented 7 years ago

How many documents did you insert? 3073 seconds is a lot more than the 13 seconds I was seeing for 100,000 docs!

erinmiller commented 7 years ago

Offline, Sam and I had a convo about the number of docs inserted, number of threads, etc. I think I've reproduced the problem at this point and I'm going to debug further on our hardware.

damonfeldman commented 6 years ago

In Rob's original note I see that REST is causing Expanded Tree Cache (i.e. query) activity, where XCC does not. Could this be why the performance characteristics vary so much, i.e., that some systems are more sensitive to query load?

BTW, it is quite counter-intuitive that a REST PUT will drive any query activity whatsoever.

xquery commented 6 years ago

As per recent internal discussions - it might be worth raising this with PM and Support to get it more firmly on the radar, so that if work has to be done we can get it into the plan.

I would be curious to see the behavior with just pure curl - I know this has been raised against the Java API client - though it would be useful to isolate the HTTP client's contribution to this.

sammefford commented 6 years ago

I'm not sure it would make sense to attempt to recreate this scenario with curl. We're talking about multi-threaded, multi-host writing here. This ninja move isn't even possible with vanilla XCC, only with the addition of the ml-javaclient-util library.

sammefford commented 6 years ago

@xquery also note this from above: "When I once again strip the REST impl down to do nothing (so I can measure just the Java side and rewriter/Xquery init side) I got an avg of 3.237s (33% faster than the 4.3274s time we got with Jersey in the 4.0.1 version)." So that means Java + HTTP + rewriter were contributing 4.3s to the overall time.

damonfeldman commented 6 years ago

I'd also be interested in measuring the case where the "REST impl [is] down to nothing" to see the CPU load. For an I/O bound system, the extra CPU might not have much effect on timings or throughput, but on a CPU bound system, it could have a pronounced effect.

jmakeig commented 6 years ago

Are ingest-heavy workloads typically CPU-bound? Are you talking about complex transformations on ingest? Specifics would be helpful.

xquery commented 6 years ago

FWIW, it's easy to do parallel requests with libcurl... it's also possible to do parallel multiple requests with the curl command (with bash/ssh fu it is simple to spawn off 1000s of parallel requests across multiple hosts).

I think it makes sense to isolate where the bottlenecks could be, hence my suggestion of curl - but it sounds like Erin already has a bead on it.

sammefford commented 6 years ago

When I change my payload to a more realistic document (the pom.xml file listed above, a 1.8K file), Xcc now performs much worse (30% slower) than DMSDK. It appears that with the tiny documents Xcc is faster, but with more normal documents of more than 1000 characters, DMSDK is faster. Can you see if you can reproduce with my code, Rob?

I tried my best to reproduce your env. I only have VMs to work with, though: 4 nodes, each with 48GB RAM and 24 CPUs; 1 for the Java client and 3 running ML 9.0-3. I'm using java-client-api 5.0-SNAPSHOT (develop branch), but that's equivalent to 4.0.3 right now.

sammefford commented 6 years ago

Issue779Xcc.java

import java.io.ByteArrayInputStream;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.Executors;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.marklogic.client.helper.LoggingObject;
import com.marklogic.xcc.Content;
import com.marklogic.xcc.ContentCreateOptions;
import com.marklogic.xcc.ContentFactory;
import com.marklogic.xcc.ContentSource;
import com.marklogic.xcc.ContentSourceFactory;
import com.marklogic.xcc.DocumentFormat;
import com.marklogic.xcc.Session;
import com.marklogic.xcc.exceptions.RequestException;
import com.marklogic.xcc.template.XccCallback;

public class Issue779Xcc {
  private final static Logger logger = LoggerFactory.getLogger(Issue779Xcc.class);

  public static void main(String[] args) throws Exception {
    new Issue779Xcc().run();
  }

  public void run() throws Exception {
    ContentSource source = ContentSourceFactory.newContentSource("engrlab-129-192.engrlab.marklogic.com", 8012, "admin", "admin");
    Session[] sessions = new Session[3];
    sessions[0] = source.newSession();
    source = ContentSourceFactory.newContentSource("engrlab-129-023.engrlab.marklogic.com", 8012, "admin", "admin");
    sessions[1] = source.newSession();
    source = ContentSourceFactory.newContentSource("engrlab-128-167.engrlab.marklogic.com", 8012, "admin", "admin");
    sessions[2] = source.newSession();

    ExecutorService executor = Executors.newFixedThreadPool(64);
    ContentCreateOptions meta = ContentCreateOptions.newXmlInstance();
    meta.setFormat(DocumentFormat.XML);
    //meta.setCollections(new String[] {"Issue779Xcc"});

    String xml = new BufferedReader(new FileReader("pom.xml")).lines().collect(Collectors.joining(""));

    AtomicLong timeSpent = new AtomicLong(0);
    for ( int i=0; i < 1000; i++ ) {
      final List<Content> list = new ArrayList<>();
      for ( int j=0; j < 100; j++ ) {
        list.add(ContentFactory.newContent(i + "_" + j + ".xml", xml, meta));
      }
      final Session session = sessions[i % 3];
      executor.execute(() -> new InsertDocumentsCallback(list, timeSpent).execute(session));
    }
    executor.shutdown();
    executor.awaitTermination(10, TimeUnit.MINUTES);
    executor.shutdownNow();
    executor.awaitTermination(10, TimeUnit.MINUTES);
    logger.debug("Time Spent just inserting: {}", timeSpent.get());
  }

  public class InsertDocumentsCallback extends LoggingObject implements XccCallback<Object> {
    private List<Content> list;
    private AtomicLong timeSpent;

    public InsertDocumentsCallback(List<Content> list, AtomicLong timeSpent) {
      this.list = list;
      this.timeSpent = timeSpent;
    }

    @Override
    public Object execute(Session session) {
      int size = list.size();
      logger.info("Inserting {} documents", size);
      try {
        Content[] content = list.toArray(new Content[] {});
        long start = System.currentTimeMillis();
        session.insertContent(content);
        long duration = System.currentTimeMillis() - start;
        logger.debug("Finished inserting {} documents, in {} millis, Uris {} to {}",
          size, duration, list.get(0).getUri(), list.get(list.size() - 1).getUri());
        timeSpent.getAndAdd(duration);
      } catch (Exception e) {
        logger.error("Unable to insert batch of documents; cause: " + e.getMessage()
            + "; logging each document that could not be inserted");
      }
      return null;
    }
  }
}
sammefford commented 6 years ago

Issue779DMSDK.java

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.stream.Collectors;
import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.DatabaseClientFactory.BasicAuthContext;
import com.marklogic.client.DatabaseClientFactory.DigestAuthContext;
import com.marklogic.client.io.Format;
import com.marklogic.client.io.DocumentMetadataHandle;
import com.marklogic.client.io.StringHandle;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.JobTicket;
import com.marklogic.client.datamovement.WriteBatcher;

public class Issue779DMSDK {
  public static void main(String[] args) throws Exception {
    DatabaseClient databaseClient = DatabaseClientFactory.newClient("engrlab-129-192.engrlab.marklogic.com", 8012, new BasicAuthContext("admin", "admin"));
    DataMovementManager dataMovementManager = databaseClient.newDataMovementManager();

    WriteBatcher writeBatcher = dataMovementManager.newWriteBatcher()
      .withBatchSize(100)
      .withThreadCount(64);
    JobTicket jobTicket = dataMovementManager.startJob(writeBatcher);
    String xml = new BufferedReader(new FileReader("pom.xml")).lines().collect(Collectors.joining(""));
    StringHandle xmlContent = new StringHandle(xml).withFormat(Format.XML);
    for ( int i=0; i < 100000; i++ ) {
      writeBatcher.addAs(i + ".xml", xmlContent);
    }
    writeBatcher.flushAndWait();
    dataMovementManager.stopJob(jobTicket);
    dataMovementManager.release();
    databaseClient.release();
  }
}
DALDEI commented 6 years ago

Several things not mentioned here. XCC (generally) does not incur any server-side interpreted (XQuery/JS) calls, even when connected to a 'normal' app server port (like 8000). Pure XDBC ports have even less code involved. The XML rewriter directs the XDBC protocol to a few special internal endpoints which flow directly through C++ code to do document insert/update/delete (query/eval obviously invokes a query).

REST-API-based code always invokes some XQuery code early in the connection, even if it does not use an XQuery-based rewriter - the dispatching code of REST invokes XQuery to implement the REST dispatch and result processing. I believe this is still true for 'pure' document PUTs (in order to implement some advanced features like quality, relevance, transactions, and collections). On the way 'out' the same is true: XCC returns results directly through the native C++ code path, whereas REST calls go through a common response formatter.

The intent of combining the XCC and HTTP ports in V8 was the presumption that this overhead was high and could be eliminated. To my knowledge it was never tested, and the 'pure document put' path was not implemented for REST calls. It is available if you write a custom XML rewriter (essentially using the XCC protocol but without the XCC library code). This can be validated or disproved by adding trace diagnostics for HTTP and the rewriter.

Another difference: the XCC library itself is highly optimized at a low level, 'barely' HTTP-compliant and certainly not 'RESTful'. It makes heavy use of NIO and async operations. Back-to-back writes are pushed aggressively on the same open connection using the HTTP 1.1 protocol. It can stream very well in both directions as long as the buffers are not skewed too much (too much more received than sent). I'm about to start on some tests (for other reasons) to see if I can validate these theories.

I have not seen formal tests, but in informal tests between mlcp (which uses XCC) and code using the Java REST library for pure document pushes, particularly many small documents back to back, XCC comes out way ahead.

Another difference is that XCC uses no third-party libraries in the middle tiers: it doesn't do POJO or document conversions, reflection, or third-party HTTP libraries. It streams a pre-prepared (binary) stream of document data directly over the socket layer, communicating flow control and metadata at the HTTP layer with the server. The code path in the server is 'hard coded' for XCC and can handle in C++ logic many conditions that require XQuery interpretive code for the REST API.

I do believe the speed advantage of XCC over REST decreases as document size increases and as batch size increases (for multi-document writes in REST). This is to be expected if the above assertions are true, i.e. that the majority of the performance difference is in protocol setup, dispatching, response processing, and teardown. For large documents (or large numbers of them on HTTP/1.1), the document insert cost prevails over protocol overhead, provided of course you don't use the advanced features of the various REST-based libraries, such as transformation on write; those artificially skew the comparison.

A few days ago I profiled a simple Java API application that wrote a few (45) small documents over many seconds using the POJO interface. The overhead contributed by the Java side of the API layer was surprisingly larger than I expected: reflection-based serialization, general-purpose code to figure out metadata (even when none was present), and sheer vast amounts of short-lived object creation at various layers until the final HTTP call was made. Then the result wound its way back up through the same layers. This is why I want to do a more formal test; too many side factors are in play that skew the results (one-time digest authentication, class loading, DNS, possibly ineffective connection pooling, etc.). But simply running JProfiler showed a huge 'inverted pyramid' from the simple one-liner POJO write down to where the actual socket writes were done. For these small objects the overhead was a significant portion of the time, including the wait for the MarkLogic response. Add that up, and I'd be really impressed if the higher-level APIs can achieve as little as 20% overhead.
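To make that "inverted pyramid" concrete, overhead can be expressed as the fraction of wall-clock time spent anywhere except the innermost (socket-level) call. The sketch below is illustrative only: `LayerTimer` and `busyWork` are made-up names, and the busy loops merely simulate the serialization/HTTP/socket layers a profiler would show.

```java
import java.util.function.Supplier;

// Hypothetical harness: time a call, then compute how much of the total
// is NOT accounted for by the innermost layer.
public class LayerTimer {

    // Times a single call and returns elapsed nanoseconds.
    public static <T> long timeNanos(Supplier<T> work) {
        long start = System.nanoTime();
        work.get();
        return System.nanoTime() - start;
    }

    // Overhead fraction: time not spent in the innermost (socket-level) call.
    public static double overheadFraction(long totalNanos, long innerNanos) {
        if (totalNanos <= 0) return 0.0;
        return (double) (totalNanos - innerNanos) / totalNanos;
    }

    public static void main(String[] args) {
        // Simulated layers: the "inner" loop stands in for the socket write,
        // the "total" loop for the full API-to-socket path.
        long inner = timeNanos(() -> { busyWork(1_000); return null; });
        long total = timeNanos(() -> { busyWork(5_000); return null; });
        System.out.printf("API-layer overhead: %.0f%%%n",
            overheadFraction(total, inner) * 100);
    }

    private static void busyWork(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) acc += i;
        if (acc == Long.MIN_VALUE) System.out.println(acc); // defeat dead-code elimination
    }
}
```

In a real test the two timings would come from nested instrumentation points (e.g. around the POJO write call and around the underlying socket write), not from simulated loops.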

I think this deserves some peer-reviewed, co-authored testing. This is not about one protocol or library being 'better'; pure raw performance is not always the most important factor, and there is high value in usability and advanced features. But it would be helpful to a lot of people to have an objective (perhaps crowd-sourced) set of reference performance data showing, as well as possible, what factors influence performance across the various ways of accessing MarkLogic. Most performance focus tends to be on finding the "killer ingest" case. That's a good case, but it's NOT ALL the useful cases. OTOH, there's a generally use-case-agnostic recommendation to use a particular library, when perhaps the use case should be considered more. Let users choose, backed with good-quality information, what their use case would benefit from most: raw blinding blob insertion, complex transformations in situ, a consistent API that 'plays nice' with their existing tools, or assistance with serialization of complex structures. And when the requirements change (today docs/sec is more important than lines of code required or library compatibility), there could be data to suggest what direction to go next, perhaps on a path of least change from where they are now.

I would be interested in comparing

Java API vs XCC

then across the space dimension (tiny docs, middle-sized docs, huge docs). Then across various blocking/chunking/multi-doc capabilities. Report on both latency and throughput, with breakdowns on client- and server-side components (time spent in the various 'layers' or 'functions').

Then compare both against

A hand-written simple HTTP endpoint (doing an xdmp:document-insert()), single doc and batches

Then compare with the low-level XML rewriter's ability to write directly to the database without any interpreted code (similar to how XCC is implemented in the server, but using standard HTTP and no client-side library).

Eval-based insert vs PUT; JSON vs XML documents; "text" vs "structured" vs "binary"

Then, if time permits, across different HTTP client libraries.
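The latency-plus-throughput reporting suggested above is simple arithmetic once per-document timings are recorded. A minimal sketch (class and method names are assumptions; percentiles use the nearest-rank method):

```java
import java.util.Arrays;

// Sketch: compute the latency and throughput numbers such a comparison would report.
public class PerfReport {

    // Nearest-rank percentile over a copy of the recorded latencies (milliseconds).
    public static double percentile(double[] latenciesMs, double p) {
        double[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    // Throughput in docs/sec given total documents written and elapsed wall-clock millis.
    public static double docsPerSecond(long docs, long elapsedMs) {
        return docs / (elapsedMs / 1000.0);
    }

    public static void main(String[] args) {
        // Fabricated sample timings for illustration only.
        double[] lat = {12, 15, 11, 90, 14, 13, 12, 16, 14, 13};
        System.out.printf("p50=%.0fms p99=%.0fms%n",
            percentile(lat, 50), percentile(lat, 99));
        System.out.printf("throughput=%.1f docs/sec%n", docsPerSecond(100_000, 25_000));
    }
}
```

Reporting a tail percentile (p99) alongside throughput matters here because protocol overhead per request shows up in latency long before it caps throughput.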

A similar test would be useful using the REST extensions to produce a custom REST endpoint with read and write transformation extensions, vs similar code written by hand in XQuery or JS, with and without use of the Java Client API.

How about the OTHER APIs: Node.js, Python(?), some new dev's Kotlin API over Vert.x? JavaScript vs XQuery vs 'pure C++' in the server? Document size/rate. Effect of memory, CPU, input rate, and document size on throughput and latency. Effect of adding property documents vs an 'envelope'. Effect of disabling the universal index. Push vs pull (pull == server code that does HTTP or xdmp:filesystem-file requests to load documents, vs outside processes 'pushing' to HTTP endpoints).

Maybe we could 'put to bed' the speculation and word of mouth and simply get some facts out, so people can better choose the balance of performance, ease of use, deployment, limitations, feature set, compatibility, support, skill set, future-proofing, etc. No one factor stands alone, but no factors stand at all if they are not measurable and measured.

If we could create a simple standard framework for such tests, maybe we could get participation from the community in providing specific use cases, along with the real-world application workloads they model. It would be very useful to evolve a validated reference where one could ask: 'Using a workload like MINE and a cluster like MINE, what are the API options, pros, cons, performance, and costs?', objectively determined (with a public, blockchain-validated and immutable audit trail) by community-interested researchers and their tests.

Sorry I had to put "blockchain" in somewhere :)

I did something like this on a small scale years ago, and to my surprise it's still running:

http://speedtest.xmlsh.org/

I was able to enlist about a thousand 'volunteers' over a 2-month period (well, 'Amazon Mechanical Turk paid volunteers' included), but it was worth the small cost to get a wide range of devices and independent tests by people with no correlated bias in the results. Although some did get very vocal and annoyed when the results they wanted couldn't be reproduced even on their own systems. One interesting fellow insisted that the results (which he never bothered trying) were provably false because "math proved otherwise"; he did claim to have a math degree, so maybe math beats experimental evidence. In his view there was no need to run the actual tests, as the results could only prove one thing: either the expected results (valid) or unexpected results (invalid). It would have been more valuable if he'd used his advanced education to peer-review the implementation and discover fundamental flaws in it. I'm sure there were some.

I think there is a good possibility that, given a simple framework and a published 'call to action', the community would be interested enough in helping the cause. It could end a lot of needless debates based on pure speculation and go directly to helping people make the right choices instead of arguing about them. Maybe find areas of agreement, perhaps assistance, that could be improved. Present new 'real world' cases that hadn't been considered. Add previously unconsidered external factors, with a forum to test and debate them: what does SSL do to the picture? How about digest auth? Load balancers? Geographic distance (a.k.a. latency)? Client or server load? Memory and CPU constraints, VMs or containers, proxies, different application environments, different measurement techniques and concepts, CPU vs network speed, RAM vs virtualization?

Anyone interested? Help decide on the methodology (so it's not overly biased)?
Help write standardized code snippets for the various libraries (to provide standard baselines as well as to compare deviations)?

I was going to start something extremely simplistic for my own personal use case tomorrow. Any help on even the core basics, like build files and server setup (ml-gradle?), a consistent way to launch and measure, a standardized corpus, etc., would be appreciated and given back. If anyone is interested in leading the way, I'd follow! If not, I hope to muddle along and share the results. Maybe a good debate about "WTF did you do THAT for, it's obviously bogus to the lamest of morons!" might interest people into contributing their expertise. Do direct browser-to-server interactions behave differently? How about long distance? Want to put YOUR pet project at the top of the list? Publish a paper somewhere on the findings? Can we prove once and for all that Windows sucks, or maybe that Windows rocks? Plug in your own hardware/software/algorithm/wild guess and stack it against your peers. Win a virtual ego trophy. Maybe a job doing performance enhancements.
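As a starting point for the standardized-snippets idea, a shared interface that each library implementation (XCC, Java Client API, raw HTTP, ...) plugs into would keep measurements comparable. Everything below is hypothetical scaffolding, not an existing framework; the names are invented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical skeleton for a shared benchmark framework: each library under
// test implements the same tiny interface so the harness and corpus are identical.
public class BenchmarkFramework {

    /** One ingest strategy under test. These names are assumptions, not a real API. */
    interface IngestRunner {
        String name();
        void writeBatch(String[] uris, String[] documents) throws Exception;
    }

    /** Runs the same synthetic corpus through a runner and returns elapsed millis. */
    static long run(IngestRunner runner, int batches, int batchSize) throws Exception {
        String[] uris = new String[batchSize];
        String[] docs = new String[batchSize];
        for (int i = 0; i < batchSize; i++) {
            uris[i] = "/bench/" + i + ".xml";
            docs[i] = "<doc>" + i + "</doc>";
        }
        long start = System.nanoTime();
        for (int b = 0; b < batches; b++) runner.writeBatch(uris, docs);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        Map<String, Long> results = new LinkedHashMap<>();
        // A no-op runner stands in here for real XCC / Java API implementations.
        IngestRunner noop = new IngestRunner() {
            public String name() { return "noop"; }
            public void writeBatch(String[] uris, String[] documents) {}
        };
        results.put(noop.name(), run(noop, 10, 100));
        results.forEach((name, ms) -> System.out.println(name + ": " + ms + "ms"));
    }
}
```

The point of the no-op runner is to measure the harness's own overhead, so it can be subtracted from the per-library numbers.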

rjrudin commented 6 years ago

@vivekmuniyandi @jmakeig @damonfeldman I am wondering if this can be closed with guidance on what value to use for "update-policy" at http://docs.marklogic.com/guide/rest-dev/service#id_43814 for the underlying REST API instance. It defaults to merge-metadata, and perhaps there's a better default value? I'm not able to tell based on the docs.

My guess is that my results differed from what Sam found because of differences in the underlying REST API server. I no longer have the ability to rerun these tests, so I'm fine with this being closed; it would just be good to know how the values for "update-policy" are expected to impact DMSDK performance.

kcoleman-marklogic commented 6 years ago

I'm doubtful that tweaking update-policy will be very relevant unless you are spending a lot of time updating existing documents.

You could try using "overwrite-metadata" rather than "merge-metadata", but I don't know whether it will make any difference in cases where the document doesn't exist to begin with. Vivek could probably say more from inspecting the implementation.
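For anyone wanting to try this: if I'm reading the REST configuration docs correctly, the instance property can be changed with a PUT to /v1/config/properties on the REST server, with a payload along these lines (documented values are merge-metadata, overwrite-metadata, version-required, and version-optional):

```json
{ "update-policy": "overwrite-metadata" }
```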

damonfeldman commented 4 years ago

Now that Data Services have been put in place with a more optimized code path, should we close this out? See: https://github.com/marklogic/java-client-api/wiki/Data-Services .

ehennum commented 4 years ago

Point well taken @damonfeldman

This issue beneficially focused attention on the comparison.