palantir / atlasdb

Transactional Distributed Database Layer
https://palantir.github.io/atlasdb/
Apache License 2.0

Batch commit operations #1548

Open nziebart opened 7 years ago

nziebart commented 7 years ago

Update: Batching of steps 1-3 was done in #2072; running 7-9 async remains open.

Over the course of committing a write transaction, we make many remote calls:

  1. Get immutable lock timestamp (TimestampService#getTimestamp)
  2. Lock immutable lock timestamp (LockService#lock)
  3. Get immutable timestamp (LockService#getMinLockedInVersionId)
  4. Acquire write locks (LockService#lock)
  5. Get commit timestamp (TimestampService#getTimestamp)
  6. Verify write locks before commit (LockService#refreshLockRefreshTokens)
  7. Check if any locks expired after commit (LockService#refreshLockRefreshTokens)
  8. Unlock write locks (LockService#unlock)
  9. Unlock immutable ts lock (LockService#unlock)

These calls all require the remote service to verify that it is the leader, and each has a roundtrip time of ~3-4ms when testing locally in an HA setup (recent straw perf tests of the timelock server show similar times). With the additional overhead of putUnlessExists, performing the writes, and checking for conflicts, this means we can expect close to ~50ms as a minimum time to commit any write.
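
For orientation, here is a minimal sketch of that call sequence. The TimestampService, LockService, and LockToken types below are simplified stand-ins, not the real AtlasDB interfaces; the point is just where each remote round trip falls relative to the commit work:

```java
// Simplified stand-ins for illustration only; not the real AtlasDB interfaces.
import java.util.Set;
import java.util.function.LongConsumer;

interface TimestampService {
    long getTimestamp();                       // steps 1 and 5
}

interface LockService {
    LockToken lockImmutableTimestamp(long ts); // step 2
    long getMinLockedInVersionId();            // step 3
    LockToken lockRows(Set<String> rows);      // step 4
    boolean refresh(LockToken token);          // steps 6 and 7
    void unlock(LockToken token);              // steps 8 and 9
}

final class LockToken {}

final class CommitPathSketch {
    private final TimestampService timestamps;
    private final LockService locks;

    CommitPathSketch(TimestampService timestamps, LockService locks) {
        this.timestamps = timestamps;
        this.locks = locks;
    }

    void commit(Set<String> writtenRows, LongConsumer putUnlessExistsAndCheckConflicts) {
        long immutableLockTs = timestamps.getTimestamp();                           // 1: remote
        LockToken immutableTsLock = locks.lockImmutableTimestamp(immutableLockTs);  // 2: remote
        long immutableTs = locks.getMinLockedInVersionId();                         // 3: remote
        // immutableTs is retained by the transaction for reads/sweep; not used further here.
        LockToken writeLocks = locks.lockRows(writtenRows);                         // 4: remote
        long commitTs = timestamps.getTimestamp();                                  // 5: remote
        if (!locks.refresh(writeLocks)) {                                           // 6: remote
            throw new IllegalStateException("write locks expired before commit");
        }
        putUnlessExistsAndCheckConflicts.accept(commitTs);  // KVS work between steps 6 and 7
        locks.refresh(writeLocks);                                                  // 7: remote
        locks.unlock(writeLocks);                                                   // 8: remote
        locks.unlock(immutableTsLock);                                              // 9: remote
    }
}
```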

Some ideas:

- Steps 7-9 could be done asynchronously (a rough sketch follows below)
- Steps 1-3 could possibly be batched in some way, or perhaps only done once every x seconds instead of for every transaction
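
A rough sketch of the first idea, reusing the hypothetical LockService and LockToken types from the sketch above (again illustrative, not the real code): once the commit has gone through, the remaining lock check and unlocks could be handed to a background executor instead of costing the committing thread three more round trips.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Reuses the hypothetical LockService / LockToken types from the sketch above.
final class AsyncPostCommitSketch {
    private final ExecutorService cleanupExecutor = Executors.newSingleThreadExecutor();

    CompletableFuture<Void> cleanUpAsync(LockService locks, LockToken writeLocks, LockToken immutableTsLock) {
        return CompletableFuture.runAsync(() -> {
            locks.refresh(writeLocks);      // step 7, now off the commit path
            locks.unlock(writeLocks);       // step 8
            locks.unlock(immutableTsLock);  // step 9
        }, cleanupExecutor);
    }
}
```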

Testing locally, write transactions take 50% less total time when 1-3 and 7-9 are eliminated. Given that recent perf tests of a 3-node timelock server show similar per-request times of ~3-4ms, this should hold against a real timelock deployment as well.

rhero commented 7 years ago

Hey @nziebart - this is really interesting. When you say you've tested locally, have you already implemented this? If so, could you throw up a PR for us to review? We'd love to take a look and run it through our other suite of benchmarks.

nziebart commented 7 years ago

So I did some experimentation with replacing all the HTTP calls with Websockets, with some pretty interesting results. Here are baseline local test results (HA setup) of PaxosTimelockServer#getFreshTimestamp:

17:40:03.627 [main] WARN  c.p.a.timelock.TimelockPerfTests - http: 10000 requests
17:40:03.627 [main] WARN  c.p.a.timelock.TimelockPerfTests - average request time: 3.6459
17:40:03.627 [main] WARN  c.p.a.timelock.TimelockPerfTests - total time: 36459

If we use a Websocket for the client -> server connection, we get:

17:39:27.166 [main] WARN  c.p.a.timelock.TimelockPerfTests - websocket: 10000 requests
17:39:27.168 [main] WARN  c.p.a.timelock.TimelockPerfTests - average request time: 1.9781
17:39:27.169 [main] WARN  c.p.a.timelock.TimelockPerfTests - total time: 19781

(approximately 50% time reduction)

Taking this a step further, if we use Websockets for the Paxos Learner / Acceptor communications, we get:

17:41:46.482 [main] WARN  c.p.a.timelock.TimelockPerfTests - websocket: 10000 requests
17:41:46.484 [main] WARN  c.p.a.timelock.TimelockPerfTests - average request time: 0.1946
17:41:46.485 [main] WARN  c.p.a.timelock.TimelockPerfTests - total time: 1946

(another reduction of ~50% of the original time)

Overall this gives an order-of-magnitude improvement. The theory is that since the baseline test needs two roundtrip HTTP requests per call (one from client -> server, and another from server -> other server to obtain a Paxos quorum), replacing each of these hops with a websocket shaves off about 50% of the time.
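
For concreteness, a minimal sketch of the client side of this (not the actual test code; the endpoint, frame format, and one-request-in-flight assumption are all mine): the websocket handshake happens once, and every subsequent call is a single frame exchange over the already-open connection rather than a fresh HTTP request.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.atomic.AtomicReference;

public final class WebsocketTimestampClient implements WebSocket.Listener {
    private final AtomicReference<CompletableFuture<String>> pending = new AtomicReference<>();
    private final WebSocket webSocket;

    public WebsocketTimestampClient(URI uri) {
        // One handshake up front; every subsequent request reuses this connection.
        this.webSocket = HttpClient.newHttpClient()
                .newWebSocketBuilder()
                .buildAsync(uri, this)
                .join();
    }

    /** Sends one request frame and blocks until the matching response frame arrives. */
    public synchronized String getFreshTimestamp() {
        CompletableFuture<String> response = new CompletableFuture<>();
        pending.set(response);
        webSocket.sendText("getFreshTimestamp", true).join();
        return response.join();
    }

    @Override
    public CompletionStage<?> onText(WebSocket ws, CharSequence data, boolean last) {
        // Assumes single-frame responses and one request in flight at a time.
        pending.get().complete(data.toString());
        ws.request(1);
        return null;
    }
}
```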

Would be happy to clean this up a bit and push up a PR for further testing.

rhero commented 7 years ago

This looks sick. Awesome - we will definitely review the linked PR. Right now, @jeremyk-91 is focusing on the remaining work so we can ship the Timelock server (#1513), but this will be the first thing we look into once that work completes (so probably next week).

Thanks for putting in the work here!

gsheasby commented 7 years ago

@rhero I notice the PR didn't merge. Should we pick this up now that Timelock is "complete"? @jeremyk-91 for SA

tpetracca commented 7 years ago

We already moved to HTTP2, so probably worth another sanity check on perf boost before we rock the boat and do this, especially given all the work Jeremy has put into the HTTP2 stuff.

nziebart commented 7 years ago

IL tests of HTTP2 vs websockets:

| clients | http2 | ws |
|---------|-------|--------|
| 1 | 1.5ms | 0.62ms |
| 2 | 2.5ms | 1.16ms |
| 5 | 2.6ms | 1.19ms |
| 10 | 3.0ms | 1.19ms |
| 50 | 8.0ms | 1.73ms |

When I have time, I'll try to dig into why websockets are faster. It might be mostly due to the overhead of Jetty routing the request - in which case maybe the simpler solution is just creating our own HTTP/2 handler.


Test details:

Also, the websocket tests had an optimization to the timestamp service: it used a single thread to service all requests instead of locking. This probably biases the higher concurrency tests.

jboreiko commented 7 years ago

Separating these discussions into two tickets. This ticket will now follow the move to batch commit operations and the other ticket #1836 will follow the move from HTTP to websockets.

gsheasby commented 7 years ago

@jboreiko @nziebart was there a second PR about this?

hsaraogi commented 7 years ago

#2072 perhaps fixes this.

hsaraogi commented 7 years ago

@jeremyk-91 to verify if #2072 actually fixes it.

jeremyk-91 commented 7 years ago

There are two parts to this ticket. Steps 1-3 are indeed batched now: TimelockService.lockImmutableTimestamp() will do everything in a single round trip.

Steps 7-9 are currently still done synchronously - we still do the refresh and then the unlocks as three separate trips (see SnapshotTransactionManager.setupRunTaskwithLocks... and SnapshotTransaction.commitWrites).

So, if I'm not wrong, #2072 fixes the first part of this as part of moving to async lock, but not the second part.

gsheasby commented 7 years ago

Thanks for checking, @jeremyk-91. How painful are the three round trips in 7-9? Consequently, how important is it for us to fix it?

nziebart commented 7 years ago

There is actually another step not listed here: fetching the start timestamp. This could be done at the same time as locking the immutable timestamp, so in total there are still 4 calls that could be eliminated.
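
To illustrate the shape of that, a hypothetical batched call (this interface does not exist in AtlasDB; LockToken is the stand-in type from the earlier sketch): one round trip returns the fresh start timestamp together with the immutable timestamp and its lock.

```java
// Hypothetical batched endpoint, not an existing AtlasDB API.
// LockToken is the illustrative stand-in type from the earlier sketch.
interface BatchedTimelockService {
    StartTransactionResponse startTransaction(); // one round trip instead of several
}

final class StartTransactionResponse {
    final long startTimestamp;       // would otherwise be a separate getTimestamp() call
    final long immutableTimestamp;   // step 3 in the original list
    final LockToken immutableTsLock; // step 2 in the original list

    StartTransactionResponse(long startTimestamp, long immutableTimestamp, LockToken immutableTsLock) {
        this.startTimestamp = startTimestamp;
        this.immutableTimestamp = immutableTimestamp;
        this.immutableTsLock = immutableTsLock;
    }
}
```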

fsamuel-bs commented 6 years ago

I think #2871 removes step 7.