parallel range requests using too much CPU

palantir / atlasdb

Transactional Distributed Database Layer

https://palantir.github.io/atlasdb/

Apache License 2.0


parallel range requests using too much CPU #1606

Open abaker14 opened 7 years ago

abaker14 commented 7 years ago

atlasdb fix for internal issue QA-98132

issue found/filed by @kbrainard

Parallel range scans are generating higher than expected CPU load. This was discovered when running a range scan with 8 threads caused the process running AtlasDB to be killed. Switching the range scan to run serially gave a 4x performance improvement compared to the 8-thread run, and further experimentation showed that even 2 threads overloaded the CPU.

schlosna commented 7 years ago

Hey @abaker14 and @kbrainard, I'll take a look at the internal ticket, but can we add some more detail here about which specific KVS(es) this shows up on and the batch hint size?

rhero commented 7 years ago

Looks like this issue is pending some discussion on the internal QA ticket. Will update this ticket after there is a consensus there. Will put in backlog for now and I added myself as a watcher to the internal ticket.

Tagging @gbonik for SA.

rhero commented 7 years ago

Something that's not clear to me is the priority of the issue though. It looks like this issue came out of an investigation following a P0, but I can't tell if we think this potential bug was the primary contributor to bad performance, or if this is an issue we observed and are filing so we can do more investigation.

schlosna commented 7 years ago

@rhero I think it's worth clarifying in the AtlasDB docs exactly what AtlasDB implicitly parallelizes for you under the covers, so that consumers understand how best to tailor their use of AtlasDB to achieve optimal mechanical sympathy.

For example, the following currently vary based on the underlying KVS:

getRange is parallelized on Cassandra
getRows is partitioned by hosts owning given row name token and parallelized on Cassandra
getRows is partitioned into batches of N rows and parallelized on Postgres
multiPut is partitioned into batches of N cells and parallelized on Cassandra, Postgres, and Oracle
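To make the "partitioned into batches of N and parallelized" pattern concrete, here is a minimal sketch in plain Java. It is not AtlasDB code: `BatchedParallelSketch`, `partition`, `loadInBatches`, and the `loader` callback are hypothetical names used only for illustration, and the real KVS implementations may differ. The point it shows is that the thread count is a separate knob from the batch size, so a caller who already runs several threads multiplies the load.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical illustration only; none of these names come from AtlasDB.
public class BatchedParallelSketch {

    /** Splits items into consecutive batches of at most batchSize elements. */
    public static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    /**
     * Runs loader on each batch using at most maxThreads worker threads and
     * concatenates the results in batch order. maxThreads = 1 is a serial run.
     */
    public static <T, R> List<R> loadInBatches(
            List<T> items, int batchSize, int maxThreads, Function<List<T>, List<R>> loader) {
        ExecutorService executor = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> batch : partition(items, batchSize)) {
                Callable<List<R>> task = () -> loader.apply(batch);
                futures.add(executor.submit(task));
            }
            List<R> results = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                results.addAll(future.get()); // waits for each batch in submission order
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a batch", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("batch failed", e.getCause());
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Integer> rows = Arrays.asList(1, 2, 3, 4, 5);
        // Stand-in for a KVS call: doubles every element of the batch.
        List<Integer> doubled = loadInBatches(rows, 2, 2,
                batch -> batch.stream().map(x -> x * 2).collect(Collectors.toList()));
        System.out.println(doubled); // prints [2, 4, 6, 8, 10]
    }
}
```

If a KVS already parallelizes like this internally, wrapping the call in 8 application threads gives up to 8 × `maxThreads` concurrent tasks, which is one plausible way the CPU could end up oversubscribed.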

ilyanep commented 7 years ago

+1 to @schlosna's comment. There seem to be two pieces here: documenting these semantics (ideally in the Javadoc for the API) and standardizing the parallelization across the underlying KVSes where possible.

jboreiko commented 7 years ago

Looks like some of these will be easier to expose than others. For example, multiPut is only called to write out the buffered writes during the commit phase of a transaction (and does not depend on whether the application logic grouped its puts). getRows and getRanges can be addressed in the shorter term, since they are hit directly by calls on Transaction.

jboreiko commented 7 years ago

The same is true of getRowsColumnRange.