abaker14 opened 7 years ago
Hey @abaker14 and @kbrainard, I'll take a look at the internal ticket, but can we add some more detail here about which specific KVS(es) this shows up on and the batch hint size?
Looks like this issue is pending some discussion on the internal QA ticket. Will update this ticket once there is consensus there. Putting it in the backlog for now; I've added myself as a watcher to the internal ticket.
Tagging @gbonik for SA.
Something that's not clear to me, though, is the priority of the issue. It looks like it came out of an investigation following a P0, but I can't tell whether we think this potential bug was the primary contributor to the bad performance, or whether this is something we observed and are filing so we can investigate further.
@rhero I think it's worth clarifying in the AtlasDB docs exactly what is implicitly parallelized under the covers, so that a consumer understands how best to tailor their use of AtlasDB to achieve optimal mechanical sympathy.
For example, the following currently vary based on the underlying KVS:
- `getRange` is parallelized on Cassandra
- `getRows` is partitioned by hosts owning a given row name token and parallelized on Cassandra
- `getRows` is partitioned into batches of N rows and parallelized on Postgres
- `multiPut` is partitioned into batches of N cells and parallelized on Cassandra, Postgres, and Oracle

(A sketch of this batch-and-parallelize pattern follows below.)

+1 to @schlosna's comment. There seem to be two pieces here: document these semantics (ideally in the javadoc for the API) and standardize the parallelization between underlying KVSes where possible.
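To make the batch-and-parallelize pattern above concrete, here is a minimal sketch in plain Java (plus Guava's `Lists.partition`). The batch size, pool size, and `fetchBatch` stand-in are hypothetical illustrations, not AtlasDB's actual internals:

```java
import com.google.common.collect.Lists;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal sketch of "partition into batches of N and run in parallel".
// BATCH_SIZE and the pool size are hypothetical knobs for illustration.
public class BatchedParallelFetch {
    private static final int BATCH_SIZE = 100;
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4);

    // Stand-in for a single KVS call covering one batch of rows.
    static List<String> fetchBatch(List<byte[]> rows) {
        return List.of(); // placeholder result
    }

    static List<String> getRowsBatched(List<byte[]> rows) throws Exception {
        List<Future<List<String>>> futures = new ArrayList<>();
        // Partition the requested rows into batches of N (as the Postgres
        // getRows path is described to do), then submit each batch in parallel.
        for (List<byte[]> batch : Lists.partition(rows, BATCH_SIZE)) {
            futures.add(POOL.submit(() -> fetchBatch(batch)));
        }
        List<String> results = new ArrayList<>();
        for (Future<List<String>> f : futures) {
            results.addAll(f.get()); // blocks until each batch completes
        }
        return results;
    }
}
```

The point of the sketch is that the batch size and degree of parallelism are decided inside the KVS layer, invisibly to the caller, which is exactly why documenting them matters.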
Looks like some of these will be easier to expose than others. As an example, `multiPut` is only called to persist the writes during the commit phase of a transaction (and is not dependent on whether the application logic grouped its puts or not). `getRows` and `getRanges` can be addressed in the shorter term since they are hit directly by calls on `Transaction`. The same is true of `getRowsColumnRange`.
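As an illustration of why these calls are directly observable by consumers, here is a sketch of application code reading rows through the transaction layer. This assumes signatures resembling AtlasDB's `TransactionManager`/`Transaction` API; treat the exact types as illustrative rather than authoritative:

```java
import com.palantir.atlasdb.keyvalue.api.ColumnSelection;
import com.palantir.atlasdb.keyvalue.api.RowResult;
import com.palantir.atlasdb.keyvalue.api.TableReference;
import com.palantir.atlasdb.transaction.api.Transaction;
import com.palantir.atlasdb.transaction.api.TransactionManager;
import java.util.List;
import java.util.NavigableMap;

// Sketch of a consumer hitting getRows directly on a Transaction, which is
// why its batching/parallelization behavior is visible to application code.
public class DirectGetRowsExample {
    static NavigableMap<byte[], RowResult<byte[]>> readRows(
            TransactionManager txManager, TableReference table, List<byte[]> rowNames) {
        return txManager.runTaskWithRetry((Transaction tx) ->
                // Whether (and how) this call is batched and parallelized
                // currently depends on the underlying KVS.
                tx.getRows(table, rowNames, ColumnSelection.all()));
    }
}
```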
AtlasDB fix for internal issue QA-98132
Issue found/filed by @kbrainard

Parallel range scans are generating larger-than-expected CPU load. This was discovered when running a range scan with 8 threads caused the process running Atlas to be killed. Switching the range scan to run serially produced a 4x performance improvement (compared to the 8-way parallel run), and further experimentation showed that even running with 2 threads overloaded the CPU.
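For reference, a minimal sketch of the experiment described above: the same range-scan work run serially versus on a fixed-size thread pool. `scanOneRange` is a hypothetical stand-in for the actual scan, and the thread counts are the ones from the report:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the reported experiment: identical range-scan tasks run
// serially vs. with bounded parallelism. Per the report, 8 threads got
// the process killed and even 2 threads overloaded the CPU, while the
// serial run was ~4x faster than the 8-way parallel run.
public class RangeScanLoadExperiment {
    // Hypothetical stand-in for one range scan over a shard of the keyspace.
    static void scanOneRange(int rangeIndex) {
        // ... perform the range scan for this shard ...
    }

    static void runSerially(int ranges) {
        for (int i = 0; i < ranges; i++) {
            scanOneRange(i); // one scan at a time; CPU load stays bounded
        }
    }

    static void runWithParallelism(int ranges, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < ranges; i++) {
            final int range = i;
            pool.submit(() -> scanOneRange(range));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```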