serverless-seoul / dynamorm

AWS DynamoDB ORM in Typescript
Apache License 2.0

batchWrite unreliable #20

Open dawolf-tty opened 3 years ago

dawolf-tty commented 3 years ago

I noticed 2 main issues with the batchWrite function.

Requests not throttled

When the number of items is large, the requests should be throttled to avoid issues; the throttling should be configurable by the caller, or an exponential backoff could be implemented. Consider a quite common scenario such as putting 100K items into a table: with the current batchWrite implementation this results in 4000 promises running in parallel. At such a rate, even a table configured as 'on-demand' will quickly exceed its throughput limits.
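To illustrate the difference, a rough sketch only (this is not the gist; it assumes aws-sdk v2's DocumentClient, lodash for chunking, and a hypothetical writeChunk helper):

import { DynamoDB } from "aws-sdk";
import * as _ from "lodash";

const documentClient = new DynamoDB.DocumentClient();

// Hypothetical helper: write one chunk of up to 25 items to a table.
async function writeChunk(tableName: string, chunk: any[]) {
  return documentClient
    .batchWrite({
      RequestItems: {
        [tableName]: chunk.map((item) => ({ PutRequest: { Item: item } })),
      },
    })
    .promise();
}

// Current behaviour, roughly: every chunk is fired at once,
// so 100K items become 4000 concurrent batchWrite calls.
// await Promise.all(_.chunk(items, 25).map((c) => writeChunk("MyTable", c)));

// Throttled alternative: process a bounded number of chunks per wave.
async function writeAllThrottled(tableName: string, items: any[], concurrency = 10) {
  const chunks = _.chunk(items, 25);
  for (let i = 0; i < chunks.length; i += concurrency) {
    const wave = chunks.slice(i, i + concurrency);
    await Promise.all(wave.map((chunk) => writeChunk(tableName, chunk)));
  }
}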

Bad exception management

If an exception occurs during an await documentClient.batchWrite() call, and this quickly happens when requests are not throttled, something really bad occurs: the await batchWrite() made by the caller stops waiting BUT the remaining chunks are still written in the background! It is important to note that, as stated in the official doc,

If none of the items can be processed due to insufficient provisioned throughput on all of the tables in the request, then BatchWriteItem returns a ProvisionedThroughputExceededException.

So if the throughput is not sufficient to write x out of 25 items, no exception is raised and the unprocessed items are returned in the promise result; so far so good. But if none of the 25 items can be written, an exception is raised. The try...catch surrounding the return await Promise.all() will surely catch the exception, but by then it is too late to keep track of which writes succeeded. My advice is to handle exceptions inside each promise and return a list of all the failed requests to the caller, as in the sketch below.
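Just to illustrate the idea (a rough sketch only, not the gist itself; it assumes aws-sdk v2's DocumentClient and chunks of at most 25 write requests):

import { DynamoDB } from "aws-sdk";

const documentClient = new DynamoDB.DocumentClient();

// Per-chunk error handling: failures never escape Promise.all,
// they are collected and handed back to the caller instead.
async function batchWriteCollectingFailures(
  tableName: string,
  chunks: DynamoDB.DocumentClient.WriteRequest[][],
): Promise<DynamoDB.DocumentClient.WriteRequest[]> {
  const failed: DynamoDB.DocumentClient.WriteRequest[] = [];
  await Promise.all(
    chunks.map(async (requests) => {
      try {
        const result = await documentClient
          .batchWrite({ RequestItems: { [tableName]: requests } })
          .promise();
        // Partial failure is reported via UnprocessedItems, not an exception.
        failed.push(...(result.UnprocessedItems?.[tableName] ?? []));
      } catch (error) {
        // Total failure of the chunk (e.g. ProvisionedThroughputExceededException):
        // keep the whole chunk instead of letting the exception abort everything.
        failed.push(...requests);
      }
    }),
  );
  return failed; // the caller can retry or report these
}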

To overcome these issues I made some small modifications, which you can find in this gist. I know that the real solution should be engineered better, but I am a newcomer to the JavaScript world so I do not really know what the best approach would be here.

Thanks for your work, I have learnt a lot studying your code.

breath103 commented 3 years ago

@dawolf-tty

Hi, thanks for the thorough research! You're indeed right about the current design flaw, or at least the lack of a feature. This is not the first time this issue has been discussed; you can refer to https://github.com/balmbees/dynamo-types/issues/29 (from the original repo).

Bad exception management

In short, this is really about designing the interface. We have multiple options for the given situation (when batchWrite partially fails):

a. just throw, and let the user know about failed / succeeded records in the Error object
b. return a result, and let the user know about failed / succeeded records in the response
c. automatically retry, up to a default maxRetry count
...etc.

I haven't really decided on this, since it is a very tricky question. Let me know if you have thought about the design of the interface; I'm still thinking about this.

Requests not throttled

Quite frankly this hasn't been a major issue for me, since I've been using on-demand capacity on virtually every table.
With that, seeing ProvisionedThroughputExceededException is very, very rare (unless there is a hot-partition issue or something).
I'm actually curious why you're not using on-demand capacity? (Or have you, but you're still seeing the issue?)

let me know, thanks.

dawolf-tty commented 3 years ago

Sometimes I use provisioned capacity (mainly without auto-scaling) because, in some circumstances, the cost must be exactly known in advance (customers require it). But this issue report is based on a massive batch upload to an on-demand table; capacity, as you probably know, is not "infinite" even with on-demand tables.

Anyway, unless you explicitly ask AWS to increase the on-demand capacity (the default is 40K read/write), this limit will be hit very quickly with the current implementation of batchWrite. I have just tried with a set of 100K items (region: eu-west-1) and about 30% of the records were not written. This is understandable because 4000 promises run in parallel, each of which tries to write 25 items.

During the last week I wrote about 100M records to an on-demand table, and this is quite common in enterprise migrations (at least in my experience); without the changes in the gist I linked before, it would have been a nightmare to handle.

So I think the best approach to quickly solve this issue and make batch writes reliable is option b:

return a result, and let the user know about failed / succeeded records in the response

coupled with some sort of throttling which users should be able to customize according to the provisioned capacity, to minimize unprocessed items (the one in the gist is the quickest way I came up with, but I am a JavaScript beginner). Be careful that the exception management must be implemented inside the promise (see the gist), otherwise entire unwritten chunks will be lost.
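To make it concrete, a hypothetical shape for option b could be something like this (the names and the concurrency option are only an illustration, not a proposal for the final API):

// Hypothetical result shape: the call never throws for write failures,
// it reports them instead.
interface BatchWriteResult<T> {
  processed: T[];
  failed: T[]; // records left unprocessed or whose chunk threw
}

// Hypothetical usage, with a caller-configurable concurrency knob:
const result = await Table.batchWrite(changes, { concurrency: 10 });
if (result.failed.length > 0) {
  // retry only the failed records, log them, or surface them to the caller
  await Table.batchWrite(result.failed, { concurrency: 10 });
}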

A more elegant/reliable approach would be an exponential backoff implementation, but I think that could be added later.
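For reference, such a backoff around a single batch could look roughly like this (the retry count and delays are arbitrary examples, not values from the gist; aws-sdk v2's DocumentClient is assumed):

import { DynamoDB } from "aws-sdk";

const documentClient = new DynamoDB.DocumentClient();

// Exponential backoff around a single <= 25-item batch: retry whatever
// DynamoDB reports as unprocessed, waiting longer after each attempt.
async function writeWithBackoff(
  tableName: string,
  requests: DynamoDB.DocumentClient.WriteRequest[],
  maxRetries = 5,
): Promise<DynamoDB.DocumentClient.WriteRequest[]> {
  let pending = requests;
  for (let attempt = 0; attempt < maxRetries && pending.length > 0; attempt++) {
    if (attempt > 0) {
      // 100ms, 200ms, 400ms, ... plus a little jitter
      const delay = 100 * 2 ** (attempt - 1) + Math.random() * 50;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
    try {
      const result = await documentClient
        .batchWrite({ RequestItems: { [tableName]: pending } })
        .promise();
      pending = result.UnprocessedItems?.[tableName] ?? [];
    } catch (error) {
      // e.g. ProvisionedThroughputExceededException: treat the whole batch as
      // unprocessed and let the next (longer) delay retry it.
    }
  }
  return pending; // whatever is still unprocessed after maxRetries
}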

breath103 commented 3 years ago

Hi, thanks for the explanation.

First, on use cases where you "need to" throttle (based on customers' requirements to know the budget, I guess?), it's very tricky - not just batchWrite but any other query has the potential to fail - I don't really have a good answer for that, honestly.

Second, for cases where you hit the limit even with on-demand, I'm completely aware of those 2 cases (account limit / hot partition),
but I'm still not sure about (b.), because it forces developers to add a snippet for checking failed records every time they use batchWrite, and this feels wrong because it's obvious that the developer intended every record to succeed anyway - nobody would be like.. "I'm trying batchWrite but it's ok to fail 10% of records".

const batchResult = await Table.batchWrite(changes);
if (batchResult.failed.length > 0) {
  // i guess retry..?
  await Table.batchWrite(batchResult.failed);
}

When you have a max capacity of X but you want to insert Y records in time T, where (Y/T) > X,
the only way to succeed is to change T: you can't change Y, nor can you raise X while the software is running (you need to talk to AWS support, at least...).

So i think

try {
  const batchResult = await Table.batchWrite(changes);
} catch (error) {
  // this should not happen,
  // but when it happens the developer can tell exactly what happened by checking the error.
}

kind of implementation is better for developers. I'll check and think about changing the implementation of batchWrite as follows:

  1. adding concurrency control so that it doesn't burst the write capacity.
  2. adding failed record control. (through error)

In the meantime, if you need this urgently handled in dynamorm, I suggest doing something like:

import * as Bluebird from "bluebird";
import * as _ from "lodash";

await Bluebird.map(
  _.chunk(updates, 25),
  (chunk) => Table.batchWrite(chunk),
  { concurrency: 10 }
)

This ensures a maximum of 10 concurrent batches at any given moment (where each batch is 25 items) - essentially, this will be very similar to the implementation I'll create for batchWrite.

thanks.

dawolf-tty commented 3 years ago

I have temporarily solved the bulk load of millions of records with the code from the gist above, which worked like a charm. I ended up using async-sema to implement throttling because it is a much simpler and smaller library compared to bluebird.
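The pattern with async-sema looks roughly like this (a minimal sketch, not the gist itself; Table, updates and the limits are placeholders):

import { Sema } from "async-sema";
import * as _ from "lodash";

// At most 10 batchWrite calls in flight at any moment.
const sema = new Sema(10);

await Promise.all(
  _.chunk(updates, 25).map(async (chunk) => {
    await sema.acquire();
    try {
      await Table.batchWrite(chunk);
    } finally {
      sema.release();
    }
  })
);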

Thanks for your support and explanation, it is really appreciated.

breath103 commented 3 years ago

Oh yeah, that's better - bluebird is indeed too heavy.

strefethen commented 3 years ago

@dawolf-tty @breath103 I've just started digging into DynamoDB, stumbled upon this repo and this specific issue, and the above is such a great exchange, discussing challenging issues with a wonderful balance and appreciation of perspective on each side. I just have to say kudos to you both. I've recently been moving data around in DynamoDB and this explains a lot, so thank you for that.