It turns out that `console.log` retains its arguments indefinitely and prevents them from being GC'd, so you can do fancy things with them like inspect them in dev tools. This was causing us to eventually run out of heap space when publishing millions of records, since I was using `console.log` to output the statement ids, etc.
This adds a `println` utility function that calls `process.stdout.write`, and replaces all the `console.log`s with `println`. A lot of the `console.log`s wouldn't matter, since they're in one-off commands like `mcclient id`, but I kind of wanted to burn them all to the ground after chasing this leak all day 😄
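The utility itself is tiny; roughly something like this (a sketch, not the actual mcclient source):

```javascript
// Minimal println sketch: write straight to stdout instead of going
// through console.log, so the logged values aren't retained afterwards.
function println(...args) {
  process.stdout.write(args.join(' ') + '\n');
}

println('statement id:', 42); // prints "statement id: 42"
```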
There are a few other changes to the publish command:
- Instead of accumulating all promises for each batch publication and waiting on `Promise.all`, we only keep "in-flight" batch promises in a map and remove them when they resolve. Then when the input stream ends we just wait until the map is empty. This lets us avoid keeping thousands of promises around during huge ingestions.
- I was using `Array.slice` needlessly (and confusingly) when printing the batch results, which was allocating objects for no good reason. I replaced that with a for loop.
- I'm now closing the input stream and killing the jq process on errors, before rejecting the main promise. Seems like a good idea.
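The in-flight tracking described above looks roughly like this (a sketch with hypothetical names — `submitBatch`, `drain`, `publishBatch` — not the real publish command internals):

```javascript
// Keep only unresolved batch promises in a Map, delete each entry when it
// settles, and wait for the map to drain once the input stream ends.
const inFlight = new Map();
let nextBatchId = 0;

function submitBatch(publishBatch, batch) {
  const id = nextBatchId++;
  const p = publishBatch(batch)
    .then((results) => {
      // Plain for loop over the results -- no Array.slice, so no
      // needless intermediate allocations while printing.
      for (let i = 0; i < results.length; i++) {
        process.stdout.write(results[i] + '\n');
      }
    })
    .finally(() => inFlight.delete(id)); // drop the settled promise
  inFlight.set(id, p);
  return p;
}

// Called when the input stream ends: loop until every batch has settled.
// Each race resolves only after some promise's finally() has already
// removed it from the map, so the size strictly decreases.
async function drain() {
  while (inFlight.size > 0) {
    await Promise.race(inFlight.values());
  }
}
```

On errors the real command would additionally close the input stream and kill the jq child process before rejecting, as noted above.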
Something I noticed is that for giant ingestions, the default timeout can be too low; I'll get timeouts on `data/put` commands if I'm a few million records in. Changing the global timeout works, but we should probably add some backoff / retry logic for things like putting data.
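That retry logic could be as simple as an exponential-backoff wrapper; a sketch, where `withRetry` and the parameter defaults are hypothetical:

```javascript
// Hypothetical retry helper with exponential backoff for flaky calls
// like data/put during long ingestions. Not part of mcclient today.
async function withRetry(fn, { retries = 5, baseDelayMs = 100 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts, give up
      const delay = baseDelayMs * 2 ** attempt; // 100, 200, 400, ...
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Usage would be something like `await withRetry(() => client.putData(obj))`, where `client.putData` stands in for whatever issues the `data/put` call.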