opensafely-core / ehrql

ehrQL: the electronic health record query language for OpenSAFELY
https://docs.opensafely.org/ehrql/

Use batched inserts for `InsertMany` #2008

Closed: evansd closed this 4 months ago

evansd commented 4 months ago

The InsertMany construct is most frequently used for codelists, which are small enough that it's not worth optimising. However it is also used for the @table_from_file feature where the row counts are much higher – and I have seen such inserts take 2 hours 40 minutes.

Using SQLAlchemy's batched inserts is about 10x faster on MSSQL, and about another order of magnitude faster again on Trino.
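A minimal sketch of the difference between row-at-a-time and batched inserts, using the stdlib `sqlite3` driver rather than ehrQL's actual backends (the PR itself uses SQLAlchemy's equivalent, passing a list of rows to a single `execute` call, which drives the DBAPI `executemany` path underneath):

```python
import sqlite3

# In-memory database standing in for MSSQL / Trino in this sketch
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE patients (patient_id INTEGER, code TEXT)")

rows = [(i, f"X{i}") for i in range(1000)]

# Row-at-a-time: one INSERT statement per row (the slow path)
# for row in rows:
#     connection.execute("INSERT INTO patients VALUES (?, ?)", row)

# Batched: executemany hands the driver the whole list in one call,
# the same mechanism SQLAlchemy's batched inserts build on
connection.executemany("INSERT INTO patients VALUES (?, ?)", rows)
connection.commit()

count = connection.execute("SELECT count(*) FROM patients").fetchone()[0]
print(count)  # 1000
```

The speedup comes from eliminating the per-statement round trip and parse overhead, which dominates when row counts are large.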

Closes #1934

cloudflare-pages[bot] commented 4 months ago

Deploying databuilder-docs with Cloudflare Pages

Latest commit: 4f2f369
Status: ✅  Deploy successful!
Preview URL: https://e31d2e69.databuilder.pages.dev
Branch Preview URL: https://evansd-insert-many.databuilder.pages.dev


evansd commented 4 months ago

> Given on l233 a batch_size's worth of rows are loaded into memory (AFAICT) - could 10k cause problems with some very wide tables-from-files?

Yeah, we do load a full batch into memory (SQLAlchemy refuses to work with it otherwise), but to use even 1GB of RAM we'd be talking 100KB per row, which seems very wide and likely to cause problems in other ways.
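The back-of-envelope arithmetic behind that figure, assuming a 10,000-row batch (the "10k" from the review comment) and a 1GB budget:

```python
batch_size = 10_000             # rows held in memory per batch (assumed from the review comment)
memory_budget = 1_000_000_000   # 1GB

# Average row size at which a single batch would consume the whole budget
bytes_per_row = memory_budget // batch_size
print(bytes_per_row)  # 100000 bytes, i.e. ~100KB per row
```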

> The SQLAlchemy docs seem quite vague about actual numbers here, so I guess a best guess is all we've got?

I think it's vague because it's configured per dialect. For MSSQL I think it uses 999, which seems to match some limit somewhere.
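Since the batch size is dialect-configured rather than fixed, a portable sketch of the technique just chunks the row stream so only one batch is ever materialised at a time. The 999 figure below is taken from the comment above, not from any dialect's documentation:

```python
from itertools import islice

def batched(rows, batch_size):
    """Yield lists of at most batch_size rows from any iterable,
    holding only one batch in memory at a time."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch

# e.g. a dialect that caps batches at 999 rows
sizes = [len(b) for b in batched(range(2500), 999)]
print(sizes)  # [999, 999, 502]
```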

Jongmassey commented 4 months ago

> very wide and likely to cause problems in other ways.

Almost certainly. I do have a vague recollection of a supremely wide table at some point, but a 100KB row it almost certainly wasn't!