Predictive batching - Githubissues

petervandivier / PsAdxArchiver

Generic exporter from Azure Data Explorer to an external table in Azure blob storage.

MIT License

Predictive batching #16

Open petervandivier opened 9 months ago

petervandivier commented 9 months ago

Uneven data distribution hurts queue throughput (#5) and can cause a batch to fail if it exceeds 60 minutes of runtime.

Get a baseline export size and use it to predict what batch sizes are appropriate for a given data range using estimate_data_size() or similar.

petervandivier commented 9 months ago

Back-of-the-envelope math suggests batch sizes of roughly 4 to 4.5 GB if you want to target 15-minute run times per batch.

5 MB/s export speed
x 60 s = 300 MB per minute
x 15 min = 4,500 MB (~4.5 GB)

The predictor function should accept user input to set a custom run time (keeping in mind the 60-minute hard cap, with a buffer).
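
A rough sketch of that predictor, written in Python for illustration only (the module itself is PowerShell). The function name, parameter names, and the 5-minute default buffer are assumptions rather than anything that exists in the repo. Only the 5 MB/s speed, the 15-minute target, and the 60-minute cap come from the numbers above.

```python
def predict_batch_size_mb(
    throughput_mb_per_s: float = 5.0,   # observed export speed (from the estimate above)
    target_minutes: float = 15.0,       # desired run time per batch
    hard_cap_minutes: float = 60.0,     # batches fail beyond this
    buffer_minutes: float = 5.0,        # assumed safety margin under the cap
) -> float:
    """Return a batch size in MB that should finish within the target run time."""
    if throughput_mb_per_s <= 0 or target_minutes <= 0:
        raise ValueError("throughput and target run time must be positive")

    max_minutes = hard_cap_minutes - buffer_minutes
    if max_minutes <= 0:
        raise ValueError("buffer must be smaller than the hard cap")

    # Never plan for longer than the cap minus the buffer.
    effective_minutes = min(target_minutes, max_minutes)
    return throughput_mb_per_s * 60 * effective_minutes


print(predict_batch_size_mb())                   # 4500.0
print(predict_batch_size_mb(target_minutes=90))  # 16500.0 (clamped to 55 min)
```

Clamping to the cap instead of throwing means an over-ambitious target still yields a usable batch size. Whether that should warn the user is an open question.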

petervandivier commented 9 months ago

Maybe steer clear of estimate_data_size() & stick to .show extents. estimate_data_size(*) appears to read the entire table into memory - which isn't super surprising in retrospect :sweat_smile:
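
If per-range sizes can be pulled from extent metadata, one possible way to turn them into batches is a simple greedy pass over time-ordered chunks. This is a Python sketch for illustration only. The `(label, size_mb)` input shape is hypothetical, and mapping `.show extents` output onto it is not shown here.

```python
from typing import Iterable, List, Tuple


def plan_batches(chunks: Iterable[Tuple[str, float]], budget_mb: float) -> List[List[str]]:
    """Group consecutive (label, size_mb) chunks into batches no larger than budget_mb.

    A single chunk larger than the budget still gets its own batch; it would
    need to be split into finer ranges before export.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0.0

    for label, size_mb in chunks:
        # Close the current batch if adding this chunk would exceed the budget.
        if current and current_size + size_mb > budget_mb:
            batches.append(current)
            current, current_size = [], 0.0
        current.append(label)
        current_size += size_mb

    if current:
        batches.append(current)
    return batches


# Hypothetical daily sizes in MB, with a 4500 MB budget.
days = [("d1", 1000), ("d2", 3000), ("d3", 2500), ("d4", 500), ("d5", 6000)]
print(plan_batches(days, 4500))  # [['d1', 'd2'], ['d3', 'd4'], ['d5']]
```

Note that `d5` exceeds the budget by itself, which is exactly the uneven-distribution case that would need a narrower time range.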