medzin / beam-postgres

Light IO transforms for Postgres read/write in Apache Beam pipelines.
Apache License 2.0

(Question) How to set a batch-size for data to be fetched from DB? #2

Closed ff-sdesai closed 9 months ago

ff-sdesai commented 10 months ago

If my table has 100k records, will it fetch all 100k first and only then move on to the next step for each record? Also, is there a way to configure a batch size for the fetch, say 10k?

medzin commented 10 months ago

Unfortunately, there is currently no way to configure reading in batches. There is only the ReadAllFromPostgres function, which assumes that everything will be returned by a single database query. We could add a ReadFromPostgres function that reads data in batches. It should be fairly straightforward if we assume that we don't need to handle updates/inserts that happen while the read is in progress.
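To illustrate what "reading in batches" could mean at the query level, here is a minimal sketch using keyset pagination. It is not the library's implementation: the `records` table, the `read_in_batches` helper, and the use of the standard-library `sqlite3` module as a stand-in for Postgres (so the sketch runs without a database server) are all assumptions made for this example. With a Postgres driver such as psycopg the placeholder syntax would differ.

```python
import sqlite3
from typing import Iterator, List, Tuple

Row = Tuple[int, str]


def read_in_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[Row]]:
    """Yield rows in primary-key order, at most batch_size rows per query."""
    last_id = -1  # assumption: ids are non-negative integers
    while True:
        # Keyset pagination: resume after the last id already seen.
        batch = conn.execute(
            "SELECT id, name FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            return
        yield batch
        last_id = batch[-1][0]


def demo() -> List[int]:
    """Build a hypothetical 25-row table and return the size of each batch."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO records VALUES (?, ?)",
        [(i, f"row-{i}") for i in range(25)],
    )
    return [len(batch) for batch in read_in_batches(conn, batch_size=10)]
```

Each query returns at most `batch_size` rows, so memory use stays bounded regardless of the table size. As noted above, rows inserted or updated while the loop is running may or may not be seen, which is why the simple version assumes the table does not change during the read.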

You can also try adding beam.Reshuffle() after the database read to redistribute the data among all workers.

ff-sdesai commented 10 months ago

Won't it automatically redistribute the read records among all workers without beam.Reshuffle()? Also, if there are lakhs (hundreds of thousands) of records in the DB table, won't the query take too long? How long will it take you to add the function to read data in batches?

medzin commented 10 months ago

I will need one or two weeks to add that feature (I have a lot of other work to finish first).

ff-sdesai commented 10 months ago

Thanks. One or two weeks should be fine with me. After you add support for batching, in which of the following two ways will it work?

  1. Fetches the first batch, distributes it among the workers, and starts fetching the next batch from the DB without waiting for the workers to finish processing the first batch, OR
  2. Fetches the first batch, distributes it among the workers, and waits for the workers to finish processing it. Once they are done, starts fetching the second batch.

medzin commented 10 months ago

I will implement it as a BoundedSource so the runner can decide the bundle size and the reading speed. I can add an extra option to limit the batch size if the runner sets the value too high for the database to handle.
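As a rough illustration of that idea, the sketch below splits a primary-key range into bundles from a runner-supplied size hint and applies an optional user-set cap. It is an assumption-laden sketch, not the library's code: `split_key_range` and `max_batch_size` are hypothetical names, and the hint is treated as a row count for simplicity, whereas Beam's `BoundedSource.split` receives its desired bundle size in bytes.

```python
from typing import List, Optional, Tuple


def split_key_range(
    start: int,
    stop: int,
    desired_bundle_size: int,
    max_batch_size: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Split [start, stop) into consecutive half-open key ranges.

    desired_bundle_size stands in for the runner's hint; max_batch_size is
    a hypothetical user-set cap that protects the database.
    """
    size = desired_bundle_size
    if max_batch_size is not None:
        size = min(size, max_batch_size)
    if size <= 0:
        raise ValueError("bundle size must be positive")
    # Each range can then be read independently with a bounded query.
    return [(lo, min(lo + size, stop)) for lo in range(start, stop, size)]
```

For example, `split_key_range(0, 100_000, 50_000, max_batch_size=10_000)` yields ten ranges of 10,000 keys each, even though the hint asked for 50,000. With a source split this way, the runner schedules the ranges itself, so whether the next range is read before earlier ones finish processing is the runner's decision rather than the connector's.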

ff-sdesai commented 10 months ago

Yes, that would be nice

ff-sdesai commented 10 months ago

@medzin Just wanted to check: did you get a chance to start on this one?

ff-sdesai commented 9 months ago

@medzin We have integrated this library as the input connector in our Apache Beam pipeline. Can you please confirm whether you have started adding support for batched input reading?

medzin commented 9 months ago

@ff-sdesai I started working on the code here: https://github.com/medzin/beam-postgres/tree/issue/2

ff-sdesai commented 9 months ago

Thanks. Appreciate it.

ff-sdesai commented 9 months ago

@medzin Can you please also update the documentation to show how batching can be used?

medzin commented 9 months ago

PTAL: https://github.com/medzin/beam-postgres#reading-in-batches

ff-sdesai commented 9 months ago

Thanks