Closed: ff-sdesai closed this issue 9 months ago
Unfortunately, there is no way to configure reading in batches at the moment (there is only the ReadAllFromPostgres
function, which assumes that everything will be returned in one database query). We could add a ReadFromPostgres
function that reads data in batches. It should be pretty straightforward if we assume that we don't need to handle updates/inserts in the meantime.
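For reference, such a batched read would typically page through the table with keyset pagination rather than OFFSET. A minimal sketch of the per-batch query construction in Go (batchQuery, the table/column names, and the batch size are all hypothetical, not part of this library's API):

```go
package main

import "fmt"

// batchQuery builds one keyset-pagination query: it fetches up to
// batchSize rows whose id column is greater than lastID (the highest
// id seen in the previous batch; use 0 for the first batch).
// Keyset pagination stays fast on large tables because each query
// seeks through the index instead of scanning past an OFFSET.
func batchQuery(table, idColumn string, lastID, batchSize int) string {
	return fmt.Sprintf(
		"SELECT * FROM %s WHERE %s > %d ORDER BY %s LIMIT %d",
		table, idColumn, lastID, idColumn, batchSize,
	)
}

func main() {
	// Hypothetical table "users" read in batches of 10k rows.
	fmt.Println(batchQuery("users", "id", 0, 10000))
	fmt.Println(batchQuery("users", "id", 10000, 10000))
}
```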
You can also try adding beam.Reshuffle() after the database read to redistribute the data among all workers.
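Conceptually, beam.Reshuffle() breaks the fusion between the read step and the steps after it, so the records end up spread across all workers instead of staying on the one that ran the query. Beam actually does this by assigning random keys and regrouping; the round-robin sketch below only illustrates the resulting spread (redistribute is a made-up helper, not a Beam API):

```go
package main

import "fmt"

// redistribute spreads records across numWorkers round-robin. This is
// only an illustration of the effect of beam.Reshuffle(): elements
// read on a single worker end up spread roughly evenly across all
// workers for the downstream steps.
func redistribute(records []string, numWorkers int) [][]string {
	buckets := make([][]string, numWorkers)
	for i, r := range records {
		buckets[i%numWorkers] = append(buckets[i%numWorkers], r)
	}
	return buckets
}

func main() {
	rows := []string{"r1", "r2", "r3", "r4", "r5"}
	fmt.Println(redistribute(rows, 2)) // [[r1 r3 r5] [r2 r4]]
}
```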
It won't automatically redistribute the read records among all workers without beam.Reshuffle()? Also, if there are lakhs (hundreds of thousands) of records in the DB table, won't the query take too much time?
How much time will it take for you to add the function to read data in batches?
I will need one to two weeks to add that feature (I have a lot of other work to do first).
Thanks. One to two weeks is fine with me. After you add support for batching, which of the following two ways will it work?
I will implement it as a BoundedSource
so the runner can decide the bundle size and the reading speed. I can add an extra option to limit the batch size in case the runner sets a value too high for the database to handle.
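The bundle-splitting idea can be sketched as follows: the source splits the total row range into bounded bundles, and an optional cap keeps any single batch from growing larger than the database can comfortably serve (bundle, splitRows, and the cap are illustrative names, not Beam's actual split API):

```go
package main

import "fmt"

// bundle is a half-open row range [start, end) that one worker reads.
type bundle struct{ start, end int64 }

// splitRows splits totalRows into bundles of at most maxBatch rows.
// maxBatch plays the role of the proposed option that caps the batch
// size even when the runner would pick larger bundles.
func splitRows(totalRows, maxBatch int64) []bundle {
	var out []bundle
	for start := int64(0); start < totalRows; start += maxBatch {
		end := start + maxBatch
		if end > totalRows {
			end = totalRows
		}
		out = append(out, bundle{start, end})
	}
	return out
}

func main() {
	// 100k rows capped at 30k per bundle -> four bundles.
	fmt.Println(splitRows(100000, 30000))
}
```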
Yes, that would be nice
@medzin Just wanted to check: did you get a chance to start on this one?
@medzin We have integrated this library as an input connector in our Apache Beam pipeline. Can you please confirm whether you have started adding support for batched input reading?
@ff-sdesai I started working on the code here: https://github.com/medzin/beam-postgres/tree/issue/2
Thanks. Appreciate it.
@medzin Can you please update the documentation also to indicate how the batching can be used?
Thanks
If I have 100k records in my table, will it fetch all 100k first and then go to the next step for each record? Also, is there a way to configure a batch size for the data fetch, say 10k?