r-dbi / dbi3

DBI revisited
https://r-dbi.github.io/dbi3

dbGetChunkedQuery() #23

Open krlmlr opened 7 years ago

krlmlr commented 7 years ago

The goal is to support fetching and processing data piecemeal with a callback. Development started in r-dbi/DBI#111, but perhaps the interface and the requirements should be specified here first.

CC @bborgesr @jcheng5 @jimhester @hadley.

talegari commented 7 years ago

@krlmlr Can you state the purpose of the function precisely? Do we intend to do split-apply-combine here? What does callback mean? (Sorry, I am a non-native speaker.)

jcheng5 commented 7 years ago

Split-apply-combine is one scenario that this style would support. Another is early termination (looking for a needle in a haystack--once you find it, you can stop fetching). Another is extract-transform-load where the data is larger than memory (or any other operation on the data that is performed for its side effect).

All of these can be performed today by dbSendQuery and dbFetch, but a callback-style API makes it harder to make mistakes (primarily I'm thinking of leaked connections) and can also be a common pattern that we offer not only in DBI but for any source of row-oriented data in R.
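
As a rough illustration of the callback style described above, here is a minimal sketch, not a proposed DBI interface. The helper name with_chunked_query() and its arguments are hypothetical; only dbSendQuery(), dbFetch(), dbHasCompleted() and dbClearResult() are existing DBI functions. The demo assumes the RSQLite package is installed.

```r
library(DBI)

# Hypothetical helper (not part of DBI): fetch a query in chunks of `n` rows
# and pass each chunk to `callback`. If the callback returns FALSE, stop early.
with_chunked_query <- function(con, sql, callback, n = 1000) {
  res <- dbSendQuery(con, sql)
  # Always release the result set, even if the callback throws an error.
  on.exit(dbClearResult(res), add = TRUE)

  while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = n)
    if (nrow(chunk) == 0) break
    if (isFALSE(callback(chunk))) break  # early termination
  }
  invisible(NULL)
}

# Demo on an in-memory SQLite database (assumes RSQLite is installed).
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "numbers", data.frame(x = 1:10))

rows_seen <- 0
with_chunked_query(con, "SELECT x FROM numbers ORDER BY x", function(chunk) {
  rows_seen <<- rows_seen + nrow(chunk)
  # Needle in a haystack: stop fetching once the value 5 has been found.
  !any(chunk$x == 5)
}, n = 3)

dbDisconnect(con)
```

Because the result set is cleared in on.exit(), the caller cannot forget to release it, which is the kind of mistake the callback style is meant to prevent.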

krlmlr commented 7 years ago

> All of these can be performed today by dbSendQuery and dbFetch

... then it may as well live in another package?

> can also be a common pattern

... then it should perhaps live in another package?

If we have generic code that defines a data source as an R6 class (such as https://github.com/krlmlr/pumpr), we can add a wrapper for DBI connections there. Would that work?

krlmlr commented 4 years ago

Do we have a good way to process data piecemeal by now? Is there anything that needs to be done in DBI?

zyxdef commented 4 years ago

Hoping to be pardoned for barging into the discussion: memory swapping is a true killer when it comes to processing large remote datasets. So a function that would be quite handy is dbSuggestN(), which retrieves the currently available physical memory (which may, of course, fluctuate) and the record size on the server (perhaps taking some network overhead into account) and suggests a reasonable value for n (one that fits in, say, 80% of the current free memory), in the hope of significantly reducing the chance of swapping.
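
The chunk-size arithmetic suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: dbSuggestN() does not exist in DBI, the helper suggest_n() and its arguments are hypothetical, the free memory is supplied by the caller (base R has no portable way to query it), and the row size is estimated on the R side from a sample chunk rather than from the server's record size or network overhead.

```r
# Hypothetical helper (dbSuggestN() does not exist in DBI): suggest a chunk
# size so that one fetched chunk fits in a fraction of the free memory.
# `free_bytes` is an assumption supplied by the caller.
suggest_n <- function(sample_chunk, free_bytes, fraction = 0.8) {
  stopifnot(nrow(sample_chunk) > 0, free_bytes > 0,
            fraction > 0, fraction <= 1)
  # Approximate in-memory size of one row from a small sample already fetched.
  bytes_per_row <- as.numeric(utils::object.size(sample_chunk)) /
    nrow(sample_chunk)
  # Number of rows that fit in the allowed share of memory, at least 1.
  max(1, floor(fraction * free_bytes / bytes_per_row))
}

sample_chunk <- data.frame(id = 1:100, value = runif(100))
n_small <- suggest_n(sample_chunk, free_bytes = 1 * 1024^3)  # assume 1 GiB free
n_large <- suggest_n(sample_chunk, free_bytes = 2 * 1024^3)  # assume 2 GiB free
```

The result could then be passed as n to dbFetch(). Since object.size() includes the fixed overhead of the data frame, an estimate based on a small sample tends to err on the conservative side.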

krlmlr commented 3 years ago

@zyxdef: Interesting idea in the context of chunked processing.

I now think this is out of scope for DBI.

github-actions[bot] commented 2 years ago

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.