dplyr::copy_to() performance with odbc 1.5.0 #862

DavisVaughan commented 1 week ago

See https://github.com/tidyverse/dplyr/issues/7103

simonpcouch commented 1 week ago

Looping in @nathanhaigh, and thanks for the issue. This is related to #774 and #391. We've been iterating on this default over time, and it looks like the most recent changes introduced a slowdown in the context you've noted (Snowflake, larger data).

Maybe we could introduce a batch_rows() helper with DBMS-specific methods.
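Very roughly, and purely as a sketch (the generic, the dispatch classes, and the numbers below are assumptions for illustration, not an existing API), it might look like:

```r
library(odbc)

# Hypothetical generic: let each backend choose a sensible default batch size.
# Nothing like this exists in odbc today; this is only a sketch of the idea.
setGeneric("batch_rows", function(con, data, ...) standardGeneric("batch_rows"))

# Conservative fallback for backends we know nothing about.
setMethod("batch_rows", "OdbcConnection", function(con, data, ...) 1024L)

# A backend that handles large bulk inserts well could opt into bigger batches.
# Using "Snowflake" as a dispatch class here is an assumption for illustration.
setMethod("batch_rows", "Snowflake", function(con, data, ...) {
  min(nrow(data), 100000L)
})
```

dbWriteTable() (and so dplyr::copy_to()) could then consult batch_rows() whenever the user doesn't supply a value, so most people would never need to find the option.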

detule commented 1 week ago

Hey @simonpcouch

I see that @nathanhaigh has discovered how to use the package options to adjust the batch size. Following along with his (excellent) write-up, it seems like he is mostly curious why the change to the default was made.
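For anyone else who lands here, the workaround looks roughly like this (the DSN, table name, and the 1e5 value are placeholders; pick a batch size that fits your memory budget):

```r
library(DBI)
library(dplyr)

# Rows sent per insert batch; the package default is 1024.
# Larger batches mean fewer round trips, but bigger buffers handed to the driver.
options(odbc.batch_rows = 1e5)

con <- dbConnect(odbc::odbc(), dsn = "my_snowflake_dsn")  # placeholder DSN
copy_to(con, my_table, "MY_TABLE", temporary = FALSE)
```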

I'll just chime in with the motivation. I know you're aware of this since you addressed the issue originally, but I'm posting it here for general benefit. I believe the pull request came about because at least one person reported that, with the previous set of parameters, RAM usage was excessive. That makes sense: when inserting a large data set in a single batch, we populate potentially very large buffers and pass them to the driver.

We should probably have a more sophisticated default than 1024. For example, if inserting a narrow table of numeric data, we can probably afford to grab many (or all) rows at once. If, on the other hand, we have a wide table with LONG fields, we may need to scale the number of rows down according to some gradient, as sketched below. This kind of analysis would come with a performance hit of its own, but that hit is likely small compared to what some users (those that are less memory constrained) are experiencing now.
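As a strawman, that gradient could be as simple as a per-batch byte budget divided by an estimate of the row width, counting LONG/blob-ish columns as expensive. None of this exists in odbc; the per-column costs and the budget are invented for illustration:

```r
# Rough guess at how many bytes one row of `data` occupies once bound.
est_col_bytes <- function(col) {
  if (is.character(col)) max(nchar(col, type = "bytes"), 8L, na.rm = TRUE)
  else if (is.list(col)) 1024L  # blob-ish / LONG columns: assume they are expensive
  else 8L                       # numeric, integer, logical, dates
}

rows_for_width <- function(data, budget_bytes = 64 * 1024^2) {
  row_bytes <- sum(vapply(data, est_col_bytes, numeric(1)))
  max(1024L, floor(budget_bytes / row_bytes))
}
```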

nathanhaigh commented 1 week ago

I am curious about the change, as 1024 seems like a very low default. Perhaps that says more about the data science space I operate in, but I wouldn't have thought tables of tens to hundreds of thousands of rows were particularly big. Inserting in batches of 1024 rows then requires tens to hundreds of insert statements. In the end, 1024 rows may represent only a very small amount of data, so a more intelligent approach based on data size (e.g. chunking to x MB) might work better. One option might be to combine the number of bytes used by the table (object.size(my_table)) with the number of rows to figure out an appropriate value for batch_rows, along the lines of the sketch below.
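Something like this, as a sketch of that suggestion (chunk_mb and the 1024-row floor are arbitrary, and object.size() is only a rough proxy for what the driver buffers will actually hold):

```r
# Target roughly `chunk_mb` of data per insert batch, based on object.size().
batch_rows_for <- function(tbl, chunk_mb = 16) {
  bytes_per_row <- as.numeric(object.size(tbl)) / max(nrow(tbl), 1L)
  max(1024L, as.integer((chunk_mb * 1024^2) %/% bytes_per_row))
}

options(odbc.batch_rows = batch_rows_for(my_table))
```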

@detule The issue you link to above seems to be about a memory leak rather than memory usage. In that sense, batch_rows seems to be more about limiting memory usage, perhaps with the side effect of reducing memory leakage?

From a dplyr user perspective, setting the odbc.batch_rows option was not an obvious solution: it is buried a few layers down, and it took me some time to discover it as a potential fix for the performance issues I was observing. @simonpcouch, how would the following statement manifest itself to the user?

we could introduce a batch_rows() helper with DBMS-specific methods