r-dbi / dbi3

DBI revisited
https://r-dbi.github.io/dbi3

Let dbReadTable() use COPY #9

Open parisni opened 3 years ago

parisni commented 3 years ago

dbReadTable() could use COPY to export the table to CSV and then read the CSV into a data frame.

This needs a temporary folder, but it would improve performance in many cases.

Also, several R readers such as fread() could be plugged in to get different return objects, such as a data.frame, a data.table, or whatever else.
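
A minimal sketch of that flow, assuming an RPostgres connection, a placeholder table name "big_table", and that the R session can read a file written by the Postgres server (e.g. both run on the same host, with suitable privileges for server-side COPY):

```r
library(DBI)
library(RPostgres)
library(data.table)

# Hypothetical connection; adjust dbname/host/credentials to your setup.
con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")

# Read a table by asking the server to COPY it to a temporary CSV,
# then parsing that CSV with a pluggable reader (fread by default).
read_table_via_copy <- function(con, name, reader = data.table::fread) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp), add = TRUE)
  # Server-side COPY: requires that the server can write to `tmp`, that the
  # R session can read it, and superuser / pg_write_server_files privileges.
  sql <- paste0(
    "COPY ", dbQuoteIdentifier(con, name),
    " TO '", tmp, "' WITH (FORMAT csv, HEADER true)"
  )
  dbExecute(con, sql)
  # Any CSV reader can be plugged in here (fread, readr::read_csv, ...).
  reader(tmp)
}

dt <- read_table_via_copy(con, "big_table")  # "big_table" is a placeholder
```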

krlmlr commented 2 years ago

Thanks. In what situation would COPY be faster? Would you like to share an example and timings?

parisni commented 2 years ago

Well, from my experience COPY is more effective in almost every case. You can notice the performance improvement on large enough tables (millions of rows). Another advantage is that it has a lower impact on the Postgres database compared with the classic method, since no cursor is needed.

A simple way to prove this is to benchmark on several table sizes:
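
For instance, a rough timing comparison could look like the following sketch, reusing the hypothetical connection and read_table_via_copy() helper from above; the table names are placeholders and actual numbers depend on the data and network:

```r
# Hypothetical test tables of increasing size; replace with real table names.
tables <- c("rows_1e5", "rows_1e6", "rows_1e7")

timings <- sapply(tables, function(tbl) {
  classic <- unname(system.time(dbReadTable(con, tbl))["elapsed"])
  copy    <- unname(system.time(read_table_via_copy(con, tbl))["elapsed"])
  c(classic = classic, copy = copy)
})

print(timings)  # one column per table, rows = elapsed seconds per method
```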

Also, COPY has a drawback: it loses column types in the process, so you have to get this information from the database and inject it into the CSV reader afterwards. Moreover, you cannot deal with bytea fields, nor with arrays (depending on the CSV reader's capabilities).
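
One way to work around the type loss, sketched under the same assumptions as above (the Postgres-to-R type mapping here is illustrative and incomplete):

```r
# Ask the catalog for the column types of the (hypothetical) table.
cols <- dbGetQuery(con, "
  SELECT column_name, data_type
  FROM information_schema.columns
  WHERE table_name = 'big_table'
  ORDER BY ordinal_position
")

# Illustrative (incomplete) mapping from Postgres types to fread colClasses.
to_r_class <- function(pg_type) {
  switch(pg_type,
    "integer"          = "integer",
    "bigint"           = "integer64",  # needs the bit64 package
    "double precision" = "numeric",
    "numeric"          = "numeric",
    "character"  # fall back to character for everything else
  )
}

col_classes <- setNames(
  vapply(cols$data_type, to_r_class, character(1)),
  cols$column_name
)

# Plug the recovered types into the CSV reader.
dt <- read_table_via_copy(
  con, "big_table",
  reader = function(path) data.table::fread(path, colClasses = col_classes)
)
```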

For information, the psycopg3 Python library is a full rewrite of psycopg2 focused on COPY for any interaction with Postgres: https://www.psycopg.org/psycopg3/

krlmlr commented 2 years ago

Good point about column types. I think we should stick with the current approach for dbReadTable(). Unfortunately, COPY table_name TO STDOUT currently hangs R; we could mitigate this with a new postgresCopyTo() function.
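
For illustration only, such a helper might expose an interface like the following sketch; this is purely hypothetical and nothing of this name exists in RPostgres today:

```r
# Purely hypothetical sketch of the proposed postgresCopyTo() helper.
postgresCopyTo <- function(conn, name, path, format = c("csv", "text")) {
  format <- match.arg(format)
  sql <- paste0(
    "COPY ", DBI::dbQuoteIdentifier(conn, name),
    " TO STDOUT WITH (FORMAT ", format, ")"
  )
  # A real implementation would have to stream the COPY output from libpq
  # into `path`; plain dbExecute()/dbGetQuery() cannot do that (and, as noted
  # above, COPY ... TO STDOUT currently hangs R).
  message("Would run: ", sql)
  stop("not implemented: needs libpq-level streaming support in RPostgres")
}
```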

krlmlr commented 2 years ago

An integration with Arrow might bring us much more bang for the buck, including type safety.