**Closed** — kmishra9 closed this issue 3 months ago
Thanks. This is a hard problem, also tracked in https://github.com/r-dbi/DBI/issues/252.
Loading large data works best if the data is near the server. Also, it helps to disable or delete things like indexes and constraints. To load large data efficiently, a little more than a single function call will be necessary. Procedures will vary vastly across databases.
For small data, these things don't matter that much, reliability is important, and `dbWriteTable()` should just work.
The current Redshift implementation creates a huge SQL query that inserts all rows. As you noticed, this collides with Redshift's limit on the query size. To work around this, we need a better version of `DBI::sqlAppendTable()` that returns chunks of SQL with a predefined maximum length.
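The chunked variant could look roughly like this sketch; `append_table_chunked` and `chunk_rows` are hypothetical names, not part of DBI:

```r
# Sketch only: generate one INSERT statement per chunk of rows, so each
# statement stays under the server's query-size limit.
# append_table_chunked / chunk_rows are hypothetical, not part of DBI.
library(DBI)

append_table_chunked <- function(con, name, df, chunk_rows = 1000L) {
  groups <- (seq_len(nrow(df)) - 1L) %/% chunk_rows
  lapply(split(df, groups), function(chunk) {
    sqlAppendTable(con, name, chunk, row.names = FALSE)
  })
}

# DBI::ANSI() lets us render SQL without a live connection:
sqls <- append_table_chunked(ANSI(), "mtcars", mtcars, chunk_rows = 10L)
length(sqls)  # 32 rows at 10 per chunk -> 4 INSERT statements
```

Against a real connection, each element would then be run with `dbExecute(con, sql)`, with the chunk size tuned so the rendered SQL stays under the limit.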
> It is "hard" to get data into redshift in ways that probably shouldn't be hard.
I think I said this in a way that doesn't convey how much I appreciate all the work you guys have put into making `DBI` and `dbplyr` awesome to use, with tremendous cross-db compatibility. I only meant that it feels harder to get things into Redshift than into other DBs (seemingly, at least haha... probably because `copy_to()` doesn't work as well).
Agree w/ your assessment of the problem & the most reasonable solution!
Do you think a different function, or even a graceful fallback of `dbWriteTable()`, that relies on an upload to S3 followed by a native Redshift `COPY` command for large datasets would ever be in scope to implement? This would be more complex than the existing implementation from a "process" and "number of things that could go wrong" perspective... but it could also be more efficient than generating a massive SQL query with all of the data.
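The S3-then-`COPY` path could be sketched roughly as below. This is not part of RPostgres or DBI; it assumes the `aws.s3` package, a bucket the client can write to, and an IAM role attached to the cluster with read access to that bucket (all names are placeholders):

```r
# Sketch only: stage the data frame as a CSV in S3, then ask Redshift to
# ingest it with COPY. Bucket and IAM role ARN are placeholders.
library(DBI)
library(aws.s3)

copy_via_s3 <- function(con, name, df, bucket, iam_role) {
  tmp <- tempfile(fileext = ".csv")
  write.csv(df, tmp, row.names = FALSE)

  # Upload the staged file to S3
  put_object(tmp, object = basename(tmp), bucket = bucket)

  # Redshift pulls the file server-side, bypassing the query-size limit
  dbExecute(con, paste0(
    "COPY ", dbQuoteIdentifier(con, name),
    " FROM 's3://", bucket, "/", basename(tmp), "'",
    " IAM_ROLE '", iam_role, "'",
    " CSV IGNOREHEADER 1"
  ))
}
```

The trade-off is exactly the one described above: more moving parts (S3 credentials, bucket permissions, cleanup of staged files), in exchange for ingestion that scales to large datasets.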
Missed the question, sorry.
I think the upload to S3 plus `COPY` should live elsewhere. It's too different from what this package does (basically, wrapping libpq).
Hey there,
I'd run into some issues trying to upload data to Redshift using RPostgres and non-Redshift-specific drivers a couple of years ago, and developed a workaround then that relied on pushing data to S3 first and then copying it into Redshift.
That solution utilizes the somehow-still-working RedshiftTools package, so when I was doing some refactoring, I was eager to see whether `DBI::dbWriteTable()` had made any progress on this front in the couple of years since, and figured I'd toss a reprex your way with any bugs I saw.
It is "hard" to get data into redshift in ways that probably shouldn't be hard. Both `dplyr::copy_to()` and `dbplyr::dbWriteTable()` don't work as expected for me, and I'm forced to rely on a hackier workaround than I'd like to. If one outcome of this bug report is to say "you should use `dbWriteTable()` to upload a CSV instead of a DF", that's totally fine w/ me, but that part should probably be fixed to work with schema naming and `DBI::Id`, just like the "lite" version with the small DF does.

Created on 2023-03-30 with reprex v2.0.2
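For reference, the schema-qualified naming in question can be checked without a live connection; the schema and table names below are placeholders:

```r
# Sketch: DBI::Id() names a table inside a schema; dbQuoteIdentifier()
# shows how a backend would render it. "analytics"/"my_table" are
# placeholder names.
library(DBI)

tbl_id <- Id(schema = "analytics", table = "my_table")

# With a real connection one would write:
# dbWriteTable(con, tbl_id, mtcars)
dbQuoteIdentifier(ANSI(), tbl_id)  # quotes to "analytics"."my_table"
```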