Open DavisVaughan opened 6 months ago
Thanks. Reprex:
library(RPostgres)
con <- postgresDefault()
dbExecute(con, "CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")
#> [1] 0
dbExecute(con, "CREATE TABLE x (a mood)")
#> [1] 0
dbExecute(con, "INSERT INTO x VALUES ('sad')")
#> [1] 1
dbExecute(con, "INSERT INTO x VALUES ('ok')")
#> [1] 1
out <- dbGetQuery(con, "SELECT * FROM x")
dput(out)
#> structure(list(a = structure(c("sad", "ok"), class = "pq_NA")), row.names = c(NA,
#> -2L), class = "data.frame")
out <- dbGetQuery(con, "SELECT CAST(a AS text) FROM x")
dput(out)
#> structure(list(a = c("sad", "ok")), class = "data.frame", row.names = c(NA,
#> -2L))
dbExecute(con, "DROP TABLE IF EXISTS x")
#> [1] 0
dbExecute(con, "DROP TYPE IF EXISTS mood")
#> [1] 0
Created on 2024-01-22 with reprex v2.0.2
Are you requesting the removal of the "pq_mood"
class? (For some reason, the class is "pq_NA"
in this example, I wonder why.) This is easy to do either by the client (use unclass()
for the relevant columns) or by reformulating your query as I did.
Would you like to share the use case behind this request?
She uses dbplyr, so she has little control over the underlying query and is eventually just calling collect()
to bring the data frame into R.
It's mostly a problem when you start using it like a character vector with other parts of the tidyverse, like:
x <- structure(c("sad", "ok"), class = "pq_NA")
vctrs::vec_c(x, "stuff")
#> Error in `vctrs::vec_c()`:
#> ! Can't combine `..1` <pq_NA> and `..2` <character>.
It seems reasonable to me that we should try and map as many Postgres types to native R types as we can, and mapping an enum to a character or factor seemed pretty natural
This is easy to do either by the client (use unclass() for the relevant columns)
My argument is that this extra class serves no purpose for her (and most users, I'd imagine), and it would be better off as a bare character vector or a factor
That's not wrong. For now, a possible workaround might be something along the lines of
collect <- function(x, ...) {
out <- dplyr::collect(x, ...)
out[] <- map(out, ~ if (is.character(.x)) unclass(.x) else .x)
out
}
Would it help if all enum classes had a common superclass so that we could define casting/coercion operators for vctrs?
Adding a special dbConnect()
option looks like a hasty kludge. Let's think about it, in particular in the context of other database drivers that might need to deal with enums, like duckdb.
Is there a reason enums don't come through as factors? That feels like a 1:1 match to me
No particular reason, but I'm not sure if the levels are available as part of the result.
It seems like you look the labels up in an overarching pg_enum
catalog based on the owning type's id or something like that
https://www.postgresql.org/docs/current/catalog-pg-enum.html
It's interesting that "pq_NA"
is set only for enums that are created as part of the current connection (last reprex). After reconnecting, the class is "pq_mood"
:
library(RPostgres)
con <- postgresDefault()
dbExecute(con, "CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")
#> [1] 0
dbDisconnect(con)
con <- postgresDefault()
dbExecute(con, "CREATE TABLE x (a mood)")
#> [1] 0
dbExecute(con, "INSERT INTO x VALUES ('sad')")
#> [1] 1
dbExecute(con, "INSERT INTO x VALUES ('ok')")
#> [1] 1
out <- dbGetQuery(con, "SELECT * FROM x")
dput(out)
#> structure(list(a = structure(c("sad", "ok"), class = "pq_mood")), row.names = c(NA,
#> -2L), class = "data.frame")
out <- dbGetQuery(con, "SELECT CAST(a AS text) FROM x")
dput(out)
#> structure(list(a = c("sad", "ok")), class = "data.frame", row.names = c(NA,
#> -2L))
dbExecute(con, "DROP TABLE IF EXISTS x")
#> [1] 0
dbExecute(con, "DROP TYPE IF EXISTS mood")
#> [1] 0
dbDisconnect(con)
Created on 2024-01-23 with reprex v2.0.2
And we already fetch a @typnames
component as part of dbConnect()
, we could do the same for the enum levels. Although now I'm concerned that this already adds a non-negligible overhead for establishing a connection. Likely would have to be opt-in.
If the enums were factors, we would get cast and coercion behavior for free.
@paleolimbot: How do you (intend to) handle enum types in ADBC for Postgres?
It looks like they're not handled yet!
library(adbcdrivermanager)
db <- adbc_database_init(
adbcpostgresql::adbcpostgresql(),
uri = "postgresql://localhost:5432/postgres?user=postgres&password=password")
db |>
execute_adbc("CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")
con <- db |> adbc_connection_init()
con |>
execute_adbc("CREATE TABLE x (a mood)") |>
execute_adbc("INSERT INTO x VALUES ('sad')") |>
execute_adbc("INSERT INTO x VALUES ('ok')")
con |>
read_adbc("SELECT * from x")
#> Error in adbc_statement_execute_query(stmt, stream, stream_join_parent = TRUE): [libpq] Column #1 ("a") has unknown type code 16385
Created on 2024-01-23 with reprex v2.0.2
Just kidding, I forgot about the part where we cache the types in the database. It's returned as binary but the COPY output is character, so that's how the driver will return it:
library(adbcdrivermanager)
db <- adbc_database_init(
adbcpostgresql::adbcpostgresql(),
uri = "postgresql://localhost:5432/postgres?user=postgres&password=password")
db |>
execute_adbc("CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")
db |>
adbc_database_release()
# Have to recreate the database after because that's where we cache the types
con <- db <- adbc_database_init(
adbcpostgresql::adbcpostgresql(),
uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
) |>
adbc_connection_init()
con |>
execute_adbc("CREATE TABLE x (a mood)") |>
execute_adbc("INSERT INTO x VALUES ('sad')") |>
execute_adbc("INSERT INTO x VALUES ('ok')")
# Currently returned as binary (i.e., COPY output)
con |>
read_adbc("SELECT * from x") |>
as.data.frame() |>
dplyr::pull(a) |>
unclass() |>
vapply(rawToChar, character(1))
#> [1] "sad" "ok"
Created on 2024-01-23 with reprex v2.0.2
Sorry for hijacking, but after https://github.com/apache/arrow-adbc/pull/1485 it'll be:
library(adbcdrivermanager)
db <- adbc_database_init(
adbcpostgresql::adbcpostgresql(),
uri = "postgresql://localhost:5432/postgres?user=postgres&password=password")
db |>
execute_adbc("CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")
db |>
adbc_database_release()
# Have to recreate the database after because that's where we cache the types
con <- db <- adbc_database_init(
adbcpostgresql::adbcpostgresql(),
uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
) |>
adbc_connection_init()
con |>
execute_adbc("CREATE TABLE x (a mood)") |>
execute_adbc("INSERT INTO x VALUES ('sad')") |>
execute_adbc("INSERT INTO x VALUES ('ok')")
# Returns as string because that's how the driver sees it
con |>
read_adbc("SELECT * from x") |>
as.data.frame()
#> a
#> 1 sad
#> 2 ok
# If you know the R type you can specify it
con |>
read_adbc("SELECT * from x") |>
nanoarrow::convert_array_stream(
to = data.frame(x = factor(levels = c("mood", "sad", "ok")))
) |>
str()
#> 'data.frame': 2 obs. of 1 variable:
#> $ x: Factor w/ 3 levels "mood","sad","ok": 2 3
Created on 2024-01-23 with reprex v2.0.2
@paleolimbot engaging with your hijacking (sorry, not sorry 😬):
Would a reverse conversion, i.e. character/factor to an existing enum with write_adbc(mode = "append")
, be possible, too?
Would an update cache function be useful (instead of having to (de)connect)?
My wife has a scenario where she's pulling in a number of columns that end up with a class of
"pq_stuff_enum"
. She doesn't really care that its an enum, and would really like it to come through as a character or possibly a factor.It seems like
typcategory
has a category codeE
for enums: https://www.postgresql.org/docs/current/catalog-pg-type.htmlI wonder if we can detect that when pulling the column in? Maybe around here? https://github.com/r-dbi/RPostgres/blob/f15f8f445def9e24ff98e4bfc910efcf705c85aa/src/PqResultImpl.cpp#L115
I could see a
dbConnect()
argument with something likeenum = c("enum", "character", "factor")
or something similar as a way to control how these are handled