r-dbi / RPostgres

A DBI-compliant interface to PostgreSQL
https://rpostgres.r-dbi.org
Other
328 stars 78 forks source link

Option to convert "enum" types to factor or character? #455

Open DavisVaughan opened 6 months ago

DavisVaughan commented 6 months ago

My wife has a scenario where she's pulling in a number of columns that end up with a class of "pq_stuff_enum". She doesn't really care that its an enum, and would really like it to come through as a character or possibly a factor.

It seems like typcategory has a category code E for enums: https://www.postgresql.org/docs/current/catalog-pg-type.html

I wonder if we can detect that when pulling the column in? Maybe around here? https://github.com/r-dbi/RPostgres/blob/f15f8f445def9e24ff98e4bfc910efcf705c85aa/src/PqResultImpl.cpp#L115

I could see a dbConnect() argument with something like enum = c("enum", "character", "factor") or something similar as a way to control how these are handled

krlmlr commented 6 months ago

Thanks. Reprex:

library(RPostgres)

con <- postgresDefault()
dbExecute(con, "CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")
#> [1] 0
dbExecute(con, "CREATE TABLE x (a mood)")
#> [1] 0
dbExecute(con, "INSERT INTO x VALUES ('sad')")
#> [1] 1
dbExecute(con, "INSERT INTO x VALUES ('ok')")
#> [1] 1

out <- dbGetQuery(con, "SELECT * FROM x")
dput(out)
#> structure(list(a = structure(c("sad", "ok"), class = "pq_NA")), row.names = c(NA, 
#> -2L), class = "data.frame")
out <- dbGetQuery(con, "SELECT CAST(a AS text) FROM x")
dput(out)
#> structure(list(a = c("sad", "ok")), class = "data.frame", row.names = c(NA, 
#> -2L))

dbExecute(con, "DROP TABLE IF EXISTS x")
#> [1] 0
dbExecute(con, "DROP TYPE IF EXISTS mood")
#> [1] 0

Created on 2024-01-22 with reprex v2.0.2

Are you requesting the removal of the "pq_mood" class? (For some reason, the class is "pq_NA" in this example, I wonder why.) This is easy to do either by the client (use unclass() for the relevant columns) or by reformulating your query as I did.

Would you like to share the use case behind this request?

DavisVaughan commented 6 months ago

She uses dbplyr, so she has little control over the underlying query and is eventually just calling collect() to bring the data frame into R.

It's mostly a problem when you start using it like a character vector with other parts of the tidyverse, like:

x <- structure(c("sad", "ok"), class = "pq_NA")

vctrs::vec_c(x, "stuff")
#> Error in `vctrs::vec_c()`:
#> ! Can't combine `..1` <pq_NA> and `..2` <character>.

It seems reasonable to me that we should try and map as many Postgres types to native R types as we can, and mapping an enum to a character or factor seemed pretty natural

This is easy to do either by the client (use unclass() for the relevant columns)

My argument is that this extra class serves no purpose for her (and most users, I'd imagine), and it would be better off as a bare character vector or a factor

krlmlr commented 6 months ago

That's not wrong. For now, a possible workaround might be something along the lines of

collect <- function(x, ...) {
  out <- dplyr::collect(x, ...)
  out[] <- map(out, ~ if (is.character(.x)) unclass(.x) else .x)
  out
}

Would it help if all enum classes had a common superclass so that we could define casting/coercion operators for vctrs?

Adding a special dbConnect() option looks like a hasty kludge. Let's think about it, in particular in the context of other database drivers that might need to deal with enums, like duckdb.

DavisVaughan commented 6 months ago

Is there a reason enums don't come through as factors? That feels like a 1:1 match to me

krlmlr commented 6 months ago

No particular reason, but I'm not sure if the levels are available as part of the result.

DavisVaughan commented 6 months ago

It seems like you look the labels up in an overarching pg_enum catalog based on the owning type's id or something like that https://www.postgresql.org/docs/current/catalog-pg-enum.html

krlmlr commented 6 months ago

It's interesting that "pq_NA" is set only for enums that are created as part of the current connection (last reprex). After reconnecting, the class is "pq_mood" :

library(RPostgres)

con <- postgresDefault()
dbExecute(con, "CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")
#> [1] 0
dbDisconnect(con)

con <- postgresDefault()
dbExecute(con, "CREATE TABLE x (a mood)")
#> [1] 0
dbExecute(con, "INSERT INTO x VALUES ('sad')")
#> [1] 1
dbExecute(con, "INSERT INTO x VALUES ('ok')")
#> [1] 1

out <- dbGetQuery(con, "SELECT * FROM x")
dput(out)
#> structure(list(a = structure(c("sad", "ok"), class = "pq_mood")), row.names = c(NA, 
#> -2L), class = "data.frame")
out <- dbGetQuery(con, "SELECT CAST(a AS text) FROM x")
dput(out)
#> structure(list(a = c("sad", "ok")), class = "data.frame", row.names = c(NA, 
#> -2L))

dbExecute(con, "DROP TABLE IF EXISTS x")
#> [1] 0
dbExecute(con, "DROP TYPE IF EXISTS mood")
#> [1] 0
dbDisconnect(con)

Created on 2024-01-23 with reprex v2.0.2

And we already fetch a @typnames component as part of dbConnect() , we could do the same for the enum levels. Although now I'm concerned that this already adds a non-negligible overhead for establishing a connection. Likely would have to be opt-in.

If the enums were factors, we would get cast and coercion behavior for free.

@paleolimbot: How do you (intend to) handle enum types in ADBC for Postgres?

paleolimbot commented 6 months ago

It looks like they're not handled yet!

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = "postgresql://localhost:5432/postgres?user=postgres&password=password") 

db |> 
  execute_adbc("CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")

con <- db |> adbc_connection_init()

con |> 
  execute_adbc("CREATE TABLE x (a mood)") |> 
  execute_adbc("INSERT INTO x VALUES ('sad')") |> 
  execute_adbc("INSERT INTO x VALUES ('ok')")

con |> 
  read_adbc("SELECT * from x")
#> Error in adbc_statement_execute_query(stmt, stream, stream_join_parent = TRUE): [libpq] Column #1 ("a") has unknown type code 16385

Created on 2024-01-23 with reprex v2.0.2

paleolimbot commented 6 months ago

Just kidding, I forgot about the part where we cache the types in the database. It's returned as binary but the COPY output is character, so that's how the driver will return it:

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = "postgresql://localhost:5432/postgres?user=postgres&password=password") 

db |> 
  execute_adbc("CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")

db |> 
  adbc_database_release()

# Have to recreate the database after because that's where we cache the types
con <- db <- adbc_database_init(
    adbcpostgresql::adbcpostgresql(),
    uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
  ) |> 
  adbc_connection_init()

con |> 
  execute_adbc("CREATE TABLE x (a mood)") |> 
  execute_adbc("INSERT INTO x VALUES ('sad')") |> 
  execute_adbc("INSERT INTO x VALUES ('ok')")

# Currently returned as binary (i.e., COPY output)
con |> 
  read_adbc("SELECT * from x") |> 
  as.data.frame() |> 
  dplyr::pull(a) |> 
  unclass() |> 
  vapply(rawToChar, character(1))
#> [1] "sad" "ok"

Created on 2024-01-23 with reprex v2.0.2

paleolimbot commented 6 months ago

Sorry for hijacking, but after https://github.com/apache/arrow-adbc/pull/1485 it'll be:

library(adbcdrivermanager)

db <- adbc_database_init(
  adbcpostgresql::adbcpostgresql(),
  uri = "postgresql://localhost:5432/postgres?user=postgres&password=password") 

db |> 
  execute_adbc("CREATE TYPE mood AS ENUM ('sad', 'ok', 'happy')")

db |> 
  adbc_database_release()

# Have to recreate the database after because that's where we cache the types
con <- db <- adbc_database_init(
    adbcpostgresql::adbcpostgresql(),
    uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
  ) |> 
  adbc_connection_init()

con |> 
  execute_adbc("CREATE TABLE x (a mood)") |> 
  execute_adbc("INSERT INTO x VALUES ('sad')") |> 
  execute_adbc("INSERT INTO x VALUES ('ok')")

# Returns as string because that's how the driver sees it
con |> 
  read_adbc("SELECT * from x") |> 
  as.data.frame()
#>     a
#> 1 sad
#> 2  ok

# If you know the R type you can specify it
con |> 
  read_adbc("SELECT * from x") |> 
  nanoarrow::convert_array_stream(
    to = data.frame(x = factor(levels = c("mood", "sad", "ok")))
  ) |> 
  str()
#> 'data.frame':    2 obs. of  1 variable:
#>  $ x: Factor w/ 3 levels "mood","sad","ok": 2 3

Created on 2024-01-23 with reprex v2.0.2

dpprdan commented 6 months ago

@paleolimbot engaging with your hijacking (sorry, not sorry 😬):

Would a reverse conversion, i.e. character/factor to an existing enum with write_adbc(mode = "append"), be possible, too?

Would an update cache function be useful (instead of having to (de)connect)?