tidyverse / dplyr

dplyr: A grammar of data manipulation
https://dplyr.tidyverse.org/
Other
4.75k stars 2.12k forks source link

database identifier case (in)sensitivity and quotations in sql #1321

Closed einarhjorleifsson closed 9 years ago

einarhjorleifsson commented 9 years ago

Question/request: Is it possible to have an option within dplyr for creating an sql string without case sensitivity enforcement (quotation) on identifiers when extracting data from database tables?

Background below:

According to this overview of case sensitivity of identifiers among different database platforms things are a bit of a mess in the sql world. E.g. in Postgres the default is to put identifiers (table and field names) to lower case (if quotation is not specified) while in Oracle it is put to uppercase.

Lets look into this using postgres, creating a table and adding some records via psql. In the terminal window:

> psql
hafstormur/home/einarhj [1002] psql
psql (9.3.6)
Type "help" for help.
einarhj-> \c test
You are now connected to database "test" as user "einarhj".
test=> CREATE TABLE Test_Table (Id INTEGER, Name varchar);
CREATE TABLE
test=> \d
           List of relations
 Schema |    Name    | Type  |  Owner  
--------+------------+-------+---------
 public | test_table | table | einarhj
(1 row)
test=> \d test_table
       Table "public.test_table"
 Column |       Type        | Modifiers 
--------+-------------------+-----------
 id     | integer           | 
 name   | character varying |
test=> INSERT INTO test_table VALUES(10,'Einar');
INSERT 0 1
test=> INSERT INTO test_table VALUES(20,'Bjorn');
INSERT 0 1
test=> SELECT Name, Id FROM Test_Table;
 name  | id 
-------+----
 Einar | 10
 Bjorn | 20
(2 rows)

Ergo: If no quotation is specfied all identifiers are converted to lower case.

Within the dplyr using quotation around all identifiers is the standard. I.e. the only syntax that works to access the table created above is the following:

library(dplyr)
db <- src_postgres("test")
tbl(db, "test_table")
## Source: postgres 9.3.6 [einarhj@localhost:5432/test]
## From: test_table [2 x 2]
## 
##   id  name
## 1 10 Einar
## 2 20 Bjorn
q1 <- tbl(db, "test_table") %>%
  select(name, id)
explain(q1)

Related to this is this oddity (same sql command as run above from withing psql):

q2 <- tbl(db, sql("select Id, Name from Test_Table"))
explain(q2)
## <SQL>
## SELECT "id", "name"
## FROM (select Id, Name from Test_Table) AS "_W1"
## 
## 
## <PLAN>
## Seq Scan on test_table  (cost=0.00..22.30 rows=1230 width=36)
q2
## Source: postgres 9.3.6 [einarhj@localhost:5432/test]
## From: <derived table> [?? x 2]
## 
##    id  name
## 1  10 Einar
## 2  20 Bjorn
## .. ..   ...

In Oracle everthing is reversed, the default case on identifiers is uppercase. Hence for an equivalently created table in Oracle as was done in the psql (ignore that the field names are not exactly the same) one would need to do the following within the dplyr framework:

library(dplyr)
library(ROracle, lib.loc = "/usr/local/lib/R/site/3.2/x86_64/library")
## Warning: package 'ROracle' was built under R version 3.2.0
library(dplyrOracle)
db <- src_oracle("mar")
tbl(db, "TEST_TABLE")
## Source: Oracle 11.2.0.3.0 [@mar:/mar]
## From: TEST_TABLE [2 x 2]
## 
##   ID              SURNAME
## 1 10 Einar               
## 2 20 Bjorn
q1 <- tbl(db, "TEST_TABLE") %>%
  select(SURNAME, ID)
explain(q1)
## <SQL> EXPLAIN PLAN FOR SELECT "SURNAME" AS "SURNAME", "ID" AS "ID"
## FROM "TEST_TABLE"

Again, we still have the following oddity:

q2 <- tbl(db, sql("select Id, Surname from Test_Table"))
explain(q2)
## <SQL>
## SELECT "ID", "SURNAME"
## FROM (select Id, Surname from Test_Table) "_W2"
## <SQL> EXPLAIN PLAN FOR SELECT "ID", "SURNAME"
## FROM (select Id, Surname from Test_Table) "_W2"
## <PLAN>
## TRUE
q2
## Source: Oracle 11.2.0.3.0 [@mar:/mar]
## From: <derived table> [?? x 2]
## 
##    ID              SURNAME
## 1  10 Einar               
## 2  20 Bjorn               
## .. ..                  ...
devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.1.2 (2014-10-31)
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       Atlantic/Reykjavik          
##  date     2015-08-14
## Packages ------------------------------------------------------------------
##  package     * version date       source                                 
##  assertthat    0.1     2013-12-06 CRAN (R 3.0.1)                         
##  DBI         * 0.3.1   2014-09-24 CRAN (R 3.1.2)                         
##  devtools      1.9.0   2015-08-10 Github (hadley/devtools@2881db5)       
##  digest        0.6.8   2014-12-31 CRAN (R 3.1.2)                         
##  dplyr       * 0.4.2   2015-06-16 CRAN (R 3.1.2)                         
##  dplyrOracle * 0.0.1   2015-07-24 Github (tomasgreif/dplyrOracle@3a39f8e)
##  evaluate      0.7     2015-04-21 CRAN (R 3.1.2)                         
##  formatR       1.2     2015-04-21 CRAN (R 3.1.2)                         
##  htmltools     0.2.6   2014-09-08 CRAN (R 3.1.1)                         
##  knitr         1.10.12 2015-06-22 Github (yihui/knitr@1df6eee)           
##  lazyeval      0.1.10  2015-01-02 CRAN (R 3.1.2)                         
##  magrittr      1.5     2014-11-22 CRAN (R 3.1.2)                         
##  memoise       0.2.1   2014-04-22 CRAN (R 3.1.2)                         
##  R6            2.1.0   2015-07-04 CRAN (R 3.1.2)                         
##  Rcpp          0.12.0  2015-07-25 CRAN (R 3.1.2)                         
##  rmarkdown     0.7     2015-06-22 Github (rstudio/rmarkdown@82fde67)     
##  ROracle     * 1.1-12  2014-11-20 CRAN (R 3.2.0)                         
##  RPostgreSQL   0.4     2013-03-27 CRAN (R 3.1.2)                         
##  stringi       0.5-5   2015-06-29 CRAN (R 3.1.2)                         
##  stringr       1.0.0   2015-04-30 CRAN (R 3.1.2)                         
##  yaml          2.1.13  2014-06-12 CRAN (R 3.1.2)
hadley commented 9 years ago

The SQL syntax translation in dplyr is designed to make tables work like data frames (i.e. case sensitive).