rcconsult / rpostgresql

Automatically exported from code.google.com/p/rpostgresql

Problem dbWriteTable() fails with backslashes embedded in text fields #31

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
> # Test file attached.
> test <- read.csv("~/Desktop/test.csv")
> test
         Company_Name Stock_Price
1              Big Co       23.21
2            Old Corp       33.16
3 Slasher Inc. \\DE\\       55.55
4            Last One       13.13
> dbWriteTable(pg,"test",test)
[1] TRUE
> dbGetQuery(pg,"SELECT * FROM test")
data frame with 0 columns and 0 rows
> # A regular expression gsub() that "double-doubles" backslashes seems to fix this
> # Note that it shows as a single backslash on the PostgreSQL side.
> test <- as.data.frame(sapply(test,gsub,pattern="\\\\",replacement="\\\\\\\\",perl=TRUE))
> dbWriteTable(pg,"test",test,overwrite=TRUE)
[1] TRUE
> dbGetQuery(pg,"SELECT * FROM test")
  row_names        Company_Name Stock_Price
1         1              Big Co       23.21
2         2            Old Corp       33.16
3         3 Slasher Inc. \\DE\\       55.55
4         4            Last One       13.13

What is the expected output? What do you see instead?
The expected output is shown above, after the gsub() fix. Without the fix, the table is written but comes back empty.

What version of the product are you using? On what operating system?
> R.version
               _                            
platform       x86_64-apple-darwin9.8.0     
arch           x86_64                       
os             darwin9.8.0                  
system         x86_64, darwin9.8.0          
status                                      
major          2                            
minor          13.0                         
year           2011                         
month          04                           
day            13                           
svn rev        55427                        
language       R                            
version.string R version 2.13.0 (2011-04-13)

Please provide any additional information below.

Original issue reported on code.google.com by iand...@gmail.com on 26 Apr 2011 at 2:57

Attachments:

GoogleCodeExporter commented 9 years ago
Note that the test data is:

Company_Name,Stock_Price
Big Co,23.21
Old Corp,33.16
Slasher Inc. \DE\,55.55
Last One,13.13

So read.csv() is adding extra "\" characters.

Properly escaping and cleaning input data prior to importing is the responsibility of the user.

The bug here seems to be that dbWriteTable() did not report an error (FALSE) with the write. I've changed the Summary to reflect this.

Can someone verify whether this is a problem on Windows or Unix, or should I assume that it only occurs on Mac?

Original comment by ne...@neiltiffin.com on 28 Apr 2011 at 3:15

GoogleCodeExporter commented 9 years ago
I can't confirm that the problem occurs on a PC or Linux. But I agree that it's read.csv, not RPostgreSQL; sorry for not checking. This data, pulled through read.csv() (and I'd guess read.table() as well), causes problems with both SQLite and MySQL.

Thanks.

Original comment by iand...@gmail.com on 28 Apr 2011 at 3:28

GoogleCodeExporter commented 9 years ago
The silent fail appears to be fixed as of SVN r181, and maybe earlier. SVN r173 claims to fix a similar issue. As of r181 this reports as:

> dbWriteTable(pg,"test",test)
Error in postgresqlgetResult(new.con) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  missing data for column "Stock_Price"
CONTEXT:  COPY test, line 3: "3 Slasher Inc. \DE\   55.55"
)

Now whether it should fail or not is a different question that I will defer to others. I've also changed the title again.

Original comment by ne...@neiltiffin.com on 28 Apr 2011 at 3:47

GoogleCodeExporter commented 9 years ago
I dabbled further with this and think that read.csv() may not be the problem. While print() shows \\, cat() suggests only \.

> test <- read.csv("~/Downloads/test.csv",stringsAsFactors=FALSE)
> test
         Company_Name Stock_Price
1              Big Co       23.21
2            Old Corp       33.16
3 Slasher Inc. \\DE\\       55.55
4            Last One       13.13
> test[3,1]
[1] "Slasher Inc. \\DE\\"
> cat(test[3,1])
Slasher Inc. \DE\

I commented out "on.exit(unlink(fn), add = TRUE)" in PostgreSQLSupport.R to see the output being sent over to PostgreSQL. Here it is (basically what R was given to start with):

1 Big Co  23.21
2 Old Corp  33.16
3 Slasher Inc. \DE\ 55.55
4 Last One  13.13

So only a single backslash is being sent over. Thus a user who wants to send over data with embedded backslashes would need to "fix" these in R before doing so, even though they look fine within R. Below is my kluge, which makes a mess of the data by turning it all into characters before fixing the types on the PostgreSQL side. [Note that this also "fixes" embedded double-quotes, but I've confirmed that this *is* an issue with getting my particular CSV files into R: if these are "escaped" as "" in the text file, then R gets " and sends it to PostgreSQL accordingly.]

dbWriteTable.plus <- function(conn, name, value, row.names = TRUE, 
                              text.to.char=TRUE, fix.names=TRUE,
                              ..., overwrite = FALSE, append = FALSE) {  

  # Fix backslashes and double-quotes
  temp <- as.data.frame(sapply(value,gsub,
    pattern="\\\\",replacement="\\\\\\\\",perl=TRUE),stringsAsFactors=FALSE)
  temp <- as.data.frame(sapply(temp,gsub,
    pattern="\\\"",replacement="\\\\\"",perl=TRUE),stringsAsFactors=FALSE)

  # Make the names database-safe; if asked to do so in the call, also make
  # them lower case to avoid the need to quote variable names in SQL queries
  # when mixed case is present.
  names(value) <- names(temp) <- make.db.names(conn,names(value))
  if (fix.names) {
    names(value) <- names(temp) <- gsub("(.*)","\\L\\1",names(temp), perl=TRUE)
  }

  # Call the underlying RPostgreSQL function
  rs <- dbWriteTable(conn, name, value=temp, row.names = row.names,
              overwrite=overwrite, append=append)

  # Convert fields to appropriate types. This is necessary because the
  # gsub() calls above converted everything to characters. But the following
  # also retains Dates, which the basic function does not(?) do, and allows
  # character vectors to be converted to VARCHAR() in PostgreSQL rather than TEXT.
  # An alternative would seem to be to cast the elements of the data.frame
  # back to the appropriate types before sending to PostgreSQL.
  for (i in names(value) ) {
    # Text to VARCHAR (unless asked not to do so)
    if (is.character(value[,i]) && text.to.char) {
      sql <- paste("ALTER TABLE ",name," ALTER COLUMN \"",i,
                   "\" TYPE varchar(", 
                   max(nchar(encodeString(value[,i]))),")",sep="")
      dbGetQuery(conn,sql)

    } else if (is.integer(value[,i])) {
      # Integers
      sql <- paste("ALTER TABLE ",name," ALTER COLUMN \"",i,"\" TYPE bigint ",
        "USING CAST(\"",i,"\" AS bigint)",sep="")
      dbGetQuery(conn,sql)

    } else if (class(value[,i])=="Date") {
      # Dates. Needs to be done before conversion of doubles (below)
      sql <- paste("ALTER TABLE ",name," ALTER COLUMN \"",i,"\" TYPE date ",
        "USING to_date(\"",i,"\", 'YYYY-MM-DD')",sep="")
      dbGetQuery(conn,sql)

    } else if(is.double(value[,i])) {
      # Doubles
      sql <- paste("ALTER TABLE ",name," ALTER COLUMN \"",i,
        "\" TYPE double precision",
        " USING CAST(\"",i,"\" AS  double precision)",sep="")
      dbGetQuery(conn,sql)
    }  
  }  

  # Vacuum the table (a need arises because of changes to data types, I guess)
  dbGetQuery(conn,paste("VACUUM",name))

  # Return the value returned by dbWriteTable
  return(rs)
}
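
For reference, a usage sketch of the wrapper above; it assumes an open RPostgreSQL connection pg and the test.csv file from the first comment (those names come from the transcripts above, everything else is illustrative):

# Usage sketch for dbWriteTable.plus(); assumes an open connection `pg`
# and the test.csv file from the first comment.
test <- read.csv("~/Desktop/test.csv", stringsAsFactors = FALSE)
dbWriteTable.plus(pg, "test", test, overwrite = TRUE)
dbGetQuery(pg, "SELECT * FROM test")   # the backslashes should come back intact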

Original comment by iand...@gmail.com on 30 Apr 2011 at 3:25

Attachments:

GoogleCodeExporter commented 9 years ago
Here is an illustration of how dbWriteTable seems to depart from "expected behavior."

Idea: Pull data, dbWriteTable back to PostgreSQL, pull data again.

With RPostgreSQL r181, it doesn't always work: backslashes get messed up. A nice feature of r181 is that this failure is apparent. My write.csv/psql "COPY" kluge works, however (a rough sketch of the idea follows). Not sure what is happening to cause this.
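
The kluge itself is in an attachment; as a rough sketch of the idea only (the function name dbWriteTable.fast and all details below are assumptions, not the attached code), it amounts to dumping the data frame with write.csv() and bulk-loading it with psql's \copy, which uses CSV quoting rather than COPY's backslash escaping:

# Illustrative sketch only; the dbWriteTable.fast() used in the transcript
# below lives in an attachment and may well differ.
dbWriteTable.fast <- function(conn, name, value, dbname = "crsp") {
  fn <- tempfile()
  on.exit(unlink(fn), add = TRUE)
  write.csv(value, fn, row.names = FALSE, na = "")

  # Create the table with all columns as text (a simplification), then
  # bulk-load it with psql's \copy, which uses CSV quoting and so leaves
  # embedded backslashes alone.
  cols <- paste(sprintf('"%s" text', names(value)), collapse = ", ")
  dbGetQuery(conn, sprintf('CREATE TABLE "%s" (%s)', name, cols))
  cmd <- sprintf("psql %s -c \"\\copy %s FROM '%s' CSV HEADER\"", dbname, name, fn)
  system(cmd) == 0
}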

Output below (note that I wanted to be extra careful that I'd killed other versions of RPostgreSQL that I had installed).

> remove.packages("RPostgreSQL")
Removing package(s) from ‘/home/iangow/R/x86_64-pc-linux-gnu-library/2.13’
(as ‘lib’ is unspecified)
> libs <- as.data.frame(installed.packages())
> libs[libs$Package=="RPostgreSQL",]
 [1] Package   LibPath   Version   Priority  Depends   Imports   LinkingTo
 [8] Suggests  Enhances  OS_type   License   Built    
<0 rows> (or 0-length row.names)
> install.packages("~/Downloads/RPostgreSQL/", repos=NULL, lib=.Library.site[[1]])
* installing *source* package ‘RPostgreSQL’ ...
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables... 
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for pg_config... /usr/bin/pg_config
checking for "/usr/include/postgresql/libpq-fe.h"... yes
configure: creating ./config.status
config.status: creating src/Makevars
** libs
make: Nothing to be done for `all'.
installing to /usr/local/lib/R/site-library/RPostgreSQL/libs
** R
** inst
** preparing package for lazy loading
Creating a new generic function for "format" in "RPostgreSQL"
Creating a new generic function for "print" in "RPostgreSQL"
** help
*** installing help indices
** building package indices ...
** testing if installed package can be loaded

* DONE (RPostgreSQL)
> library(RPostgreSQL)
Loading required package: DBI
> drv <- dbDriver("PostgreSQL")
> pg <- dbConnect(drv, dbname = "crsp") 
> test.data <- dbGetQuery(pg,"SELECT DISTINCT company FROM executive.executive")
> dim(test.data)
[1] 5806    1
> dbWriteTable(pg,"test_data", test.data, overwrite=TRUE)
Error in postgresqlgetResult(new.con) : 
  RS-DBI driver: (could not Retrieve the result : ERROR:  extra data after last expected column
CONTEXT:  COPY test_data, line 166: "166    PERFORMANCE TECHNOLOGIES INC \DE\
167 NEWCOR INC"
)
> dbGetQuery(pg,"DROP TABLE test_data")
> source("/home/iangow/Dropbox/AGL/Code/R/dbWriteTable.fast.R")
> dbWriteTable.fast(pg,"test_data", test.data)
Password for user iangow: 
[1] TRUE
> test.data.2 <- dbGetQuery(pg,"SELECT * FROM test_data")
> test.data.2[166,]
[1] "PERFORMANCE TECHNOLOGIES INC \\DE\\"
> dim(test.data.2)
[1] 5806    1
> cat(test.data.2[166,])
PERFORMANCE TECHNOLOGIES INC \DE\> 

Original comment by iand...@gmail.com on 11 May 2011 at 3:54

GoogleCodeExporter commented 9 years ago
What data is in executive.executive?

Original comment by ne...@neiltiffin.com on 11 May 2011 at 12:43

GoogleCodeExporter commented 9 years ago
The problematic data is included in the attached file.

Original comment by iand...@gmail.com on 11 May 2011 at 1:18

Attachments:

GoogleCodeExporter commented 9 years ago
According to
http://www.postgresql.org/docs/8.1/static/sql-copy.html

\b  Backspace (ASCII 8)
\f  Form feed (ASCII 12)
\n  Newline (ASCII 10)
\r  Carriage return (ASCII 13)
\t  Tab (ASCII 9)
\v  Vertical tab (ASCII 11)
are to be escaped, in addition to the backslash (\\) itself.
At least, the need to escape \t and \n is apparent.
To achieve this, the current code structure has to be changed substantially; calling write.table() from R doesn't seem like a good approach.
Escaping backslashes is also not trivial because of encodings like CP932 or Shift_JIS, so perhaps all character data should be converted to UTF-8 first.
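
For illustration only, a minimal sketch of that kind of escaping for a character vector (an assumption about the approach, not the package's code; the helper name escape_copy_text is made up):

# Minimal sketch of escaping character data for COPY's text format; only the
# most common control characters are handled here.
escape_copy_text <- function(x) {
  x <- enc2utf8(x)                   # normalize the encoding before escaping
  x <- gsub("\\\\", "\\\\\\\\", x)   # escape backslashes first:  \  ->  \\
  x <- gsub("\t", "\\\\t", x)        # tab              ->  \t
  x <- gsub("\n", "\\\\n", x)        # newline          ->  \n
  x <- gsub("\r", "\\\\r", x)        # carriage return  ->  \r
  x
}

escape_copy_text("Slasher Inc. \\DE\\")   # each literal backslash is doubled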

Original comment by tomoa...@kenroku.kanazawa-u.ac.jp on 24 Aug 2011 at 12:21

GoogleCodeExporter commented 9 years ago
r182 does try to solve this issue. Since the difference is quite large, I hope those of you out there will do thorough tests and report any failures.

Original comment by tomoa...@kenroku.kanazawa-u.ac.jp on 26 Aug 2011 at 8:42

GoogleCodeExporter commented 9 years ago
Tomoaki:

I'm not sure I could say that I've done "thorough tests," but I've been using r182 without issue for a while now. It seems to have addressed the issues I was having with earlier versions.

Original comment by iand...@gmail.com on 25 Sep 2011 at 12:34

GoogleCodeExporter commented 9 years ago
Closed, thanks.

Original comment by tomoa...@kenroku.kanazawa-u.ac.jp on 10 Oct 2011 at 7:47