Note that the test data is:
Company_Name,Stock_Price
Big Co,23.21
Old Corp,33.16
Slasher Inc. \DE\,55.55
Last One,13.13
So read.csv() is adding extra backslashes.
Properly escaping and cleaning input data prior to importing is the
responsibility of the user.
The bug here seems to be that dbWriteTable() did not report an error (FALSE)
with the write. I've changed the Summary to reflect this.
Can someone verify whether this is a problem on Windows or Unix, or should I
assume that it only occurs on Mac?
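For reference, a minimal sketch (using the pg connection and test data frame that appear later in this thread): under the documented behaviour, the only way a caller can detect a failed write is the logical return value, which is why the silent failure matters.
ok <- dbWriteTable(pg, "test", test)
if (!isTRUE(ok)) stop("dbWriteTable() reported that the write failed")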
Original comment by ne...@neiltiffin.com
on 28 Apr 2011 at 3:15
I can't confirm that the problem occurs on a PC or Linux. But I agree that it's
read.csv, not RPostgreSQL; sorry for not checking. This data pulled through
read.csv()--and I'd guess read.table()--causes problems with both SQLite and
MySQL.
Thanks.
Original comment by iand...@gmail.com
on 28 Apr 2011 at 3:28
The silent failure appears to be fixed as of SVN r181, and maybe earlier; SVN
r173 claims to fix a similar issue. As of r181 this reports as:
> dbWriteTable(pg,"test",test)
Error in postgresqlgetResult(new.con) :
RS-DBI driver: (could not Retrieve the result : ERROR: missing data for column "Stock_Price"
CONTEXT: COPY test, line 3: "3 Slasher Inc. \DE\ 55.55"
)
Now whether it should fail or not is a different question that I will defer to
others. I've also changed the title again.
Original comment by ne...@neiltiffin.com
on 28 Apr 2011 at 3:47
I dabbled further with this and think that read.csv() may not be the problem.
While print() shows \\, cat() suggests only \.
> test <- read.csv("~/Downloads/test.csv",stringsAsFactors=FALSE)
> test
Company_Name Stock_Price
1 Big Co 23.21
2 Old Corp 33.16
3 Slasher Inc. \\DE\\ 55.55
4 Last One 13.13
> test[3,1]
[1] "Slasher Inc. \\DE\\"
> cat(test[3,1])
Slasher Inc. \DE\
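A quick cross-check (not from the original session; it simply counts the characters of the value shown above):
nchar(test[3,1])   # 17 if the backslashes are stored singly, 19 if read.csv() had really doubled them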
I commented out "on.exit(unlink(fn), add = TRUE)" in PostgreSQLSupport.R to see
the output being sent over to PostgreSQL. Here it is (basically what R was
given to start with):
1 Big Co 23.21
2 Old Corp 33.16
3 Slasher Inc. \DE\ 55.55
4 Last One 13.13
So only a single backslash is being sent over. Thus a user who wants to send
over data with embedded backslashes would need to "fix" these in R before doing
so, even though these seem fine within R. Below is my kluge, which makes a mess
of the data--turning it all into characters before fixing on the PostgreSQL
side. [Note that this also "fixes" embedded double-quotes, but I've confirmed
that this *is* an issue with getting my particular CSV files into R ... if
these are "escaped" as "" in the text file, then R gets " and sends to
PostgreSQL accordingly.]
dbWriteTable.plus <- function(conn, name, value, row.names = TRUE,
                              text.to.char = TRUE, fix.names = TRUE,
                              ..., overwrite = FALSE, append = FALSE) {
    # Fix backslashes and double-quotes
    temp <- as.data.frame(sapply(value, gsub,
                pattern = "\\\\", replacement = "\\\\\\\\", perl = TRUE),
                stringsAsFactors = FALSE)
    temp <- as.data.frame(sapply(temp, gsub,
                pattern = "\\\"", replacement = "\\\\\"", perl = TRUE),
                stringsAsFactors = FALSE)
    # If asked to do so in the call, make names lower case to avoid
    # need to quote variable names in SQL queries if mixed case is present.
    names(value) <- names(temp) <- make.db.names(conn, names(value))
    if (fix.names) {
        names(value) <- names(temp) <- gsub("(.*)", "\\L\\1", names(temp), perl = TRUE)
    }
    # Call the underlying RPostgreSQL function
    rs <- dbWriteTable(conn, name, value = temp, row.names = row.names,
                       overwrite = overwrite, append = append)
    # Convert fields to appropriate types. This is necessary because the
    # gsub() calls above converted everything to characters. But the following
    # also retains Dates, which the basic function does not(?) do, and allows
    # character vectors to be converted to VARCHAR() in PostgreSQL rather than TEXT.
    # An alternative would seem to be to cast the elements of the data.frame
    # back to the appropriate types before sending to PostgreSQL.
    for (i in names(value)) {
        # Text to VARCHAR (unless asked not to do so)
        if (is.character(value[, i]) && text.to.char) {
            sql <- paste("ALTER TABLE ", name, " ALTER COLUMN \"", i,
                         "\" TYPE varchar(",
                         max(nchar(encodeString(value[, i]))), ")", sep = "")
            dbGetQuery(conn, sql)
        } else if (is.integer(value[, i])) {
            # Integers
            sql <- paste("ALTER TABLE ", name, " ALTER COLUMN \"", i, "\" TYPE bigint ",
                         "USING CAST(\"", i, "\" AS bigint)", sep = "")
            dbGetQuery(conn, sql)
        } else if (class(value[, i]) == "Date") {
            # Dates. Needs to be done before conversion of doubles (below)
            sql <- paste("ALTER TABLE ", name, " ALTER COLUMN \"", i, "\" TYPE date ",
                         "USING to_date(\"", i, "\", 'YYYY-MM-DD')", sep = "")
            dbGetQuery(conn, sql)
        } else if (is.double(value[, i])) {
            # Doubles
            sql <- paste("ALTER TABLE ", name, " ALTER COLUMN \"", i,
                         "\" TYPE double precision",
                         " USING CAST(\"", i, "\" AS double precision)", sep = "")
            dbGetQuery(conn, sql)
        }
    }
    # Vacuum the table (a need arises because of changes to data types, I guess)
    dbGetQuery(conn, paste("VACUUM", name))
    # Return the value returned by dbWriteTable
    return(rs)
}
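A usage sketch (illustrative only; the source() path is just wherever the function above is saved, and pg/test are the connection and data frame from earlier in this thread):
source("dbWriteTable.plus.R")
rs <- dbWriteTable.plus(pg, "test", test, overwrite = TRUE)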
Original comment by iand...@gmail.com
on 30 Apr 2011 at 3:25
Attachments:
Here is an illustration of how dbWriteTable seems to depart from "expected
behavior."
Idea: Pull data, dbWriteTable back to PostgreSQL, pull data again.
With RPostgreSQL r181, it doesn't always work: backslashes get messed up. A
nice feature of r181 is that this failure is at least apparent. My
write.csv/psql "COPY" kluge works, however. I'm not sure what is happening to
cause this.
Output below (note that I wanted to be extra careful that I'd killed other
versions of RPostgreSQL that I had installed).
> remove.packages("RPostgreSQL")
Removing package(s) from ‘/home/iangow/R/x86_64-pc-linux-gnu-library/2.13’
(as ‘lib’ is unspecified)
> libs <- as.data.frame(installed.packages())
> libs[libs$Package=="RPostgreSQL",]
[1] Package LibPath Version Priority Depends Imports LinkingTo
[8] Suggests Enhances OS_type License Built
<0 rows> (or 0-length row.names)
> install.packages("~/Downloads/RPostgreSQL/", repos=NULL,
lib=.Library.site[[1]])
* installing *source* package ‘RPostgreSQL’ ...
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for pg_config... /usr/bin/pg_config
checking for "/usr/include/postgresql/libpq-fe.h"... yes
configure: creating ./config.status
config.status: creating src/Makevars
** libs
make: Nothing to be done for `all'.
installing to /usr/local/lib/R/site-library/RPostgreSQL/libs
** R
** inst
** preparing package for lazy loading
Creating a new generic function for "format" in "RPostgreSQL"
Creating a new generic function for "print" in "RPostgreSQL"
** help
*** installing help indices
** building package indices ...
** testing if installed package can be loaded
* DONE (RPostgreSQL)
> library(RPostgreSQL)
Loading required package: DBI
> drv <- dbDriver("PostgreSQL")
> pg <- dbConnect(drv, dbname = "crsp")
> test.data <- dbGetQuery(pg,"SELECT DISTINCT company FROM
executive.executive")
> dim(test.data)
[1] 5806 1
> dbWriteTable(pg,"test_data", test.data, overwrite=TRUE)
Error in postgresqlgetResult(new.con) :
RS-DBI driver: (could not Retrieve the result : ERROR: extra data after last expected column
CONTEXT: COPY test_data, line 166: "166 PERFORMANCE TECHNOLOGIES INC \DE\
167 NEWCOR INC"
)
> dbGetQuery(pg,"DROP TABLE test_data")
> source("/home/iangow/Dropbox/AGL/Code/R/dbWriteTable.fast.R")
> dbWriteTable.fast(pg,"test_data", test.data)
Password for user iangow:
[1] TRUE
> test.data.2 <- dbGetQuery(pg,"SELECT * FROM test_data")
> test.data.2[166,]
[1] "PERFORMANCE TECHNOLOGIES INC \\DE\\"
> dim(test.data.2)
[1] 5806 1
> cat(test.data.2[166,])
PERFORMANCE TECHNOLOGIES INC \DE\>
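The dbWriteTable.fast() sourced above isn't reproduced here. As a rough sketch of the write.csv/psql workaround it stands in for (the target table is assumed to exist already, and psql must be able to reach the crsp database), something along these lines avoids COPY's backslash handling by loading in CSV mode:
fn <- tempfile()
write.csv(test.data, fn, row.names = FALSE)
system(paste("psql crsp -c \"\\copy test_data FROM '", fn,
             "' WITH CSV HEADER\"", sep = ""))
unlink(fn)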
Original comment by iand...@gmail.com
on 11 May 2011 at 3:54
What data is in executive.executive?
Original comment by ne...@neiltiffin.com
on 11 May 2011 at 12:43
Problematic data included in the attached.
Original comment by iand...@gmail.com
on 11 May 2011 at 1:18
Attachments:
According to
http://www.postgresql.org/docs/8.1/static/sql-copy.html
\b Backspace (ASCII 8)
\f Form feed (ASCII 12)
\n Newline (ASCII 10)
\r Carriage return (ASCII 13)
\t Tab (ASCII 9)
\v Vertical tab (ASCII 11)
are to be escaped, in addition to the backslash (\\) itself.
At least the need to escape \t and \n is apparent.
To achieve this, the current code structure has to be changed substantially;
calling write.table() from R doesn't seem like a good approach.
Escaping backslashes is also not trivial with encodings like CP932 or
Shift_JIS, so perhaps everything should be converted to UTF-8 first.
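For illustration only (this is not the code adopted in r182, and the helper name is made up): an escaping pass along these lines, applied to each character column before the COPY text stream is written, would cover the characters listed above.
# Hypothetical helper: escape a character vector for COPY's text format.
# The backslash must be doubled first, since the other replacements add backslashes.
escape_for_copy <- function(x) {
    x <- gsub("\\\\", "\\\\\\\\", x)   # \  -> \\
    x <- gsub("\b", "\\\\b", x)        # backspace       -> \b
    x <- gsub("\f", "\\\\f", x)        # form feed       -> \f
    x <- gsub("\n", "\\\\n", x)        # newline         -> \n
    x <- gsub("\r", "\\\\r", x)        # carriage return -> \r
    x <- gsub("\t", "\\\\t", x)        # tab             -> \t
    x <- gsub("\v", "\\\\v", x)        # vertical tab    -> \v
    x
}
Encoding issues like CP932/Shift_JIS would still need separate handling, e.g. converting to UTF-8 before escaping, as suggested above.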
Original comment by tomoa...@kenroku.kanazawa-u.ac.jp
on 24 Aug 2011 at 12:21
r182 does try to solve this issue. Since the changes are quite large, I hope
those of you following this issue will run thorough tests and report any failures.
Original comment by tomoa...@kenroku.kanazawa-u.ac.jp
on 26 Aug 2011 at 8:42
Tomoaki:
I'm not sure I could say that I've done "thorough tests," but I've been using
r182 without issue for a while now. It seems to have addressed the issues I was
having with earlier versions.
Original comment by iand...@gmail.com
on 25 Sep 2011 at 12:34
Closed, thanks.
Original comment by tomoa...@kenroku.kanazawa-u.ac.jp
on 10 Oct 2011 at 7:47
Original issue reported on code.google.com by
iand...@gmail.com
on 26 Apr 2011 at 2:57
Attachments: