Open dhersz opened 3 years ago
Using data.table::fwrite(qmethod = "escape") didn't help.
library(gtfsio)
path <- system.file("extdata/ggl_gtfs.zip", package = "gtfsio")
gtfs <- import_gtfs(path, files = "routes")
gtfs
#> $routes
#> route_id route_short_name route_long_name
#> 1: A 17 Mission
#> route_desc route_type
#> 1: The ""A"" route travels from lower Mission to Downtown. 3
tmp <- tempfile(fileext = ".zip")
export_gtfs(gtfs, tmp)
gtfs <- import_gtfs(tmp)
#> Warning in data.table::fread(file.path(tmpdir, file_txt), nrows = 1):
#> Found and resolved improper quoting out-of-sample. First healed line
#> 2: <<A,17,Mission,"The \"\"A\"\" route travels from lower Mission to
#> Downtown.",3>>. If the fields are not quoted (e.g. field separator does not
#> appear within any field), try quote="" to avoid this warning.
gtfs
#> $routes
#> route_id route_short_name route_long_name
#> 1: A 17 Mission
#> route_desc route_type
#> 1: The \\"\\"A\\"\\" route travels from lower Mission to Downtown. 3
export_gtfs(gtfs, tmp)
gtfs <- import_gtfs(tmp)
#> Warning in data.table::fread(file.path(tmpdir, file_txt), nrows = 1):
#> Found and resolved improper quoting out-of-sample. First healed line 2:
#> <<A,17,Mission,"The \\\"\\\"A\\\"\\\" route travels from lower Mission to
#> Downtown.",3>>. If the fields are not quoted (e.g. field separator does not
#> appear within any field), try quote="" to avoid this warning.
gtfs
#> $routes
#> route_id route_short_name route_long_name
#> 1: A 17 Mission
#> route_desc
#> 1: The \\\\\\"\\\\\\"A\\\\\\"\\\\\\" route travels from lower Mission to Downtown.
#> route_type
#> 1: 3
This is how routes.txt is specified in the Google Example Feed:
route_id,route_short_name,route_long_name,route_desc,route_type
A,17,Mission,"The ""A"" route travels from lower Mission to Downtown.",3
Reproducing this issue with pure data.table:
library(data.table)
tmp <- tempfile(fileext = ".csv")
writeLines("col,col2\n\"hi \"\"my friend\"\"\",2", tmp)
dt <- fread(tmp)
dt
#> col col2
#> 1: hi ""my friend"" 2
fwrite(dt, tmp)
new_dt <- fread(tmp)
new_dt
#> col col2
#> 1: hi """"my friend"""" 2
fwrite(new_dt, tmp)
newer_dt <- fread(tmp)
newer_dt
#> col col2
#> 1: hi """"""""my friend"""""""" 2
With only a single (undoubled) embedded double quote, fread() first recovers from the improper quoting with a warning, but the doubling problem then persists on every later round trip:
library(data.table)
tmp <- tempfile(fileext = ".csv")
writeLines("col,col2\n\"hi \"my friend\"\",2", tmp)
dt <- fread(tmp)
#> Warning in fread(tmp): Found and resolved improper quoting in first 100 rows.
#> If the fields are not quoted (e.g. field separator does not appear within any
#> field), try quote="" to avoid this warning.
dt
#> col col2
#> 1: hi "my friend" 2
fwrite(dt, tmp)
new_dt <- fread(tmp)
new_dt
#> col col2
#> 1: hi ""my friend"" 2
fwrite(new_dt, tmp)
newer_dt <- fread(tmp)
newer_dt
#> col col2
#> 1: hi """"my friend"""" 2
This seems to be a data.table bug.
data.table::fwrite() includes the qmethod argument to deal with embedded double quotes. By default ("double") each embedded double quote is doubled with another one (quoting the function documentation).
So a doubled double quote should be read back as a single double quote: a field containing "" should be parsed simply as ". But by default fread() doesn't do that. As the examples above show, it keeps the doubled quotes as they are (lol). And when you write the data again with fwrite() they get doubled yet again, which results in this error.
That's how I interpret this issue. I'll try to file an issue in the data.table repo soon.
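To make the interpretation above concrete, here is a toy model of the "double" quoting convention in base R. It is only a sketch: csv_quote(), csv_unquote() and strip_only() are hypothetical helpers written for illustration, not data.table functions, and strip_only() merely mimics the behaviour reported in this thread (outer quotes removed, inner doubling left in place).

```r
# Toy model of the "double" quoting convention (base R only, not data.table code).

# Writer: double every embedded quote, then wrap the field in quotes.
csv_quote <- function(x) paste0('"', gsub('"', '""', x, fixed = TRUE), '"')

# Reader that decodes: drop the outer quotes, then collapse "" back to ".
csv_unquote <- function(x) {
  inner <- substr(x, 2, nchar(x) - 1)
  gsub('""', '"', inner, fixed = TRUE)
}

# Reader that mimics the reported behaviour: drop the outer quotes only.
strip_only <- function(x) substr(x, 2, nchar(x) - 1)

value <- 'hi "my friend"'

ok    <- csv_unquote(csv_quote(value))  # decoding reader: value survives
once  <- strip_only(csv_quote(value))   # first round trip without decoding
twice <- strip_only(csv_quote(once))    # second round trip without decoding

print(ok)
print(once)
print(twice)
```

With a decoding reader the write/read pair is an identity. Without the decoding step, the number of quotes doubles on every round trip, which is the pattern seen in the reprex output.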
Definitely a bug in data.table:
library(data.table)
tmp <- tempfile(fileext = ".csv")
writeLines("col,col2\n\"hi \"\"my friend\"\"\",2", tmp)
dt <- fread(tmp)
dt
#> col col2
#> 1: hi ""my friend"" 2
df <- read.csv(tmp)
df
#> col col2
#> 1 hi "my friend" 2
tbl <- readr::read_csv(tmp)
#>
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#> col = col_character(),
#> col2 = col_double()
#> )
tbl
#> # A tibble: 1 x 2
#> col col2
#> <chr> <dbl>
#> 1 "hi \"my friend\"" 2
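For comparison, a minimal sketch of the same round trip using only base R. It assumes nothing beyond utils::read.csv() and utils::write.csv(), whose default for writing is also to double embedded quotes. Since read.csv() decodes the doubled quotes (as in the output above), the value should stay stable across repeated write/read cycles.

```r
# Round trip with base R only: read.csv() decodes "" to ", write.csv() re-encodes it.
tmp <- tempfile(fileext = ".csv")
writeLines("col,col2\n\"hi \"\"my friend\"\"\",2", tmp)

df <- read.csv(tmp, stringsAsFactors = FALSE)

# Write and read again; the embedded quotes should not multiply.
write.csv(df, tmp, row.names = FALSE)
df2 <- read.csv(tmp, stringsAsFactors = FALSE)

print(df$col)
print(df2$col)
```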
Issue filed: https://github.com/Rdatatable/data.table/issues/5088 (duplicate of https://github.com/Rdatatable/data.table/issues/4779).
Weird behaviour.
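Until this is addressed upstream, one possible stopgap is to collapse the doubled quotes after reading. The sketch below is an assumption-laden workaround, not something proposed in the data.table issues: undouble_quotes() is a hypothetical helper, and it is only safe when the reader really did leave the doubling in place. If the reader already decodes quotes, it would wrongly collapse legitimate consecutive quotes. The input data frame is built by hand to mirror the fread() result shown earlier in this thread.

```r
# Hypothetical helper: collapse "" to " in every character column of a data frame.
undouble_quotes <- function(x) {
  for (nm in names(x)) {
    if (is.character(x[[nm]])) {
      x[[nm]] <- gsub('""', '"', x[[nm]], fixed = TRUE)
    }
  }
  x
}

# Hand-built stand-in for what fread() returned in the reprex above.
as_read <- data.frame(
  col = 'hi ""my friend""',
  col2 = 2L,
  stringsAsFactors = FALSE
)

fixed <- undouble_quotes(as_read)
print(fixed$col)
```

Applying this once per read, before the next write, would keep the quotes from accumulating. It removes exactly one level of doubling per call.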