r-transit / gtfsio

Read and Write General Transit Feed Specification (GTFS)
https://r-transit.github.io/gtfsio/

Double quotes inside a string getting duplicated when exporting GTFS object #21

Open dhersz opened 3 years ago

dhersz commented 3 years ago

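Weird behaviour: double quotes embedded in a string field get doubled every time a GTFS object is exported and then imported again.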

library(gtfsio)

path <- system.file("extdata/ggl_gtfs.zip", package = "gtfsio")

gtfs <- import_gtfs(path, files = "routes")
gtfs
#> $routes
#>    route_id route_short_name route_long_name
#> 1:        A               17         Mission
#>                                                 route_desc route_type
#> 1: The ""A"" route travels from lower Mission to Downtown.          3

tmp <- tempfile(fileext = ".zip")
export_gtfs(gtfs, tmp)

gtfs <- import_gtfs(tmp)
gtfs
#> $routes
#>    route_id route_short_name route_long_name
#> 1:        A               17         Mission
#>                                                     route_desc route_type
#> 1: The """"A"""" route travels from lower Mission to Downtown.          3

export_gtfs(gtfs, tmp)
gtfs <- import_gtfs(tmp)
gtfs
#> $routes
#>    route_id route_short_name route_long_name
#> 1:        A               17         Mission
#>                                                             route_desc
#> 1: The """"""""A"""""""" route travels from lower Mission to Downtown.
#>    route_type
#> 1:          3

dhersz commented 3 years ago

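Using data.table::fwrite(qmethod = "escape") when exporting didn't help (qmethod is an argument of fwrite(), not of fread()). The quotes are now backslash-escaped in the file, and the backslashes pile up on every cycle instead.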

library(gtfsio)

path <- system.file("extdata/ggl_gtfs.zip", package = "gtfsio")

gtfs <- import_gtfs(path, files = "routes")
gtfs
#> $routes
#>    route_id route_short_name route_long_name
#> 1:        A               17         Mission
#>                                                 route_desc route_type
#> 1: The ""A"" route travels from lower Mission to Downtown.          3

tmp <- tempfile(fileext = ".zip")
export_gtfs(gtfs, tmp)

gtfs <- import_gtfs(tmp)
#> Warning in data.table::fread(file.path(tmpdir, file_txt), nrows = 1):
#> Found and resolved improper quoting out-of-sample. First healed line
#> 2: <<A,17,Mission,"The \"\"A\"\" route travels from lower Mission to
#> Downtown.",3>>. If the fields are not quoted (e.g. field separator does not
#> appear within any field), try quote="" to avoid this warning.
gtfs
#> $routes
#>    route_id route_short_name route_long_name
#> 1:        A               17         Mission
#>                                                         route_desc route_type
#> 1: The \\"\\"A\\"\\" route travels from lower Mission to Downtown.          3

export_gtfs(gtfs, tmp)
gtfs <- import_gtfs(tmp)
#> Warning in data.table::fread(file.path(tmpdir, file_txt), nrows = 1):
#> Found and resolved improper quoting out-of-sample. First healed line 2:
#> <<A,17,Mission,"The \\\"\\\"A\\\"\\\" route travels from lower Mission to
#> Downtown.",3>>. If the fields are not quoted (e.g. field separator does not
#> appear within any field), try quote="" to avoid this warning.
gtfs
#> $routes
#>    route_id route_short_name route_long_name
#> 1:        A               17         Mission
#>                                                                         route_desc
#> 1: The \\\\\\"\\\\\\"A\\\\\\"\\\\\\" route travels from lower Mission to Downtown.
#>    route_type
#> 1:          3

dhersz commented 3 years ago

This is how routes.txt is specified in the Google example feed:

route_id,route_short_name,route_long_name,route_desc,route_type
A,17,Mission,"The ""A"" route travels from lower Mission to Downtown.",3

dhersz commented 3 years ago

Reproducing this issue with pure data.table:

library(data.table)

tmp <- tempfile(fileext = ".csv")
writeLines("col,col2\n\"hi \"\"my friend\"\"\",2", tmp)

dt <- fread(tmp)
dt
#>                 col col2
#> 1: hi ""my friend""    2

fwrite(dt, tmp)
new_dt <- fread(tmp)
new_dt
#>                     col col2
#> 1: hi """"my friend""""    2

fwrite(new_dt, tmp)
newer_dt <- fread(tmp)
newer_dt
#>                             col col2
#> 1: hi """"""""my friend""""""""    2

dhersz commented 3 years ago

With a single, unescaped double quote, fread() first recovers from the improper quoting (with a warning), but the doubling still happens on every later write/read cycle:

library(data.table)

tmp <- tempfile(fileext = ".csv")
writeLines("col,col2\n\"hi \"my friend\"\",2", tmp)

dt <- fread(tmp)
#> Warning in fread(tmp): Found and resolved improper quoting in first 100 rows.
#> If the fields are not quoted (e.g. field separator does not appear within any
#> field), try quote="" to avoid this warning.
dt
#>               col col2
#> 1: hi "my friend"    2

fwrite(dt, tmp)
new_dt <- fread(tmp)
new_dt
#>                 col col2
#> 1: hi ""my friend""    2

fwrite(new_dt, tmp)
newer_dt <- fread(tmp)
newer_dt
#>                     col col2
#> 1: hi """"my friend""""    2

dhersz commented 3 years ago

This seems to be a data.table bug.

data.table::fwrite() has a qmethod argument to deal with embedded double quotes. With the default ("double"), each embedded double quote is doubled with another one (quoting the function documentation).

So a doubled double quote should be read back as a single one (this got confusing, but I mean that a field containing "" should be parsed simply as "). By default, though, fread() doesn't do that: as the examples above show, it keeps the doubled quotes as they are (lol). When the table is then written with fwrite(), each of those quotes gets doubled yet again, which is why the number of quotes grows on every export/import cycle.

That's how I interpret this issue. I'll try to file an issue in the data.table repo soon.
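
If that interpretation is right, a possible stopgap is to collapse the doubled quotes right after reading. Below is a minimal sketch of that idea. undouble_quotes() is a hypothetical helper made up for this example, it is not part of data.table or gtfsio. It assumes that any "" found in a character column is an artefact of the CSV escaping, so it would also collapse two quotes that were legitimately adjacent in the data.

```r
library(data.table)

# Hypothetical helper (not part of data.table or gtfsio): collapse each pair of
# doubled double quotes ("") into a single one (") in every character column.
# The table is modified by reference and also returned.
undouble_quotes <- function(dt) {
  char_cols <- names(dt)[vapply(dt, is.character, logical(1))]
  for (col in char_cols) {
    set(dt, j = col, value = gsub('""', '"', dt[[col]], fixed = TRUE))
  }
  dt
}

tmp <- tempfile(fileext = ".csv")
writeLines("col,col2\n\"hi \"\"my friend\"\"\",2", tmp)

# Read, clean, write and read again: the field should no longer grow.
dt <- undouble_quotes(fread(tmp))
fwrite(dt, tmp)
new_dt <- undouble_quotes(fread(tmp))

identical(dt$col, new_dt$col)
```

With this in place the value stays the same across write/read cycles, because fwrite() doubles the quotes on the way out and the helper halves them on the way back in.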

dhersz commented 3 years ago

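Definitely a bug in data.table. Base R's read.csv() and readr::read_csv() both parse the doubled quotes in the same file as single quotes, while fread() keeps them doubled: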

library(data.table)

tmp <- tempfile(fileext = ".csv")
writeLines("col,col2\n\"hi \"\"my friend\"\"\",2", tmp)

dt <- fread(tmp)
dt
#>                 col col2
#> 1: hi ""my friend""    2

df <- read.csv(tmp)
df
#>              col col2
#> 1 hi "my friend"    2

tbl <- readr::read_csv(tmp)
#> 
#> ── Column specification ────────────────────────────────────────────────────────
#> cols(
#>   col = col_character(),
#>   col2 = col_double()
#> )
tbl
#> # A tibble: 1 x 2
#>   col                 col2
#>   <chr>              <dbl>
#> 1 "hi \"my friend\""     2
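
For comparison, here is a small sketch of the same write/read cycle using only base R, to check whether the value survives the round trip there. It relies on write.csv() escaping embedded quotes by doubling them (its qmethod is "double") and on read.csv() undoing that when parsing. stringsAsFactors = FALSE is set explicitly so the column is a character vector on older R versions too.

```r
tmp <- tempfile(fileext = ".csv")
writeLines("col,col2\n\"hi \"\"my friend\"\"\",2", tmp)

# First read: the doubled quotes in the file are parsed as single quotes.
df <- read.csv(tmp, stringsAsFactors = FALSE)

# Write the table back (embedded quotes are doubled again) and re-read it.
write.csv(df, tmp, row.names = FALSE)
new_df <- read.csv(tmp, stringsAsFactors = FALSE)

# TRUE if the field is unchanged after one full write/read cycle.
identical(df$col, new_df$col)
```

If fread() followed the same convention when reading, the fwrite()/fread() cycle would be stable in the same way.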

dhersz commented 3 years ago

Issue filed: https://github.com/Rdatatable/data.table/issues/5088 (duplicate of https://github.com/Rdatatable/data.table/issues/4779).