Open blaidd4drwg opened 2 years ago
NB: I dug through some readr code and it seems to me that read_csv
and spec_csv
are not using the same code-base: spec_csv
uses the function col_spec_standardise
where duplicated column names are dealt with:
if (anyDuplicated(col_names)) {
dups <- duplicated(col_names)
old_names <- col_names
col_names <- make.unique(col_names, sep = "_")
warning(
"Duplicated column names deduplicated: ",
paste0(
encodeString(old_names[dups], quote = "'"),
" => ",
encodeString(col_names[dups], quote = "'"),
" [", which(dups), "]",
collapse = ", "
),
call. = FALSE
)
}
And read_csv
is essentially a wrapper for vroom::vroom
which then deals with column names (in the C++ code part I think?).
It seems to me that the code to make unique names for duplicated column names is not correct and should follow the same pattern as vroom/vroom_ or vice versa.
Edit:
read_csv uses vctrs::vec_as_names
to create syntactically valid names (in col_types_standardise
). Thus spec_csv
should use the same call.
Sometimes I know the duplicated columns exisited, but I just wanna ignore it. However, with the current defualt name_repair, looking like "num...1", I cannot ignore it, I have to deal with it.
Note to self:
Right now spec_csv()
is going through edition 1 read_csv()
which ends up in col_spec_standardise
and that has it's own custom name repair machinery https://github.com/tidyverse/readr/blob/85cf1e8e664e2a75eba393f91d89b39060140dab/R/col_types.R#L441-L458
The fact that we force spec_csv()
to edition 1 is due to: https://github.com/tidyverse/readr/commit/c88303c8f62478d752ec21381feb80bccebb619d
We don't see a way to allow spec_csv()
to use edition 2 because vroom doesn't support guessing without parsing the data (as the above commit mentions).
value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")
# note that edition 1 actually guesses column types
readr::with_edition(1, readr::read_csv(value, n_max = 0, guess_max = 1000))
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> # A tibble: 0 × 6
#> # … with 6 variables: a <dbl>, error <chr>, b <dbl>, error_1 <chr>, c <dbl>,
#> # error_2 <chr>
#> # ℹ Use `colnames()` to see all variable names
# while edition 2 returns all characters
readr::read_csv(value, n_max = 0, guess_max = 1000)
#> New names:
#> Rows: 0 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (6): a, error...2, b, error...4, c, error...6
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> # A tibble: 0 × 6
#> # … with 6 variables: a <chr>, error...2 <chr>, b <chr>, error...4 <chr>,
#> # c <chr>, error...6 <chr>
#> # ℹ Use `colnames()` to see all variable names
Created on 2022-08-23 by the reprex package (v2.0.1.9000)
There are three options for moving forward:
name_repair
argument to all instances where read_delimited()
is called. This would require a rather large change to the code base but it would mean the name_repair
argument would now be respected for edition 1 read_csv()
(and friends).name_repair = unique
and continue to not respect name_repair
. It would make spec_csv()
and read_csv()
perform the same for default name repair and we wouldn't need to change a large part of the code base. We currently don't respect name_repair
in edition 1, so continuing this behavior might not be a big deal.name_repair
still not being respected. We also don't feel like name repair for duplicate columns names is a safe or advised maneuver to rely on when copying a spec from spec_csv()
for use in read_csv()
. It seems like a stopgap readr 2e / vroom version of spec_csv()
could just call read_csv()
and then return the spec from that result.
(BTW various things seem wonky with this printed output)
library(readr)
value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")
dat <- read_csv(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
spec(dat)
#> cols(
#> a = col_double(),
#> error...2 = col_character(),
#> b = col_double(),
#> error...4 = col_character(),
#> c = col_double(),
#> error...6 = col_character()
#> )
readr 2e / vroom column names: a, error...2, b, error...4, c, error...6
What happens if we just ask for the spec from the same input?
spec_csv(value)
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> cols(
#> a = col_double(),
#> error = col_character(),
#> b = col_double(),
#> error_1 = col_character(),
#> c = col_double(),
#> error_2 = col_character()
#> )
Problem: we get readr 1e-style column names: a, error, b, error_1, c, error_2
It does seem reasonable to expect spec_csv()
to return the same spec as spec(read_csv())
, at least with respect to column names.
A simple fix would be to call read_csv()
on the input with n_max = guess_max
and then return that spec.
spec_csv_vroom <- function(..., guess_max = 1000) {
tmp <- readr::read_csv(..., n_max = guess_max, guess_max = guess_max)
spec(tmp)
}
spec_csv_vroom(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> cols(
#> a = col_double(),
#> error...2 = col_character(),
#> b = col_double(),
#> error...4 = col_character(),
#> c = col_double(),
#> error...6 = col_character()
#> )
Is there a reason not to do something along these lines?
Is there a reason not to do something along these lines?
Performance? I guess? i.e. this commit from Jim talks about vroom not supporting guessing without parsing https://github.com/tidyverse/readr/commit/c88303c8f62478d752ec21381feb80bccebb619d
But since spec_csv()
and friends are probably relatively rarely used, the benefits of updating them to 2e probably greatly outweigh performance costs - especially since the long term goal is to remove 1e code
Is there a reason not to do something along these lines?
Same as @DavisVaughan, it would solve the problem especially in the long term for removing readr edition 1 code, but having spec_csv()
be just a wrapper for spec(read_csv(n_max = guess_max))
seems odd
Yeah the only thing I could think of was performance (or elegance). But I think returning the wrong column names is a much bigger sin that the downsides of a simple spec_*()
implementation.
"readr doesn't have to parse to guess col types but vroom does" seems a bit misleading. readr is definitely consulting (tokenizing and typing) up to guess_max
rows of data, in order to guess the column types. This particular point doesn't really resonate with me. I suppose if we were truly only column type guessing, we wouldn't load the data atoms into a data frame, but this seems like an awfully subtle thing to be worried about. Does anyone really "get" this point?
having
spec_csv()
be just a wrapper forspec(read_csv(n_max = guess_max))
seems odd
How so? It seems like the point of spec_csv()
is to return the spec associated with the data frame you'd get if you read that file.
I know that spec_csv()
is basically already a wrapper for ed1 spec(read_csv(n_max = 0, guess_max = guess_max))
so I guess it's not the "wrapper" bit that feels odd. Maybe I'm just blocked by what I think spec_csv()
should be doing (ie guessing column types but not reading anything).
guessing column types but not reading anything
But you can't guess column types without reading from the file.
When trying to create column specifications of a file that contains duplicated variable names, spec_csv() renames the variables differently (e.g "error_1") than read_csv() ("error...1").
Let test.csv be a simple CSV with duplicated variables "error" and one observation: