tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org
Other
1.01k stars 286 forks source link

Name repair for duplicated columns inconsistent between `read_csv` and `spec_csv` #1387

Open blaidd4drwg opened 2 years ago

blaidd4drwg commented 2 years ago

When trying to create column specifications of a file that contains duplicated variable names, spec_csv() renames the variables differently (e.g "error_1") than read_csv() ("error...1").

Let test.csv be a simple CSV with duplicated variables "error" and one observation:

echo "a,error,b,error,c,error\n1,string1,2,string2,3,string3" >> test.csv
readr::spec_csv("test.csv")
cols(
  a = col_double(),
  error = col_character(),
  b = col_double(),
  error_1 = col_character(),
  c = col_double(),
  error_2 = col_character()
)

Warning message:
Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error' => 'error_2' [6] 
readr::read_csv("test.csv")
New names:                                                                                                                                                                                                  
* error -> error...2
* error -> error...4
* error -> error...6
Rows: 1 Columns: 6
── Column specification ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (3): error...2, error...4, error...6
dbl (3): a, b, c

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 6
      a error...2     b error...4     c error...6
  <dbl> <chr>     <dbl> <chr>     <dbl> <chr>    
1     1 string1       2 string2       3 string3
blaidd4drwg commented 2 years ago

NB: I dug through some readr code and it seems to me that read_csv and spec_csv are not using the same code-base: spec_csv uses the function col_spec_standardise where duplicated column names are dealt with:

  if (anyDuplicated(col_names)) {
    dups <- duplicated(col_names)

    old_names <- col_names
    col_names <- make.unique(col_names, sep = "_")

    warning(
      "Duplicated column names deduplicated: ",
      paste0(
        encodeString(old_names[dups], quote = "'"),
        " => ",
        encodeString(col_names[dups], quote = "'"),
        " [", which(dups), "]",
        collapse = ", "
      ),
      call. = FALSE
    )
  }

And read_csv is essentially a wrapper for vroom::vroom which then deals with column names (in the C++ code part I think?).

It seems to me that the code to make unique names for duplicated column names is not correct and should follow the same pattern as vroom/vroom_ or vice versa.

Edit:

read_csv uses vctrs::vec_as_names to create syntactically valid names (in col_types_standardise). Thus spec_csv should use the same call.

vieruodu commented 2 years ago

Sometimes I know the duplicated columns exisited, but I just wanna ignore it. However, with the current defualt name_repair, looking like "num...1", I cannot ignore it, I have to deal with it.

sbearrows commented 2 years ago

Note to self:

Right now spec_csv() is going through edition 1 read_csv() which ends up in col_spec_standardise and that has it's own custom name repair machinery https://github.com/tidyverse/readr/blob/85cf1e8e664e2a75eba393f91d89b39060140dab/R/col_types.R#L441-L458

The fact that we force spec_csv() to edition 1 is due to: https://github.com/tidyverse/readr/commit/c88303c8f62478d752ec21381feb80bccebb619d We don't see a way to allow spec_csv() to use edition 2 because vroom doesn't support guessing without parsing the data (as the above commit mentions).

value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")

# note that edition 1 actually guesses column types
readr::with_edition(1, readr::read_csv(value, n_max = 0, guess_max = 1000))
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> # A tibble: 0 × 6
#> # … with 6 variables: a <dbl>, error <chr>, b <dbl>, error_1 <chr>, c <dbl>,
#> #   error_2 <chr>
#> # ℹ Use `colnames()` to see all variable names

# while edition 2 returns all characters
readr::read_csv(value, n_max = 0, guess_max = 1000)
#> New names:
#> Rows: 0 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (6): a, error...2, b, error...4, c, error...6
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> # A tibble: 0 × 6
#> # … with 6 variables: a <chr>, error...2 <chr>, b <chr>, error...4 <chr>,
#> #   c <chr>, error...6 <chr>
#> # ℹ Use `colnames()` to see all variable names

Created on 2022-08-23 by the reprex package (v2.0.1.9000)

There are three options for moving forward:

  1. Bring over the name repair machinery from vroom a la https://github.com/tidyverse/vroom/blob/339073c4fc5cfdcbeb29ff716d41c2aaaf6bad43/R/col_types.R#L389 and add the name_repair argument to all instances where read_delimited() is called. This would require a rather large change to the code base but it would mean the name_repair argument would now be respected for edition 1 read_csv() (and friends).
  2. Bring over the name repair machinery from vroom as above, but force name_repair = unique and continue to not respect name_repair. It would make spec_csv() and read_csv() perform the same for default name repair and we wouldn't need to change a large part of the code base. We currently don't respect name_repair in edition 1, so continuing this behavior might not be a big deal.
  3. Do nothing because the (potentially large) changes would occur in edition 1, which is feature frozen or the changes wouldn't be a perfect solution due to name_repair still not being respected. We also don't feel like name repair for duplicate columns names is a safe or advised maneuver to rely on when copying a spec from spec_csv() for use in read_csv().
jennybc commented 2 years ago

It seems like a stopgap readr 2e / vroom version of spec_csv() could just call read_csv() and then return the spec from that result.

(BTW various things seem wonky with this printed output)

library(readr)

value <- I("a,error,b,error,c,error\n1,string1,2,string2,3,string3")

dat <- read_csv(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`

spec(dat)
#> cols(
#>   a = col_double(),
#>   error...2 = col_character(),
#>   b = col_double(),
#>   error...4 = col_character(),
#>   c = col_double(),
#>   error...6 = col_character()
#> )

readr 2e / vroom column names: a, error...2, b, error...4, c, error...6

What happens if we just ask for the spec from the same input?

spec_csv(value)
#> Warning: Duplicated column names deduplicated: 'error' => 'error_1' [4], 'error'
#> => 'error_2' [6]
#> cols(
#>   a = col_double(),
#>   error = col_character(),
#>   b = col_double(),
#>   error_1 = col_character(),
#>   c = col_double(),
#>   error_2 = col_character()
#> )

Problem: we get readr 1e-style column names: a, error, b, error_1, c, error_2

It does seem reasonable to expect spec_csv() to return the same spec as spec(read_csv()), at least with respect to column names.

A simple fix would be to call read_csv() on the input with n_max = guess_max and then return that spec.

spec_csv_vroom <- function(..., guess_max = 1000) {
  tmp <- readr::read_csv(..., n_max = guess_max, guess_max = guess_max)
  spec(tmp)
}

spec_csv_vroom(value)
#> New names:
#> Rows: 1 Columns: 6
#> ── Column specification
#> ──────────────────────────────────────────────────────── Delimiter: "," chr
#> (3): error...2, error...4, error...6 dbl (3): a, b, c
#> ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
#> Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> • `error` -> `error...2`
#> • `error` -> `error...4`
#> • `error` -> `error...6`
#> cols(
#>   a = col_double(),
#>   error...2 = col_character(),
#>   b = col_double(),
#>   error...4 = col_character(),
#>   c = col_double(),
#>   error...6 = col_character()
#> )

Is there a reason not to do something along these lines?

DavisVaughan commented 2 years ago

Is there a reason not to do something along these lines?

Performance? I guess? i.e. this commit from Jim talks about vroom not supporting guessing without parsing https://github.com/tidyverse/readr/commit/c88303c8f62478d752ec21381feb80bccebb619d

But since spec_csv() and friends are probably relatively rarely used, the benefits of updating them to 2e probably greatly outweigh performance costs - especially since the long term goal is to remove 1e code

sbearrows commented 2 years ago

Is there a reason not to do something along these lines?

Same as @DavisVaughan, it would solve the problem especially in the long term for removing readr edition 1 code, but having spec_csv() be just a wrapper for spec(read_csv(n_max = guess_max)) seems odd

jennybc commented 2 years ago

Yeah the only thing I could think of was performance (or elegance). But I think returning the wrong column names is a much bigger sin that the downsides of a simple spec_*() implementation.

"readr doesn't have to parse to guess col types but vroom does" seems a bit misleading. readr is definitely consulting (tokenizing and typing) up to guess_max rows of data, in order to guess the column types. This particular point doesn't really resonate with me. I suppose if we were truly only column type guessing, we wouldn't load the data atoms into a data frame, but this seems like an awfully subtle thing to be worried about. Does anyone really "get" this point?

having spec_csv() be just a wrapper for spec(read_csv(n_max = guess_max)) seems odd

How so? It seems like the point of spec_csv() is to return the spec associated with the data frame you'd get if you read that file.

sbearrows commented 2 years ago

I know that spec_csv() is basically already a wrapper for ed1 spec(read_csv(n_max = 0, guess_max = guess_max)) so I guess it's not the "wrapper" bit that feels odd. Maybe I'm just blocked by what I think spec_csv() should be doing (ie guessing column types but not reading anything).

jennybc commented 2 years ago

guessing column types but not reading anything

But you can't guess column types without reading from the file.