tidyverse / readr

Read flat files (csv, tsv, fwf) into R
https://readr.tidyverse.org

Unexpected behaviour when reading empty columns #1551

Open lowk opened 1 month ago

I am having trouble getting readr::read_csv to read empty columns as character vectors rather than as logical vectors of NAs. Setting na = "NA" does not have the expected effect (empty columns are still read as logical NAs) and also produces an unexpected warning. This may be a bug.

An example of what I mean:

# Make a tibble with an empty column
df_write <- tibble::tibble(a = 1:3, b = "")

# Check the class of column b
class(df_write$b)
# [1] "character"

# Write the tibble to a temporary file
filepath <- tempfile()
readr::write_csv(df_write, file = filepath)

# Read it in using read_csv, treating empty strings as characters rather than missing:
df_read <- readr::read_csv(file = filepath, col_types = readr::cols(), na = "NA")

# Warning message:
# One or more parsing issues, call `problems()` on your data frame for details,
# e.g.:
#   dat <- vroom(...)
#   problems(dat)

#Check the class of column b
class(df_read$b)
#[1] "logical"

# Check the parsing problems (namespaced, since readr is not attached in this example)
readr::problems(df_read)
# A tibble: 3 × 5
#    row   col expected           actual file                                    
#  <int> <int> <chr>              <chr>  <chr>                                   
#1     2     2 1/0/T/F/TRUE/FALSE ""     /private/var/folders/65/zc1jdwvx0m5gw8t…
#2     3     2 1/0/T/F/TRUE/FALSE ""     /private/var/folders/65/zc1jdwvx0m5gw8t…
#3     4     2 1/0/T/F/TRUE/FALSE ""     /private/var/folders/65/zc1jdwvx0m5gw8t…

I suspect this is a bug because, with na = "NA", I would expect a column of empty strings to be read as a character vector of empty strings rather than as a logical vector of NAs.

Digging a bit deeper, the issue seems to come from type guessing: guess_parser, which parse_guess relies on, guesses a vector of empty strings as logical even if na = "NA":

#this gives logical, as expected:
readr::guess_parser("")
#[1] "logical"

#this also gives logical, whereas I would expect it to default to the more general "character":
readr::guess_parser("", na = "NA")
#[1] "logical"

If this isn't a bug, what is the correct way to get read_csv to read empty columns as strings in general? In the case above I can set col_types = cols(b = col_character()), but what should I do if I don't know ahead of time which columns will contain only empty strings?