Open akgold opened 3 years ago
I'd say this is all working as documented. This is how googlesheets4, readxl, and readr have always worked: there's a guess_max
parameter that controls how many rows we look at when guessing column type.
(It's possible that round tripping empty strings should be easier? But that's a separate matter.)
library(googlesheets4)
library(googledrive)
library(tidyverse)
# hidden auth chunk here
tbl <- tibble(i = 1:2000, str = c(rep("", 1999), "blah"))
ss <- gs4_create(sheets = tbl)
#> ✓ Creating new Sheet: 'compulsory-fanworms'
# saw some weird errors, so let's slow things down
Sys.sleep(2)
dat <- read_sheet(ss, "tbl")
#> ✓ Reading from 'compulsory-fanworms'
#> ✓ Range ''tbl''
dat
#> # A tibble: 2,000 x 2
#> i str
#> <dbl> <lgl>
#> 1 1 NA
#> 2 2 NA
#> 3 3 NA
#> 4 4 NA
#> 5 5 NA
#> 6 6 NA
#> 7 7 NA
#> 8 8 NA
#> 9 9 NA
#> 10 10 NA
#> # … with 1,990 more rows
tail(dat)
#> # A tibble: 6 x 2
#> i str
#> <dbl> <lgl>
#> 1 1995 NA
#> 2 1996 NA
#> 3 1997 NA
#> 4 1998 NA
#> 5 1999 NA
#> 6 2000 NA
The guess_max
argument is important:
guess_max
Maximum number of data rows to use for guessing column types.
Its default is 1000 (or n_max
, if that’s smaller):
read_sheet(ss, ..., guess_max = min(1000, n_max))
This is also how readxl and readr work.
dat <- read_sheet(ss, "tbl", guess_max = Inf)
#> ✓ Reading from 'compulsory-fanworms'
#> ✓ Range ''tbl''
dat
#> # A tibble: 2,000 x 2
#> i str
#> <dbl> <chr>
#> 1 1 <NA>
#> 2 2 <NA>
#> 3 3 <NA>
#> 4 4 <NA>
#> 5 5 <NA>
#> 6 6 <NA>
#> 7 7 <NA>
#> 8 8 <NA>
#> 9 9 <NA>
#> 10 10 <NA>
#> # … with 1,990 more rows
tail(dat)
#> # A tibble: 6 x 2
#> i str
#> <dbl> <chr>
#> 1 1995 <NA>
#> 2 1996 <NA>
#> 3 1997 <NA>
#> 4 1998 <NA>
#> 5 1999 <NA>
#> 6 2000 blah
is.character(dat$str)
#> [1] TRUE
An absence of cell data is brought in as NA
, but we can use
tidyr::replace_na()
to bring back explicit empty strings, if desired.
dat <- dat %>%
replace_na(list(str = ""))
identical(tbl$str, dat$str)
#> [1] TRUE
drive_rm(ss)
#> File deleted:
#> • 'compulsory-fanworms' <id: 1Mu0JrRh0ZASocbL0SosFeCYvaC-4_ECCoRmkkixJfTU>
Created on 2021-07-11 by the reprex package (v2.0.0.9000)
That totally makes sense. The thing that made this difficult for me to puzzle out was the silence of this potentially destructive behavior. I wasn't aware of guess_max
, and no error or warning occurred here to alert me to that being what was going on. My "solution" was to put a junk character in the first row so it would guess correctly.
Maybe there's no easy way to tell when this occurs, but something like, "First value non-blank value in col str
found after guess_max
, coercing to logical
." would've been super helpful.
In readr this would show up in the "problems" report. I'll reopen this.
readr::read_csv("x,y\na,\nc,d", guess_max = 1)
#> Warning: 1 parsing failure.
#> row col expected actual file
#> 2 y 1/0/T/F/TRUE/FALSE d literal data
#> # A tibble: 2 x 2
#> x y
#> <chr> <lgl>
#> 1 a NA
#> 2 c NA
Created on 2021-07-12 by the reprex package (v2.0.0.9000)
My "solution" was to put a junk character in the first row so it would guess correctly.
Just as an FYI for all of these packages, I think it's more common to do guess_max = Inf
either as a policy (if one works with datasets where the size is not prohibitive for this) or when doing the initial "getting to know you" intake of a dataset. Definitely better than editing the data itself. The real gold standard is to provide column types (I would definitely specify that a column is a character, before I added junk data to influence guessing).
This will presumably get bundled up with the Great Col Spec Project that is coming soon for me (googlesheets4 & readxl) #51.
On read, a column is assumed to be logical after some (maybe 1,000?) rows of missing data. If a character is at the bottom, then this silently fails with no error.