Failure when getting date range including 2018-03-31 to 2018-04-01

pbulsink commented 6 years ago

Expected Behavior

weather_dl() should always return a data frame.

Current Behavior

Sometimes, weather_dl() reports

Error in read.table(file = file, header = header, sep = sep, quote = quote,   : 
more columns than column names

Steps to Reproduce (for bugs)

When extracting data over a date range that includes 2018-03-31 through 2018-04-01, the error is produced. This is not weather-station dependent.

kam <- weather_dl(station_ids = 51423, start = "2018-03-31", end = "2018-04-01") kam2 <- weather_dl(station_ids=42203, start = "2018-03-31", end = "2018-04-01") kempt <- weather_dl(station_ids = 27534, start = "2018-03-31", end = "2018-04-01")

Any other date range is ok: kam <- weather_dl(station_ids = 51423, start = "2018-03-01", end = "2018-03-31") kam <- weather_dl(station_ids = 51423, start = "2018-04-01", end = "2018-04-30") kam <- weather_dl(station_ids = 51423, start = "2017-03-31", end = "2017-04-01")

Possible Solution

Comparing the column names of both dates retrieved individually (2018-03-31, 2018-04-01) returns no differences with testthat::expect_equivalent().

Your Environment

Session info ------------------------------------------------------------------------------------------------------------- setting value
version R version 3.5.0 (2018-04-23) system x86_64, mingw32
ui RStudio (1.1.383)
language (EN)
collate English_Canada.1252
tz America/New_York
date 2018-05-29

Packages ----------------------------------------------------------------------------------------------------------------- package version date source
assertthat 0.2.0 2017-04-11 CRAN (R 3.5.0)
base 3.5.0 2018-04-23 local
bindr 0.1.1 2018-03-13 CRAN (R 3.5.0)
bindrcpp 0.2.2 2018-03-29 CRAN (R 3.5.0)
compiler 3.5.0 2018-04-23 local
curl 3.2 2018-03-28 CRAN (R 3.5.0)
datasets 3.5.0 2018-04-23 local
devtools 1.13.5 2018-02-18 CRAN (R 3.5.0)
digest 0.6.15 2018-01-28 CRAN (R 3.5.0)
dplyr 0.7.5 2018-05-19 CRAN (R 3.5.0)
glue 1.2.0 2017-10-29 CRAN (R 3.5.0)
graphics 3.5.0 2018-04-23 local
grDevices 3.5.0 2018-04-23 local
httr 1.3.1 2017-08-20 CRAN (R 3.5.0)
lubridate 1.7.4 2018-04-11 CRAN (R 3.5.0)
magrittr 1.5 2014-11-22 CRAN (R 3.5.0)
memoise 1.1.0 2017-04-21 CRAN (R 3.5.0)
methods 3.5.0 2018-04-23 local
pillar 1.2.1 2018-02-27 CRAN (R 3.5.0)
pkgconfig 2.0.1 2017-03-21 CRAN (R 3.5.0)
purrr 0.2.4 2017-10-18 CRAN (R 3.5.0)
R6 2.2.2 2017-06-17 CRAN (R 3.5.0)
Rcpp 0.12.16 2018-03-13 CRAN (R 3.5.0)
rlang 0.2.0 2018-02-20 CRAN (R 3.5.0)
stats 3.5.0 2018-04-23 local
stringi 1.1.7 2018-03-12 CRAN (R 3.5.0)
stringr 1.3.1 2018-05-10 CRAN (R 3.5.0)
testthat 2.0.0 2017-12-13 CRAN (R 3.5.0)
tibble 1.4.2 2018-01-22 CRAN (R 3.5.0)
tidyr 0.8.1 2018-05-18 CRAN (R 3.5.0)
tidyselect 0.2.4 2018-02-26 CRAN (R 3.5.0)
tools 3.5.0 2018-04-23 local
utils 3.5.0 2018-04-23 local
weathercan 0.2.6 2018-05-29 Github (ropensci/weathercan@3b8d291) withr 2.1.2 2018-03-15 CRAN (R 3.5.0)
xml2 1.2.0 2018-01-24 CRAN (R 3.5.0)
yaml 2.1.18 2018-03-08 CRAN (R 3.5.0)

pbulsink commented 6 years ago

traceback()

9: stop("more columns than column names")
8: read.table(file = file, header = header, sep = sep, quote = quote, 
       dec = dec, fill = fill, comment.char = comment.char, ...)
7: utils::read.csv(text = httr::content(html, as = "text", type = "text/csv", 
       encoding = encoding), nrows = nrows, strip.white = TRUE, 
       skip = skip, header = header, colClasses = "character", check.names = FALSE)
6: eval(lhs, parent, parent)
5: eval(lhs, parent, parent)
4: utils::read.csv(text = httr::content(html, as = "text", type = "text/csv", 
       encoding = encoding), nrows = nrows, strip.white = TRUE, 
       skip = skip, header = header, colClasses = "character", check.names = FALSE) %>% 
       dplyr::mutate_at(.vars = dplyr::vars(dplyr::ends_with("Flag")), 
           dplyr::funs(gsub("^I$", "^", .)))
3: weather_raw(station_id = s, date = date_range[i], interval = interval, 
       skip = skip, url = url, encoding = encoding)
2: rbind(w, weather_raw(station_id = s, date = date_range[i], interval = interval, 
       skip = skip, url = url, encoding = encoding))
1: weather_dl(station_ids = 51423, start = "2018-03-31", end = "2018-04-01")

steffilazerte commented 6 years ago

Thanks for the comprehensive testing! I can see where the problem is (the preamble or metadata changes in the files between pre and post April 2018). I've got a couple of ideas for addressing the problem which will hopefully make the package more robust to future changes. Will try them out and let you know how it goes.

steffilazerte commented 6 years ago

Okay, so the newest version on the dev-steffi branch should have fixed this issue. Could you let me know if it works for you? Note that this will result in a new column (station operator) in data that includes a date after April 1st 2018.

pbulsink commented 6 years ago

Seems to work on my end too. Thanks for your quick help!

pbulsink commented 6 years ago

One more error in the same vein: When downloading from multiple station ID's with old and new data (such as 4291 and 27534, where the latter 'supersceded' the former for data collection in 1997), the rbind fails as they have different numbers of columns. Example code:

weather_dl(c(4291, 27534), interval = 'day')

And the traceback:

traceback()
4: stop("numbers of columns of arguments do not match")
3: rbind(deparse.level, ...)
2: rbind(w_all, w)
1: weathercan::weather_dl(c(4291, 27534), interval = "day", start = "1996-01-01")

A proposed fix could be using dplyr::bind_rows() instead of base::rbind() as it quietly and automatically fills missing columns with NA values.

steffilazerte commented 6 years ago

Ah thanks for catching that, I overlooked that part of the testing. Will fix it now.

steffilazerte commented 6 years ago

Okay, columns are added for older data missing the new meta data columns (filled with NA if missing). This should fix the problem when joining new and old data between stations. Let me know how it goes!

pbulsink commented 6 years ago

Seems to work great now! Thanks

pbulsink commented 6 years ago

One thing to consider: the builtin kamloops and kamloops_day datasets don't have a station_operator column. When rebuilding for master they should maybe be updated, but this may introduce breaking changes to your or other's testing code.

steffilazerte commented 6 years ago

They should definitely be updated, as should the Read me and Vignettes so they match the current state of the package. I don't think they'll break any tests (that I know of). Thanks for the reminder!

ropensci / weathercan