rywhale / kiwisR

Provides a simplified method for bringing tidy data into R from KISTERS WISKI databases via KiWIS API.
Other
12 stars 2 forks source link

Improve Handling of Query Limit Issues #13

Open Nova-Scotia opened 4 years ago

Nova-Scotia commented 4 years ago

I just updated kiwisR to take advantage of some newer functionality (ability to query other fields with ki_timeseries_values - thanks for adding this)!

Previously, I could run this code without errors:

library(tidyverse)
library(kiwisR)

# List all stations in swmc hub that match "SNOW-WLF" 
mystations <- ki_station_list(hub = myhub) %>% 
  filter(str_sub(station_no, 1, 8) == "SNOW-WLF" &
           !station_no %in% c("SNOW-WLF-XX", "SNOW-WLF-TEMP"))

# List all codes that connect the station codes with their available time series.
my_station_ids <- unique(mystations$station_id)

# List all timeseries associated with each code.
available_ts <- ki_timeseries_list(
  hub = myhub, 
  station_id = my_station_ids
)

Now, when I run the last line of code (list all timeseries associated with each code), I get the error, "Error: object of type 'closure' is not subsettable".

I tried putting the last line in a loop and it works ok - I thought I could find the problem station by doing this here, but I didn't.

for (i in seq_along(my_station_ids)) {
available_ts <- ki_timeseries_list(
  hub = myhub, 
  station_id = my_station_ids[i]
)
}

Can you reproduce this issue? If so, any idea on how to fix it?

Nova-Scotia commented 4 years ago

The code works this morning for test db. Haven't retried it on prod. Maybe it's related to the recent issues in production db?

rywhale commented 4 years ago

@Nova-Scotia Could be server-related but I think this might also be due to passing too many station_ids to ki_timeseries_list. If I run your example with hub="swmc" I get 200+ unique station_ids:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
library(kiwisR)
#> Warning: package 'kiwisR' was built under R version 3.6.2

# List all stations in swmc hub that match "SNOW-WLF"
mystations <- ki_station_list(
  hub = "swmc"
  ) %>%
  filter(
    str_sub(station_no, 1, 8) == "SNOW-WLF" &
    !station_no %in% c(
      "SNOW-WLF-XX", 
      "SNOW-WLF-TEMP"
      )
    )

# List all codes that connect the station codes with their available time series.
my_station_ids <- unique(mystations$station_id)

print(length(my_station_ids))
#> [1] 218

Created on 2020-01-28 by the reprex package (v0.3.0)

You can get around this by chunking your query (e.g.- by filtering ts_name when you call ki_timeseries_list instead of getting every time series). Alternatively, you can loop and build the table with purrr::map_df. As a bonus, here's a method that will add a progress bar:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(stringr)
library(kiwisR)
#> Warning: package 'kiwisR' was built under R version 3.6.2
library(purrr)

# List all stations in swmc hub that match "SNOW-WLF"
mystations <- ki_station_list(
  hub = "swmc"
  ) %>%
  filter(
    str_sub(station_no, 1, 8) == "SNOW-WLF" &
    !station_no %in% c(
      "SNOW-WLF-XX", 
      "SNOW-WLF-TEMP"
      )
    )

# List all codes that connect the station codes with their available time series.
my_station_ids <- unique(mystations$station_id)

# Init progress bar
pb <- progress_estimated(length(my_station_ids))

station_ts <- my_station_ids %>%
  purrr::map_df(
    function(stn_id){
      # Add tick to bar
      pb$tick()$print()

      ki_timeseries_list(
        hub = "swmc",
        station_id = stn_id
      )
    }
  )

head(station_ts)
#> # A tibble: 6 x 6
#>   station_name station_id ts_id  ts_name from                to                 
#>   <chr>        <chr>      <chr>  <chr>   <dttm>              <dttm>             
#> 1 ALEXANDRIA   967631     12280~ SPG.Av~ NA                  NA                 
#> 2 ALEXANDRIA   967631     12156~ SPG.9.P NA                  NA                 
#> 3 ALEXANDRIA   967631     12156~ SPG.10~ NA                  NA                 
#> 4 ALEXANDRIA   967631     12156~ SPG.5.P NA                  NA                 
#> 5 ALEXANDRIA   967631     12156~ SPG.Av~ NA                  NA                 
#> 6 ALEXANDRIA   967631     12280~ SPG.Me~ NA                  NA

Created on 2020-01-28 by the reprex package (v0.3.0)

Open to suggestions on long-term fixes for the query limit issues. Two potential options:

Nova-Scotia commented 4 years ago

Nice ideas - thanks so much for taking the time to look into this. I'll try out your posted solution and I LOVE progress bars :). The server limit is definitely an annoying barrier.

Just wanted to note though - my previous script did work before I updated from an older version of kiwisR to the newer version. Did anything change? Test also has about the same amount of data as prod.

Nova-Scotia commented 4 years ago

Also, I like the first option if you're going to add a perma-fix - including the progress bar!

rywhale commented 4 years ago

@Nova-Scotia There are some changes coming down in R 4.0 related to classes/inheritance and I had to adjust some of the internal type checks to reflect this (#11, commit). More information on this here if you're interested in learning more.

I'm guessing that we're seeing this message when kiwisR fails to properly present the error message returned by the server upon reaching/surpassing the query limit. As to why you didn't hit the limit before- have no idea tbh. Perhaps the query limit is flexible/linked to how busy the server is 🤷‍♂ ?

Nova-Scotia commented 4 years ago

Hey Ryan, just a quick FYI - it seems like the behaviour of the server is unpredictable. I just tried a minor variation on the code you supplied and got an error:

# Init progress bar
pb <- progress_estimated(length(stations$station_id))

# Grab data in chunks so you don't hit the server limits
station_ts <- stations$station_id %>%
  purrr::map_df(
    function(stn_id){
      # Add tick to bar
      pb$tick()$print()

      ki_timeseries_list(
        hub = myhub,
        station_id = stn_id
      )
    }
  )

| | 0% ~11 s remaining Error: object of type 'closure' is not subsettable

I dislike these server limits very much!