mountainMath / cansim

Wrapper to access CANSIM data
Other
44 stars 5 forks source link

get_cansim_changed_tables inaccurate documentation #80

Closed ghost closed 4 years ago

ghost commented 4 years ago

The documentation for the function get_cansim_changed_tables(start_date) states that it will return "list of tables that have been modified or updated since the specified date". However, the function only returns tables that were changed on start_date, not those that have been changed since the specified date.

This isn't a big issue because a set of dates can be looped over to get the tables changed over a range of dates. However, this is less elegant than the functionality described in the function's documentation.

Here's what I get if I run the example in the documentation:

> get_cansim_changed_tables("2018-08-01")
# A tibble: 8 x 2
  productId releaseTime     
      <int> <chr>           
1  23100251 2018-08-01T08:35
2  33100036 2018-08-01T08:30
3  10100139 2018-08-01T08:30
4  10100125 2018-08-01T08:30
5  10100107 2018-08-01T08:30
6  33100005 2018-08-01T08:30
7  33100033 2018-08-01T08:30
8  33100084 2018-08-01T08:30

Based on the documentation I would expect to see all the tables changed from 2018-08-01 till today.

mountainMath commented 4 years ago

Thanks. That needs fixing. Flagged for the next release. Might have to change the argument name too as start_date is misleading. Or maybe iterate over days since start_date and add an end_date argument to align functionality with documentation. If end_date defaults to start_date if Null all current functionality will be retained.

Thoughts @dshkol?

dshkol commented 4 years ago

Agree, this should be fixed. The functionality describe is the functionality intended.

mountainMath commented 4 years ago

I have a fixed-up version in a new branch. https://github.com/mountainMath/cansim/tree/change_tables_fix This should address the problem by adding an optional end_date argument. If end_date is specified it will iterate through all dates between start and end date.

Please check to see if that solves the problem, install with

remotes::install_github("mountainmath/cansim@change_tables_fix")
dshkol commented 4 years ago

I've tested these two:

get_cansim_changed_tables(start_date =  "2018-08-01", end_date = "2019-08-01")
# A tibble: 7,342 x 2
   productId releaseTime     
       <int> <chr>           
 1  23100251 2018-08-01T08:35
 2  33100036 2018-08-01T08:30
 3  10100139 2018-08-01T08:30
 4  10100125 2018-08-01T08:30
 5  10100107 2018-08-01T08:30
 6  33100005 2018-08-01T08:30
 7  33100033 2018-08-01T08:30
 8  33100084 2018-08-01T08:30
 9  11100008 2018-08-02T08:35
10  33100163 2018-08-02T08:30
# … with 7,332 more rows

get_cansim_changed_tables(start_date =  "2020-09-07", end_date = "2020-09-14")
# A tibble: 97 x 2
   productId releaseTime     
       <int> <chr>           
 1  10100136 2020-09-08T08:30
 2  10100142 2020-09-08T08:30
 3  10100108 2020-09-08T08:30
 4  33100036 2020-09-08T08:30
 5  10100139 2020-09-08T08:30
 6  18100204 2020-09-08T08:30
 7  36100042 2020-09-08T08:30
 8  36100431 2020-09-08T08:30
 9  36100430 2020-09-08T08:30
10  18100160 2020-09-08T08:30
# … with 87 more rows

Both work, tho the first one is pretty slow. I wonder if there's a risk of bounced by the api if you run an especially long range?

dshkol commented 4 years ago

We should also add a safety check for the inputs if they're the wrong order.

get_cansim_changed_tables(start_date =  "2020-09-14", end_date = "2020-09-07")
Error in seq.int(0, to0 - from, by) : wrong sign in 'by' argument

Just a quick check of the input args that end_date >= start_date

# A tibble: 12 x 2
   productId releaseTime     
       <int> <chr>           
 1  10100158 2020-09-14T08:30
 2  10100142 2020-09-14T08:30
 3  33100036 2020-09-14T08:30
 4  10100139 2020-09-14T08:30
 5  32100359 2020-09-14T08:30
 6  12100120 2020-09-14T08:30
 7  18100248 2020-09-14T08:30
 8  14100332 2020-09-14T08:30
 9  14100331 2020-09-14T08:30
10  14100358 2020-09-14T08:30
11  14100357 2020-09-14T08:30
12  36100491 2020-09-14T08:30

Works ok too.

mountainMath commented 4 years ago

I added a check on the order and switch if it's mixed up, while also emitting a message informing the user. And added a check for querying time intervals longer than 31 days and emit a message warning the user that it may take long.

dshkol commented 4 years ago

I've pulled this into the dev branch for 0.3.6 and this will be fixed with the next release.