ropensci / neotoma

Programmatic R interface to the Neotoma Paleoecological Database.
https://docs.ropensci.org/neotoma

Dynamically update tables. #57

Closed SimonGoring closed 10 years ago

SimonGoring commented 11 years ago

There will always be some tables sitting in the package as Rdata files. It would be great if we could check to see if the tables were out of date by talking to neotoma each time the package is started up.

i.e., on package load, check the table on Neotoma against the table in the program directory to tell whether it's been modified since the package was last updated. If the package is older than the latest version of the table on the website, then send a warning.

SimonGoring commented 11 years ago

@karthik or @SChamberlain I'm not really sure how to get a function to run on package startup. Right now gp.table lives in data, and I've written a function called check.tables (I should fix the name to be consistent). I think I'd like to have the table live in data, just in case someone is doing some work offline (?), but I'd like to call out to check if the table is the same as the one on the Neotoma site using the API when the package is first loaded by calling check.tables.

Is there a way to do this? I assume it's by using .onLoad, but when I modified NAMESPACE I lost the change the next time I compiled the package.

sckott commented 11 years ago

Hmm, I do use .onLoad, but I've not used it in this particular way before. Maybe @karthik or @gavinsimpson may know.

gavinsimpson commented 11 years ago

There are .onLoad and .onAttach hooks for packages, but I would suggest an alternative approach. Doing this every time the user loads the package will incur a delay while the API is queried. What happens if the user is not connected? Do they have to wait for a time-out?
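
For reference, here is a minimal sketch of what such a hook could look like. Hooks are ordinary functions defined in the package's R code (by convention often a file such as R/zzz.R), not entries in NAMESPACE. `tables_outdated()` is a hypothetical stand-in for something like check.tables, and the message text is made up for illustration.

```r
# Hypothetical stand-in for check.tables(): TRUE if the bundled tables are stale.
tables_outdated <- function() FALSE

# Runs when the package is attached with library().
.onAttach <- function(libname, pkgname) {
  # Skip the check in non-interactive sessions (scripts, R CMD check).
  if (!interactive()) return(invisible(NULL))
  # Never let a failed check (e.g. no connection) break attaching the package.
  outdated <- tryCatch(tables_outdated(), error = function(e) NA)
  if (isTRUE(outdated)) {
    packageStartupMessage("Local lookup tables may be older than those on Neotoma.")
  }
  invisible(NULL)
}
```

Using packageStartupMessage() means users can silence the note with suppressPackageStartupMessages().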

Instead, how about a function the user can call to freshen the tables in the current session? They would call that as needed.
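
Something along these lines, as a rough sketch only. The names `refresh_table()` and `get_table()` are hypothetical, and `fetch` is assumed to be a function that takes a table name and returns a data frame from the API; the refreshed copy lives in an environment for the current session and the bundled table is the fallback.

```r
# Session-level store for refreshed tables; lost when the R session ends.
.table_cache <- new.env(parent = emptyenv())

# Download a fresh copy of `name` using the supplied (hypothetical) `fetch` function.
refresh_table <- function(name, fetch) {
  fresh <- fetch(name)
  stopifnot(is.data.frame(fresh))
  assign(name, fresh, envir = .table_cache)
  invisible(fresh)
}

# Return the refreshed copy if one exists this session, else the bundled table.
get_table <- function(name, bundled) {
  if (exists(name, envir = .table_cache, inherits = FALSE)) {
    get(name, envir = .table_cache, inherits = FALSE)
  } else {
    bundled
  }
}
```

Offline users are unaffected, because nothing touches the network unless they call `refresh_table()` themselves.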

In general though, this is all a little inelegant, unless you have a mechanism to cache those updated tables locally? Is there a way we could think that through?

SimonGoring commented 11 years ago

Thanks @gavinsimpson and @SChamberlain. There is a function called check.tables, which is what I wanted to run on load. In the neotoma package there were going to be a few tables that would sit in the data folder to speed things up (automatically looking up the numeric gpid, for example), but these tables might change over time, and right now it's not clear to me what the protocol on the Neotoma side of things might be for changes.

In particular, we're going to move towards a 'steward' system for Neotoma meaning that there are domain experts who will vet data before entering it, and these experts can update the tables (probably not geographic tables, but taxonomic tables for sure). If that's the case, then it might cause problems, especially as Neotoma grows (once new data starts coming online at a regular pace).

Also, if we see geopolitical turmoil, such as we might expect after peak R, then those tables might change as well.

So, someone runs check.tables, it tells them that the tables are not identical, and then asks them to update the package. It would be awesome if we could replace only the tables in the data folder of the package, but I suspect that is much more difficult.

Freshening the tables would be okay I guess, but you'd have to do that each session, so a more permanent solution would be ideal.

Anyway, it should be straightforward to at least check that the user is connected to the internet before running the API. It's not implemented right now in this package, but there are some simple tests here that I could run.
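
A connectivity check could be as small as the sketch below, using only base R. `can_reach()` is a hypothetical helper, not something in the package, and the address to probe (presumably the API's base URL) is left to the caller.

```r
# TRUE if `address` can be opened within `timeout` seconds, FALSE otherwise.
can_reach <- function(address, timeout = 5) {
  # Temporarily shorten R's connection timeout, restoring it on exit.
  old <- options(timeout = timeout)
  on.exit(options(old), add = TRUE)
  tryCatch({
    con <- suppressWarnings(url(address, open = "rb"))
    close(con)
    TRUE
  }, error = function(e) FALSE)
}
```

Calling this before any API request would let check.tables return early with a short message instead of waiting on a long time-out.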

sckott commented 11 years ago

@cboettig uses a function updateCache in rfishbase (here) to update the local copy of the data that comes with the package with the data on the FishBase server. He lets the user specify the path to the file, with the date in the file name to distinguish versions, so they can keep it anywhere. Just another approach...
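
To illustrate the idea (this is not rfishbase's actual code, just a sketch with hypothetical names `save_dated()` and `latest_dated()`), a dated cache file in a user-chosen directory could work like this:

```r
# Save `table` as <prefix>_<YYYY-MM-DD>.rds in `dir` and return the path.
save_dated <- function(table, dir = getwd(), prefix = "gp_table",
                       date = Sys.Date()) {
  path <- file.path(dir, sprintf("%s_%s.rds", prefix, format(date, "%Y-%m-%d")))
  saveRDS(table, path)
  invisible(path)
}

# Load the most recently dated copy in `dir`, or NULL if there is none.
# Assumes `prefix` contains no regular-expression metacharacters.
latest_dated <- function(dir = getwd(), prefix = "gp_table") {
  pattern <- paste0("^", prefix, "_[0-9]{4}-[0-9]{2}-[0-9]{2}\\.rds$")
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  if (length(files) == 0) return(NULL)
  # ISO dates sort correctly as plain strings, so the last one is the newest.
  readRDS(sort(files, decreasing = TRUE)[1])
}
```

Older copies are kept alongside the newest one, so a user can go back to the table as it was on a given date.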

gavinsimpson commented 11 years ago

Carl's function writes to the current working directory by default, and it doesn't change the data in the package's ./data folder, which would most likely not be allowed by CRAN. To wit, from the CRAN Policy Document:

Just be aware of this as the discussion continues.

SimonGoring commented 10 years ago

Okay. I'm going to drop this function. I've decided to go back and remove the tables from the package, except the pollen equivalence table and any others that the package explicitly needs in order to function. Thanks @gavinsimpson for posting the Policies; it's probably safer to just let people update the package themselves. If they can install the package from GitHub then they've got internet...