This struck me as a pain point. The original `gsheets_GET` function was (I think) always followed immediately by a call to `gsheets_parse`, which suggests we should move that parsing into `gsheets_GET`. I've done that now.
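A minimal sketch of the merged behavior, assuming `httr` for the request and the `XML` package for parsing (the signature and parsing details are my own, not the package's actual code):

```r
# Sketch: one function that performs the GET and immediately parses the body,
# so callers never handle raw response text themselves.
gsheets_GET <- function(url, ...) {
  req <- httr::GET(url, ...)
  httr::stop_for_status(req)                 # fail fast on 4xx/5xx
  raw <- httr::content(req, as = "text", encoding = "UTF-8")
  XML::xmlParse(raw, asText = TRUE)          # hand back an already-parsed document
}
```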
I then got a chance to share your XML parsing pain, using `XML::getNodeSet` and `xpathSApply`. I actually think these lines may have been stumbling into a memory leak problem that is apparently a "gotcha" just waiting to happen with the XML package:
> The only thing of concern to us here is when we are dealing with internal/C-representations. The critical issue is simply this. Suppose we have an XML document created by either xmlParse() or newXMLDoc(). Then we use some mechanism (e.g. XPath or direct manipulation of the nodes) to get access to one or more nodes. Now, if we remove doc or simply return nodes from a function that has doc as a local variable, we have a problem.
This is also mentioned in examples in the help file. I even re-wrote some code to address this and have stashed it locally, but it's a moot point now because I ultimately decided to use `XML::xmlToList` and get the information out of XML and into an R-native list ASAP (more below). I think this was quite likely contributing to the mysteriously slow performance of calls to our functions to open a spreadsheet, especially calls subsequent to an initial call.
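A minimal sketch of the "get out of XML ASAP" pattern, including freeing the internal document to sidestep the gotcha quoted above (the wrapper's name and signature are assumptions, not actual package code):

```r
# Sketch: parse the response text, convert it to a plain R list right away, and
# free the internal C-level document so no reference to it escapes this function.
feed_as_list <- function(xml_text) {
  doc <- XML::xmlParse(xml_text, asText = TRUE)
  res <- XML::xmlToList(doc)   # plain R list; safe to return and pass around
  XML::free(doc)               # release the internal document explicitly
  res
}
```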
I am concerned I will regret this policy decision, for example, once I get to refactoring the functions that update the spreadsheet. Obviously I will need to send XML, so maybe it's untenable to abandon XML for R lists the way I have.
Hadley mentioned `rvest` in the context of XML and HTML parsing, even if you are not web scraping. So if I backtrack and we get back into the XML parsing business, we should play with `rvest`.
Examples suggested by @sckott of XML parsing in API wrapper packages: the `taxize` package.

Another thing I rationalized in my stashed XML parsing code is the handling of the namespaces. I pull them directly out of the XML and feed them back to functions from the XML package.
Links that were helpful to me during my brief battle with the XML:
Bottom line: I have gotten us out of the XML parsing business as quickly as possible. Unclear if this is desirable or sustainable long-term.
The decision to apply `XML::xmlToList` early leaves us with lots of ugly lists. Ideally we would then use a list-friendly, functional-programming-oriented package like `rlist` or `purrr`. I haven't taken this plunge yet because … neither has been around a long time. For example, I can't find any reverse dependencies on CRAN for `rlist`. Hadley confirmed it was a bit early to depend on `lowliner` (now `purrr`). So right now we have quite a bit of ugly `lapply` and `plyr::l*ply` code for dealing with these lists. I have even written some crude utility functions to filter and pluck list components based on element name. I think we should just get our basic functionality in place and check back in on the world of FP for lists in a few months.
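For reference, a rough base-R sketch of the kind of crude filter/pluck helpers mentioned above (names and behavior are placeholders, not the actual stashed code):

```r
# Sketch: keep only the list elements whose name matches `name`,
# e.g. the "entry" elements of a feed converted with XML::xmlToList().
lfilter <- function(x, name) x[names(x) == name]

# Sketch: pull one named component out of each element of a list,
# e.g. the "title" of every worksheet entry.
lpluck <- function(x, name) lapply(x, function(el) el[[name]])

# Hypothetical usage:
# entries <- lfilter(feed, "entry")
# titles  <- lpluck(entries, "title")
```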
I want a comprehensive look at the visibility issue. We need to sharpen our thoughts about where and when visibility gets set vs. queried. This table by @jzhaoo is good background research.
| Sheet Type | Public? (Published to the web) | URL visibility setting | Response | Content Type | Content |
|---|---|---|---|---|---|
| My sheet | Yes | private | 200 OK | application/atom+xml | data |
| My sheet | Yes | public | 200 OK | application/atom+xml | data |
| My sheet | No | private | 200 OK | application/atom+xml | data |
| My sheet | No | public | 200 OK | text/html | "this doc is not public" |
| Someone else's sheet | Yes | private | 403 Forbidden | error | error |
| Someone else's sheet | Yes | public | 200 OK | application/atom+xml | data |
| Someone else's sheet | No | private | 200 OK | application/atom+xml | data |
| Someone else's sheet | No | public | 200 OK | text/html | "this doc is not public" |
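One practical consequence of this table: a sheet that is not public can still come back `200 OK`, so we have to branch on the content type, not just the status code. A minimal sketch, assuming `httr` (the error wording is my own):

```r
# Sketch: treat text/html as "this doc is not public"; only atom+xml is data.
get_feed <- function(url) {
  req <- httr::GET(url)
  httr::stop_for_status(req)                     # catches the 403 Forbidden case
  if (httr::http_type(req) != "application/atom+xml") {
    stop("This sheet is not published to the web or is not visible at this URL.")
  }
  httr::content(req, as = "text", encoding = "UTF-8")
}
```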
I'm not sure I approve of worksheets as first-class objects. By definition, an actual worksheet must live in a spreadsheet. We need to think about whether it is legit to ever define an R object that represents a specific worksheet. I'd like to consider re-factoring all such workflows to explicitly operate within a spreadsheet, with the user specifying which worksheet s/he wants to consume or edit. This affects whether, for example, the `open_worksheet` function should even exist.
UPDATE: I have removed any notion of a worksheet as a first-class object.
I discovered that our `str` and `print` (and maybe `view` as well?) methods made actual HTTP calls, which were potentially quite slow. I think `str` and `print` should be strictly local operations. In general, I think we should never make a request to the outside world inside a function where the user wouldn't expect it.

The `str` and `print` methods seemed very similar, so I got rid of the `print` methods and just kept `str`.
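A minimal sketch of a strictly local method, assuming a hypothetical `spreadsheet` S3 class whose fields were cached when the sheet was opened (the class and field names are assumptions):

```r
# Sketch: report only what is already stored in the object; no HTTP calls.
str.spreadsheet <- function(object, ...) {
  cat("Spreadsheet title:", object$sheet_title, "\n")
  cat("Worksheets:       ", length(object$ws), "\n")
  cat("Last updated:     ", format(object$updated), "\n")
  invisible()
}
```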
In order to learn where the data lives in a worksheet, you have to actually get the data. Right now we have a function that learns worksheet dimensions this way, which seems wasteful: it fetches the data just to report its extent. Since we've gotten the data, we should actually store it in case the user is about to request it anyway. I haven't done anything about this because I haven't started to refactor the data consumption functions yet.
I've now found that `find_cell()` also retrieves all the data via the cell feed and then essentially discards it. I'm deleting this function; it was not called by any other function. It would be more efficient to use `ws_to_df()` (a new function I've written that is similar to the old `get_lookup_tbl()`) to get an entire worksheet as a local data.frame, then determine which cell has the value being sought. Ditto all of the above for `find_all()`.
UPDATE: I've deleted all the functions that called the cell feed just to determine facts about a worksheet. We help people download data. If you need to know the dimensions, then download it and inspect it! The API will not reveal this information w/o download, so it's out of our hands.
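A minimal sketch of the "download it and inspect it" replacement for `find_cell()`, assuming `ws_to_df()` has already returned the worksheet as a local data.frame (the helper below is hypothetical):

```r
# Sketch: find the (row, column) position(s) of a value without extra API calls.
find_cell_locally <- function(ws_df, value) {
  hits <- which(ws_df == value, arr.ind = TRUE)   # matrix of row/col indices
  data.frame(row = hits[, "row"], col = hits[, "col"])
}

# Hypothetical usage:
# ws_df <- ws_to_df(ss, ws = "Sheet1")
# find_cell_locally(ws_df, "some value")
```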
I am gradually switching over to Hadley's recommendation, which is to default to calling functions from other packages like so: `plyr::laply()` or `stringr::str_replace()`. If something gets used over and over again, then I think we should use roxygen's `@importFrom this that` and put it in the comment header that generates documentation for the `gspreadr` package.
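A short illustration of the two styles (the documented function is hypothetical):

```r
# Style 1: occasional use -> fully qualify the call at the call site.
x <- stringr::str_replace("Sheet 1", " ", "_")

# Style 2: heavy use -> import once via roxygen, then call the function bare.
#' Open a spreadsheet by title
#'
#' @importFrom stringr str_replace
#' @export
open_by_title <- function(title) {
  str_replace(title, "^\\s+", "")   # ... then look the spreadsheet up by title
}
```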
What sort of objects do we inflict on the user? Do we pick one model or the other? Or is there a role or need for both?
Helpful quote from @sckott:
> I usually have e.g., a function that a user interacts with, but internally there is a separate function that takes care of the GET request, checking http headers if applicable, making sure the status code is in the acceptable range, parsing error messages, etc. But all that highly depends on the API you're working with. And the function the user interacts with gives back usually a list or data.frame. If there is metadata about the call that the user might want to see I usually return a list with the data as a data.frame, and metadata in another slot, which makes it easy to index to just the data. Or you can store metadata in the attributes so that the returned object can be a data.frame if that's easier.
I think we should default to keeping and parsing everything that comes back from our HTTP requests. For example, I'm storing all the links that come back now as a data.frame. Then we should also use that information whenever it's needed. For example, some of the URLs we now construct by hand are actually sitting there as part of what's returned from a previous request. It would be a better practice to never re-construct information that we've already gotten officially from the server.
Here's an example where we construct the URL for the cell feed but, in fact, we were given that when we opened the spreadsheet (or worksheet? I can't remember) in the first place:
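A minimal sketch of the preferred lookup, assuming the stored links data.frame has `rel` and `href` columns (the column names and the exact `rel` value are assumptions about the GData feed, not confirmed package internals):

```r
# Sketch: pull the cell feed URL out of the links we were already given,
# instead of pasting it together by hand.
cellsfeed_url <- function(links) {
  links$href[links$rel == "http://schemas.google.com/spreadsheets/2006#cellsfeed"]
}
```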
I rewrote the helper functions that convert column IDs from letter form (e.g., "AB") to numeric (e.g., 28) and back again. They are now vectorized and have no for or while loops.
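A vectorized sketch of the two conversions with no explicit `for`/`while` loops (the function names and the two-letter "ZZ" limit are assumptions; the package's actual helpers may differ):

```r
# Sketch: "A" -> 1, "Z" -> 26, "AB" -> 28; vectorized over a character vector.
letter_to_num <- function(x) {
  vapply(strsplit(toupper(x), ""), function(chars) {
    sum(match(chars, LETTERS) * 26 ^ (rev(seq_along(chars)) - 1))
  }, numeric(1))
}

# Sketch: 1 -> "A", 26 -> "Z", 28 -> "AB"; handles "A" through "ZZ".
num_to_letter <- function(x) {
  stopifnot(all(x >= 1), all(x <= 702))
  first  <- (x - 1) %/% 26          # 0 means there is no leading letter
  second <- (x - 1) %% 26 + 1
  paste0(c("", LETTERS)[first + 1], LETTERS[second])
}

# letter_to_num(c("A", "Z", "AA", "AB"))  # 1 26 27 28
# num_to_letter(c(1, 26, 27, 28))         # "A" "Z" "AA" "AB"
```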
When in doubt about what to call something or some activity, try to find precedent in the API docs or in the stuff returned by the server. For example, we should talk about spreadsheet and worksheet titles, not names. We must also be careful around the use of the word "ID". This is why we now talk about "consuming" the data.
Do we have one or more specific spreadsheets we are testing against? If so, instead of hard-wiring unreadable spreadsheet keys into various tests, the keys (or URLs or titles or whatever) should be stored in one central place, with explanatory comments, and utilized throughout the tests.
UPDATE: I've created a `helper.R` file inside the testthat directory where I initialize various objects for using `gspreadr`-controlled spreadsheets in tests.
The `helper.R` file also implements a hack for logging into Google as the `gspreadr` user, so we can test against the API as an authenticated user.
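For illustration, a minimal sketch of what such a `helper.R` might contain; every title and key below is a placeholder, not a real fixture:

```r
# tests/testthat/helper.R (sketch)
# Central definitions for the spreadsheets the test suite runs against,
# so no test hard-wires an unreadable key.

# A gspreadr-controlled public sheet used for read-only tests (placeholder key).
pts_title <- "public_testing_sheet"
pts_key   <- "PLACEHOLDER_KEY_FOR_PUBLIC_TESTING_SHEET"

# A gspreadr-controlled private sheet used for authenticated tests (placeholder key).
prv_title <- "private_testing_sheet"
prv_key   <- "PLACEHOLDER_KEY_FOR_PRIVATE_TESTING_SHEET"
```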
@jzhaoo
Over the past week I've made lots of changes. I will push when things are put together enough to be functional again. Here are notes about my motivations and questions we face. We can talk about all this when we meet. When I make this eventual commit and PR and merge, I'll close this.