jennybc / googlesheets

Google Spreadsheets R API

Refactoring, part 1 #12

Closed jennybc closed 9 years ago

jennybc commented 9 years ago

@jzhaoo

Over the past week I've made lots of changes. I will push when things are put together enough to be functional again. Here are notes about my motivations and the questions we face. We can talk about all this when we meet. Once I make that eventual commit, PR, and merge, I'll close this.

jennybc commented 9 years ago

XML parsing

This struck me as a pain point. The original gsheets_GET function was (I think) always followed immediately by a call to gsheets_parse, which suggests we should move that parsing into gsheets_GET. I've done that now.

I then got a chance to share your XML parsing pain, using XML::getNodeSet and xpathSApply. I actually think these lines may have been stumbling into a memory leak problem that is apparently a "gotcha" just waiting to happen with the XML package:

> The only thing of concern to us here is when we are dealing with internal/C-representations. The critical issue is simply this. Suppose we have an XML document created by either xmlParse() or newXMLDoc(). Then we use some mechanism (e.g. XPath or direct manipulation of the nodes) to get access to one or more nodes. Now, if we remove doc or simply return nodes from a function that has doc as a local variable, we have a problem.

This is also mentioned in examples in the help file. I even re-wrote some code to address this and have stashed it locally, but it's a moot point now because I ultimately decided to use XML::xmlToList and get the information out of XML and into a native R list ASAP (more below). I think this was quite likely contributing to the mysteriously slow performance of calls to our functions to open a spreadsheet, especially calls subsequent to an initial call.
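As a rough sketch of the "get out of XML as soon as possible" idea (this body is my illustration, assuming httr for the request layer; it is not the actual gsheets_GET code):

```r
library(httr)
library(XML)

gsheets_GET <- function(url, ...) {
  req <- GET(url, ...)
  stop_for_status(req)
  doc <- xmlParse(content(req, as = "text", encoding = "UTF-8"), asText = TRUE)
  out <- xmlToList(doc)  # get into a native R list right away
  free(doc)              # release the internal/C representation promptly
  out
}
```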

I am concerned I will regret this policy decision, for example, once I get to refactoring the functions that update the spreadsheet. Obviously I will need to send XML, so maybe it's untenable to abandon XML for R lists the way I have.

Hadley mentioned rvest in the context of XML and HTML parsing, even if you are not web scraping. So if I backtrack and we get back into the XML parsing business, we should play with rvest.

Examples suggested by @sckott of XML parsing in API wrapper packages:

Another thing I rationalized in my stashed XML parsing code is the handling of the namespaces. I pull them directly out of the XML and feed them back to functions from the XML package.
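The stashed code is roughly of this shape (an assumption about its flavor, not the actual code): read the namespaces off the parsed document and hand them straight back to the XPath call.

```r
library(XML)

doc <- xmlParse(xml_txt, asText = TRUE)               # xml_txt holds the response body
ns  <- xmlNamespaceDefinitions(doc, simplify = TRUE)  # prefix -> URI; "" is the default (Atom) namespace
entries <- getNodeSet(doc, "//atom:entry", namespaces = c(atom = ns[[1]]))
vals <- sapply(entries, xmlValue)                     # pull plain text out before letting doc go
free(doc)
```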

Links that were helpful to me during my brief battle with the XML:

Bottom line: I have gotten us out of the XML parsing business as quickly as possible. Unclear if this is desirable or sustainable long-term.

Functional programming with lists

The decision to apply XML::xmlToList early leaves us with lots of ugly lists. Ideally we would then use a list-friendly functional-programming-oriented package like rlist or purrr. I haven't taken this plunge yet because … neither has been around a long time. For example, I can't find any reverse dependencies on CRAN for rlist. Hadley confirmed it was a bit early to depend on lowliner (now purrr). So right now we have quite a bit of ugly lapply and plyr::l*ply code for dealing with these lists. I have even written some crude utility functions to filter and pluck list components based on element name. I think we should just get our basic functionality in place and check back in on the world of FP for lists in a few months.
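For the record, the crude utilities are essentially of this flavor (names and bodies below are illustrative, not the actual functions):

```r
## filter a list's components by element name
lfilter <- function(x, name) x[names(x) == name]

## pluck the same named element out of every component
lpluck <- function(x, name) lapply(x, `[[`, name)

links <- list(list(rel = "self", href = "url1"), list(rel = "edit", href = "url2"))
lpluck(links, "href")
```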

Spreadsheet visibility

I want to take a comprehensive look at the visibility issue and sharpen our thoughts about where and when visibility gets set vs. queried. This table by @jzhaoo is good background research.

| Sheet Type | Public? (Published to the web) | URL visibility setting | Response | Content Type | Content |
|---|---|---|---|---|---|
| My sheet | Yes | private | 200 OK | application/atom+xml | data |
| My sheet | Yes | public | 200 OK | application/atom+xml | data |
| My sheet | No | private | 200 OK | application/atom+xml | data |
| My sheet | No | public | 200 OK | text/html | this doc is not public |
| Someone else's sheet | Yes | private | 403 Forbidden | Error | Error |
| Someone else's sheet | Yes | public | 200 OK | application/atom+xml | data |
| Someone else's sheet | No | private | 200 OK | application/atom+xml | data |
| Someone else's sheet | No | public | 200 OK | text/html | this doc is not public |
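For reference, visibility enters the request as a path component of the feed URL, so the setting is baked into the request itself. A sketch, assuming the old GData-style URL shape:

```r
## build a worksheets feed URL for a given visibility (assumed URL shape)
build_ws_feed_url <- function(key, visibility = c("private", "public")) {
  visibility <- match.arg(visibility)
  sprintf("https://spreadsheets.google.com/feeds/worksheets/%s/%s/full",
          key, visibility)
}

build_ws_feed_url("some-key", "public")
```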

Worksheets as first-class objects

I'm not sure I approve of worksheets as first-class objects. By definition, an actual worksheet must live in a spreadsheet. We need to think about whether it is legit to ever define an R object that represents a specific worksheet. I'd like to consider refactoring all such workflows to explicitly operate within a spreadsheet, with the user specifying which worksheet s/he wants to consume or edit.

This affects whether, for example, the open_worksheet function should even exist.

UPDATE: I have removed any notion of a worksheet as a first-class object.

Unexpected stealthy HTTP requests

I discovered that our str and print (and maybe view as well?) methods made actual HTTP calls, which were potentially quite slow. I think str and print should be strictly local operations. In general, I think we should never make a request to the outside world inside a function where the user wouldn't expect it.

Print method

The str and print methods seemed very similar. So I got rid of the print methods and just kept str.

Getting worksheet dimensions

In order to learn where the data lives in a worksheet, you have to actually get the data. Right now we have a function that learns worksheet dimensions, but it seems wasteful: since we've gotten the data anyway, we should actually store it in case the user is about to request it. I haven't done anything about this because I haven't started to refactor the data consumption functions yet.

I've now found that find_cell() also retrieves all the data via the cell feed and then essentially discards it. It would be more efficient to use ws_to_df() (a new function I've written, similar to the old get_lookup_tbl()) to get an entire worksheet as a local data.frame and then determine which cell holds the value being sought. So I'm deleting find_cell(); it was not called by any other function. Ditto all of the above for find_all().

UPDATE: I've deleted all the functions that called the cell feed just to determine facts about a worksheet. We help people download data. If you need to know the dimensions, then download it and inspect it! The API will not reveal this information w/o download, so it's out of our hands.
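A sketch of what "download it and inspect it" looks like in practice (ws_to_df() is the new function mentioned above, but the exact call below is a guess at its interface):

```r
## 'ss' stands in for a previously registered/opened spreadsheet object
df <- ws_to_df(ss, ws = "Sheet1")             # entire worksheet as a local data.frame
dim(df)                                       # worksheet dimensions, no extra request
which(df == "some value", arr.ind = TRUE)     # what find_cell() used to do, now done locally
```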

Imports from other packages

I am gradually switching over to Hadley's recommendation, which is to default to calling functions from other packages explicitly, like so: plyr::laply() or stringr::str_replace(). If something gets used over and over again, then I think we should use roxygen's @importFrom tag and put it in the comment block that generates the documentation for the gspreadr package.
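Roughly, the two styles look like this (illustrative only; the actual import list will evolve):

```r
## one-off use: call with the package prefix
stringr::str_replace("column  name", "\\s+", " ")

## heavy use: import once, in the roxygen block that generates the package documentation
#' gspreadr: Google spreadsheets from R
#'
#' @importFrom plyr laply
#' @docType package
#' @name gspreadr
NULL
```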

What to give the user

What sort of objects do we inflict on the user?

Do we pick one model or the other? Or is there a role or need for both?

Helpful quote from @sckott:

> I usually have e.g., a function that a user interacts with, but internally there is a separate function that takes care of the GET request, checking http headers if applicable, making sure the status code is in the acceptable range, parsing error messages, etc. But all that highly depends on the API you're working with. And the function the user interacts with gives back usually a list or data.frame. If there is metadata about the call that the user might want to see I usually return a list with the data as a data.frame, and metadata in another slot, which makes it easy to index to just the data. Or you can store metadata in the attributes so that the returned object can be a data.frame if that's easier.
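A minimal sketch of the "data.frame with metadata in the attributes" option from that quote (all names here are hypothetical):

```r
as_consumed_df <- function(df, meta) {
  attr(df, "meta") <- meta   # stash metadata about the call on the data itself
  df
}

df <- as_consumed_df(data.frame(x = 1:3),
                     meta = list(ws_title = "Sheet1", retrieved = Sys.time()))
attr(df, "meta")$ws_title    # still just a data.frame, with metadata along for the ride
```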

Hoard information and don't re-derive it

I think we should default to keeping and parsing everything that comes back from our HTTP requests. For example, I'm now storing all the links that come back as a data.frame. Then we should also use that information whenever it's needed. For example, some of the URLs we now construct by hand are actually sitting there as part of what's returned from a previous request. It would be better practice to never re-construct information that we've already gotten officially from the server.

Here's an example where we construct the URL for the cell feed but, in fact, we were given that when we opened the spreadsheet (or worksheet? I can't remember) in the first place:

https://github.com/jennybc/gspreadr/blob/4a14f5751f2b52db7d7ecbf9dc90262edaf60ab0/R/consume-data.R#L251
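Assuming the hoarded links data.frame has rel and href columns (which is how the Atom links come back), the lookup could be as simple as this, instead of pasting the URL together (a sketch, not the code at the link above):

```r
## 'links' is the data.frame of links kept from the spreadsheet feed
cellsfeed_url <- function(links) {
  links$href[grepl("#cellsfeed$", links$rel)]
}
```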

Conversions between different row/column ID systems

I rewrote the helper functions that convert column IDs from letter form (e.g., "AB") to numeric (e.g., 28) and back again. They are now vectorized and have no for or while loops.
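The letter-to-number direction is roughly this (a simplified sketch, not the actual rewritten helpers):

```r
letter_to_num <- function(x) {
  vapply(strsplit(toupper(x), ""), function(chars) {
    sum(match(chars, LETTERS) * 26 ^ rev(seq_along(chars) - 1))
  }, numeric(1))
}

letter_to_num(c("A", "Z", "AA", "AB"))  # 1 26 27 28
```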

Rigorous vocabulary

When in doubt about what to call some thing or activity, try to find precedent in the API docs or in the stuff returned by the server. For example, we should talk about spreadsheet and worksheet titles, not names. We must also be careful around the use of the word ID. This is why we now talk about "consuming" the data.

Testing

Do we have one or more specific spreadsheets we are testing against? If so, instead of hard-wiring unreadable spreadsheet keys into various tests, the keys (or URLs or titles or whatever) should be stored in one central place, with explanatory comments, and utilized throughout the tests.

UPDATE: I've created a helper.R file inside the testthat directory where I initialize various objects for using gspreadr-controlled spreadsheets in tests.

The helper.R file also implements a hack for logging into Google as the gspreadr user, so we can test against the API as an authenticated user.
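The gist of helper.R is simply to centralize the test fixtures (the contents below are illustrative placeholders, not the real keys or titles):

```r
## tests/testthat/helper.R -- sourced automatically by testthat before the tests run

## gspreadr-controlled spreadsheets used throughout the tests
pts_title <- "public-testing-sheet"             # placeholder title
pts_key   <- "KEY-OF-PUBLIC-TESTING-SHEET"      # placeholder, not a real key

## plus the login hack so tests run as the authenticated gspreadr user (details omitted)
```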