nealrichardson / httptest

A Test Environment for HTTP Requests in R
https://enpiar.com/r/httptest/

Feature: Compressed Cache #11

Open mwaldstein opened 6 years ago

mwaldstein commented 6 years ago

Problem: For my package, I'm parsing poorly structured HTML, which means I need to test a large number of pages, which in turn produces a very large cache (48 MB and ever increasing).

Human readability of the cache is not required in my case, so compressing the cache would yield a huge savings in disk space and package size.

Solution Outline: Reading the cache would include an additional check for a compressed version if the plain file is not present.

Writing would create a compressed version if an environment variable is set (`httpcache.compress = TRUE`?).
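
To make the proposal concrete, here's a rough sketch of what I have in mind. The helper names are hypothetical (not httptest's API), and the option name is just the suggestion above:

# Illustrative sketch only: these helpers are hypothetical, not httptest API.

# Reading: fall back to a gzip-compressed copy if the plain .R file is absent.
read_cached_response <- function (file) {
    if (file.exists(file)) {
        return(source(file)$value)
    }
    gz <- paste0(file, ".gz")
    if (file.exists(gz)) {
        con <- gzfile(gz, "rt")
        on.exit(close(con))
        # Cache files are plain R, so parse and evaluate the decompressed text
        return(eval(parse(text = readLines(con, warn = FALSE))))
    }
    stop("No cached response found for ", file)
}

# Writing: after recording, optionally replace the file with a gzipped copy
# when the (hypothetical) option is set, e.g. options(httpcache.compress = TRUE)
compress_cached_response <- function (file) {
    if (isTRUE(getOption("httpcache.compress"))) {
        writeLines(readLines(file, warn = FALSE), gzfile(paste0(file, ".gz")))
        file.remove(file)
    }
}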

Alternatives: My plan B is to use httptest only for dev testing and to run all CRAN testing against manually downloaded copies.

I can do the implementation work if useful.

nealrichardson commented 6 years ago

Wow, that's a lot! Compression is an interesting idea, and I can see how one might implement that. But I wonder if compressing the recorded response cache would actually achieve your goals because:

So compression or not, you probably need to find a way to reduce the uncompressed size of the response cache. Here are some ideas for how to do that:

What do you think?

# This is the redacting function you would pass to `capture_requests`
# (or put in inst/httptest/redact.R for automatic application in the upcoming 
# package release)
redactor <- function (response) {
    require(magrittr)
    response %>%
        # Strip inline STYLE attributes (through the tag's closing ">")
        gsub_response("[ \n]STYLE=.*?>", ">", ignore.case=TRUE) %>%
        # Collapse runs of &nbsp; entities into a single space
        gsub_response("(&nbsp;)+", " ")
}

# Re-redact the already-recorded cache files in place
files <- dir(path="tests/cache/data", pattern="\\.R$", recursive=TRUE, full.names=TRUE)
lapply(files, function (f) {
    # Load the recorded response object, apply the redactor, and write it back
    r <- source(f)$value
    r <- redactor(r)
    save_response(r, f, simplify=FALSE)
})
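
For new recordings, you could also apply the redactor at capture time instead of rewriting files after the fact. A rough sketch, assuming httr for the request, the set_redactor() hook from the newer httptest releases, and a placeholder URL:

# Minimal sketch: apply the redactor when recording new fixtures
library(httptest)
library(httr)

set_redactor(redactor)
capture_requests({
    GET("http://example.com/some-page.html")
})
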
mwaldstein commented 6 years ago

Thanks for all the investigation!

Thinking this through, I'm realizing that to some extent I'm "abusing" httptest by using it to manage my test data: the (large) files I'm testing are actually static, but in most workflows they are fetched online, which is why I was using httptest.

The right solution for me will be to switch to manually saving the documents, with careful use of `.Rbuildignore` and `skip_on_cran()` once I hit the package size limit.
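
Roughly, I'm picturing something like this sketch, assuming testthat and xml2 and a fixtures/ directory of manually saved pages (the file name and XPath are placeholders):

# Test against a manually saved page, skipped on CRAN so the
# large fixture can stay out of the built package
library(testthat)

test_that("parsing a saved listing page still works", {
    skip_on_cran()
    html_file <- test_path("fixtures", "listing-page.html")
    skip_if_not(file.exists(html_file))
    doc <- xml2::read_html(html_file)
    expect_gt(length(xml2::xml_find_all(doc, "//table//tr")), 0)
})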

Part of the reason for quite so many tests (and my hesitancy about redacting) is that the particular corner of the package where this is an issue has a lot of very fragile, finicky XPath; as I fix defects, it is really common to introduce regressions. The volume of tests, constantly running thanks to live testing, keeps me honest...

Thanks for the help thinking this through!