nealrichardson / httptest

A Test Environment for HTTP Requests in R
https://enpiar.com/r/httptest/

Feature: Compressed Cache #11

Open mwaldstein opened 6 years ago

mwaldstein commented 6 years ago

Problem: For my package, I'm parsing poorly structured HTML, which means I need to test a large number of pages, which in turn produces a very large cache (48 MB and ever increasing).

Human readability of the cache is not required in my case, so compressing the cache would yield a huge savings in disk space and package size.

Solution Outline: Reading the cache would include an additional check for a compressed version if the plain file is not present.

Writing would create a compressed version if an environment variable is set (`httpcache.compress = TRUE`?).
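
To make the proposal concrete, here's a rough sketch of what I have in mind. The helper names are hypothetical (not httptest's API), and the option name is just the suggestion above:

# Illustrative sketch only: these helpers are hypothetical, not httptest API.

# Reading: fall back to a gzip-compressed copy if the plain .R file is absent.
read_cached_response <- function (file) {
    if (file.exists(file)) {
        return(source(file)$value)
    }
    gz <- paste0(file, ".gz")
    if (file.exists(gz)) {
        con <- gzfile(gz, "rt")
        on.exit(close(con))
        # Cache files are plain R, so parse and evaluate the decompressed text
        return(eval(parse(text = readLines(con, warn = FALSE))))
    }
    stop("No cached response found for ", file)
}

# Writing: after recording, optionally replace the file with a gzipped copy
# when the (hypothetical) option is set, e.g. options(httpcache.compress = TRUE)
compress_cached_response <- function (file) {
    if (isTRUE(getOption("httpcache.compress"))) {
        writeLines(readLines(file, warn = FALSE), gzfile(paste0(file, ".gz")))
        file.remove(file)
    }
}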

Alternatives: My plan B is to use httptest only for dev testing and to run all CRAN testing against manually downloaded copies.

I can do the implementation work if useful.

nealrichardson commented 6 years ago

Wow, that's a lot! Compression is an interesting idea, and I can see how one might implement that. But I wonder if compressing the recorded response cache would actually achieve your goals because:

So compression or not, you probably need to find a way to reduce the uncompressed size of the response cache. Here are some ideas for how to do that:

What do you think?

# This is the redacting function you would pass to `capture_requests`
# (or put in inst/httptest/redact.R for automatic application in the upcoming 
# package release)
redactor <- function (response) {
    require(magrittr)
    response %>%
        # Strip inline STYLE attributes (through the tag's closing ">")
        gsub_response("[ \n]STYLE=.*?>", ">", ignore.case=TRUE) %>%
        # Collapse runs of &nbsp; entities into a single space
        gsub_response("(&nbsp;)+", " ")
}

# Re-redact the already-recorded cache files in place
files <- dir(path="tests/cache/data", pattern="\\.R$", recursive=TRUE, full.names=TRUE)
lapply(files, function (f) {
    # Load the recorded response object, apply the redactor, and write it back
    r <- source(f)$value
    r <- redactor(r)
    save_response(r, f, simplify=FALSE)
})
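
For new recordings, you could also apply the redactor at capture time instead of rewriting files after the fact. A rough sketch, assuming httr for the request, the set_redactor() hook from the newer httptest releases, and a placeholder URL:

# Minimal sketch: apply the redactor when recording new fixtures
library(httptest)
library(httr)

set_redactor(redactor)
capture_requests({
    GET("http://example.com/some-page.html")
})
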
mwaldstein commented 6 years ago

Thanks for all the investigation!

Thinking this through, I'm realizing that to some extent I'm "abusing" httptest by using it to manage my test data: the (large) files I'm testing are actually static, but in most workflows they are fetched online, which is why I was using httptest.

The right solution for me will be to switch to manually saving the documents, with careful use of `.Rbuildignore` and `skip_on_cran()` once I hit the package size limit.
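
Roughly, I'm picturing something like this sketch, assuming testthat and xml2 and a fixtures/ directory of manually saved pages (the file name and XPath are placeholders):

# Test against a manually saved page, skipped on CRAN so the
# large fixture can stay out of the built package
library(testthat)

test_that("parsing a saved listing page still works", {
    skip_on_cran()
    html_file <- test_path("fixtures", "listing-page.html")
    skip_if_not(file.exists(html_file))
    doc <- xml2::read_html(html_file)
    expect_gt(length(xml2::xml_find_all(doc, "//table//tr")), 0)
})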

Part of the reason for quite so many tests (and my hesitancy about redacting) is that the particular corner of the package where this is an issue has a lot of very fragile, finicky XPath; as I fix defects, it is really common to introduce regressions. The volume of tests, constantly running thanks to live testing, keeps me honest...

Thanks for the help thinking this through!