Open mwaldstein opened 6 years ago
Wow, that's a lot! Compression is an interesting idea, and I can see how one might implement that. But I wonder if compressing the recorded response cache would actually achieve your goals, because `R CMD build` already generates a gzipped file, so the CRAN submission is already compressed and won't compress further if you compress individual files within it.
So compression or not, you probably need to find a way to reduce the uncompressed size of the response cache. Here are some ideas for how to do that:
Using `gsub_response` (which will be included in the upcoming httptest release, though this is possible in the current version as well), I was able to reduce the size of the uncompressed cache directory by over half (22.6 MB, down from 49.6 MB) and cut about 20% off the built (compressed) package size. The code I used to apply this to the cached files in the current repo is below. All but 4 tests still pass. Since you understand better how you're extracting data from the HTML, you could probably squeeze even more out this way (and fix/alter the failing tests). What do you think?
# This is the redacting function you would pass to `capture_requests`
# (or put in inst/httptest/redact.R for automatic application in the upcoming
# package release)
library(httptest)

redactor <- function (response) {
    require(magrittr)
    # First drop inline STYLE attributes, then collapse runs of spaces
    response %>%
        gsub_response("[ \n]STYLE=.*?>", ">", ignore.case=TRUE) %>%
        gsub_response("( )+", " ")
}

# Redact every cached response in place and save it back to the same file
files <- dir(path="tests/cache/data", pattern="\\.R$", recursive=TRUE, full.names=TRUE)
lapply(files, function (f) {
    r <- source(f)$value
    r <- redactor(r)
    save_response(r, f, simplify=FALSE)
})
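To see what those two substitutions actually strip, here is a minimal sketch that applies the same patterns with base `gsub` to a made-up HTML fragment. The sample string and the `strip_markup` helper are assumptions for illustration only; they are not from the package's cache or from httptest.

```r
# Hypothetical helper: applies the redactor's two patterns to a plain string
strip_markup <- function(x) {
    # Remove inline STYLE attributes, whatever their case
    x <- gsub("[ \n]STYLE=.*?>", ">", x, ignore.case = TRUE)
    # Collapse runs of spaces into a single space
    gsub("( )+", " ", x)
}

# Made-up fragment standing in for a cached page body
html <- "<TD STYLE=\"color: red;   font-size: 10pt\">Total</TD>\n<td style=\"margin:0\">   42   </td>"
cleaned <- strip_markup(html)

cat(cleaned, "\n")
cat("characters before:", nchar(html), "after:", nchar(cleaned), "\n")
```

Element names and text content survive, so XPath that selects by tag or text should keep working; anything that selects on `style` attributes or exact whitespace would not.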
Thanks for all the investigation!
Thinking this through, I'm realizing that to some extent, I'm "abusing" httptest, using it to manage my testdata - the (large) files I'm testing are in reality static, but in most workflows are fetched online which is why I was using httptest.
The right solution for me will be to switch to manually saving the documents, with careful use of .Rbuildignore and skip_on_cran once I hit the package size limit.
Part of the reason for quite so many tests (and my hesitancy at redacting) is that the particular corner of the package where this is an issue has a lot of very fragile, finicky XPath where, as I fix defects, it is really common to introduce regressions. The large volume of tests runs constantly thanks to live-test to keep me honest...
Thanks for the help thinking this through!
Problem
For my package, I'm parsing poorly structured HTML, leading to the need to test a large number of pages, in turn leading to a very large cache (48 MB and ever increasing).
Human readability of the cache is not required in my case, so there would be a huge savings in disk space and package size to compress the cache.
Solution Outline
Reading the cache would include an additional check for a compressed version if the initial file is not present.
Writing would create a compressed version if an environment variable is set (httpcache.compress = T?)
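A rough sketch of that read/write behaviour, using base R `gzfile` connections. The helper names `write_cache_file` and `read_cache_file` are hypothetical and not part of httptest, and treating `httpcache.compress` as an R option (rather than an environment variable) is an assumption.

```r
# Hypothetical helpers; not part of httptest's API.
write_cache_file <- function(lines, path,
                             compress = isTRUE(getOption("httpcache.compress"))) {
    # Write "<path>.gz" when compression is requested, plain text otherwise
    if (compress) {
        con <- gzfile(paste0(path, ".gz"), open = "w")
    } else {
        con <- file(path, open = "w")
    }
    on.exit(close(con))
    writeLines(lines, con)
    invisible(path)
}

read_cache_file <- function(path) {
    # Prefer the plain file; fall back to the compressed copy if it is absent
    if (file.exists(path)) {
        return(readLines(path))
    }
    gz_path <- paste0(path, ".gz")
    if (file.exists(gz_path)) {
        con <- gzfile(gz_path, open = "r")
        on.exit(close(con))
        return(readLines(con))
    }
    stop("No cached file found for ", path)
}

# Usage: round-trip some repetitive made-up content through a temporary cache
cache_dir <- tempfile("cache")
dir.create(cache_dir)
target <- file.path(cache_dir, "example.com.R")
body <- rep("<td>some repetitive cached markup</td>", 500)
write_cache_file(body, target, compress = TRUE)
restored <- read_cache_file(target)
```

Checking the plain file first keeps existing uncompressed caches working unchanged, which is the fallback order described above.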
Alternatives
My plan B is to use httptest only for dev testing and to run all CRAN testing on manually downloaded copies.
I can do the implementation work if useful.