Open kevinwolz opened 6 years ago
Here's some "training data" for estimating file size based simply on the number of spreadsheet cells in the output file. The above function should calculate the number of spreadsheet cells expected in the output file and then use this regression to calculate file size.
training.data <- dplyr::tibble(size.mb = c(10.3, 12.2, 3.1, 2.9, 46.6,
                                           114.3, 0.427, 28.1, 0.703, 655.7,
                                           5.1, 0.082, 167.8, 100.7, 830.7,
                                           393.6, 9.6, 92.9, 7.9, 51.9,
                                           499.4, 167.8),
                               n.cells = c(915800, 659855, 360684, 263010, 4195233,
                                           5850073, 38456, 1422127, 60214, 30547935,
                                           482000, 8942, 9900468, 6204870, 32768799,
                                           21501935, 867844, 7206738, 710127, 2965578,
                                           20968013, 7705206))
# Linear fit forced through the origin (no intercept)
lm.out <- lm(size.mb ~ n.cells - 1, data = training.data)
# Quadratic fit with an intercept
poly.out <- lm(size.mb ~ n.cells + I(n.cells^2), data = training.data)
x.pred <- seq(min(training.data$n.cells), max(training.data$n.cells), 1000)
poly.pred <- dplyr::tibble(size.mb = predict(poly.out, data.frame(n.cells = x.pred)),
n.cells = x.pred)
poly.pred.man <- dplyr::tibble(size.mb = coef(poly.out)[1] + coef(poly.out)[2] * x.pred + coef(poly.out)[3] * x.pred ^ 2,
n.cells = x.pred)
library(ggplot2)
ggplot(training.data, aes(x = n.cells, y = size.mb)) +
  labs(x = "Number of cells in file", y = "File size (MB)") +
geom_point() +
geom_abline(slope = coef(lm.out)) +
geom_line(data = poly.pred, color = "red") +
geom_line(data = poly.pred.man, color = "blue", linetype = "dashed")
summary(poly.out)
coef(poly.out)
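As a rough sketch of how the regression could feed the size estimate, something like the following might work. The helper name `estimate_file_size` and its `n.rows` / `n.cols` arguments are hypothetical (not part of the package); the real function would need to derive the expected cell count from the simulation settings. The data and the quadratic formula are the same as above.

```r
# Training data copied from above (file size in MB vs. number of cells).
training.data <- data.frame(
  size.mb = c(10.3, 12.2, 3.1, 2.9, 46.6,
              114.3, 0.427, 28.1, 0.703, 655.7,
              5.1, 0.082, 167.8, 100.7, 830.7,
              393.6, 9.6, 92.9, 7.9, 51.9,
              499.4, 167.8),
  n.cells = c(915800, 659855, 360684, 263010, 4195233,
              5850073, 38456, 1422127, 60214, 30547935,
              482000, 8942, 9900468, 6204870, 32768799,
              21501935, 867844, 7206738, 710127, 2965578,
              20968013, 7705206)
)

# Same quadratic model as poly.out above.
poly.out <- lm(size.mb ~ n.cells + I(n.cells^2), data = training.data)

# Hypothetical helper: estimate output file size (MB) from the expected
# number of rows and columns. Both arguments are assumptions made for
# illustration only.
estimate_file_size <- function(n.rows, n.cols, model = poly.out) {
  # Convert to double first so large integer inputs cannot overflow.
  n.cells <- as.numeric(n.rows) * as.numeric(n.cols)
  est <- unname(predict(model, newdata = data.frame(n.cells = n.cells)))
  # A quadratic with an intercept can go negative for very small files,
  # so clamp the estimate at zero.
  pmax(est, 0)
}

estimate_file_size(n.rows = 100000, n.cols = 50)
```

The estimate is only meaningful inside the range of the training data (roughly 9 thousand to 33 million cells); extrapolating a quadratic beyond that range is unreliable.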
Here's a start: