Cannot allocate vector size --- any suggestion dealing with the India DHS?

liuyanguu commented 2 years ago

Many thanks for the great package, it runs very fast! I am working on calculating U5MR from birth histories for the latest India DHS (2019). The India DHS datasets are much larger than those of any other DHS: the input file is over 100 MB even after selecting only the columns we need.

Running calc_nqx easily triggers an error like "cannot allocate vector of size 14.5 Gb". Any advice on handling this would be much appreciated.

This is not a reproducible example, but it gives the idea:

library("demogsurv")
library("data.table")
dir_sav_input_bh <- "...."  # path to the birth-history .sav file
dt1 <- setDT(foreign::read.spss(dir_sav_input_bh, to.data.frame = TRUE, use.value.labels = FALSE))
setnames(dt1, tolower(colnames(dt1)))  # lower-case variable names
dt1[dt1 == -99] <- NA  # treat -99 as missing
dt1[, `:=`(death = (b5 == 0), dod = b3 + b7 + 0.5)]  # death indicator and date of death
u5mr <- calc_nqx(dt1, strata = ~v022)

The input data I used can be downloaded from Dropbox (100 MB): Dropbox link to the file

jeffeaton commented 2 years ago

Thanks very much. I haven't worked with the India DHS very much, but I'm not too surprised it runs into a memory issue.

If you still have this open / are able to reproduce again, can you call traceback() after the error to see exactly where it is occurring?

It might be failing in the calls to the survey package. If so, you could try the jackknife standard error instead:

u5mr <- calc_nqx(dt1, strata = ~v022, varmethod = "jk1")

liuyanguu commented 2 years ago

Thank you so much for your quick response! Here is what traceback() returns:

Error: cannot allocate vector of size 14.5 Gb
> traceback()
4: double(osize)
3: pyears(formula, data, scale = scale, data.frame = TRUE, weights = weights)
2: demog_pyears(f, mf, period = period, agegr = agegr, tips = tips, 
       event = "(death)", tstart = "(dob)", tstop = "(tstop)", weights = "(weights)", 
       origin = origin, scale = scale)
1: calc_nqx(dt1, strata = ~v022, varmethod = "jk1")

It looks like it fails at the very first step in calc_nqx, the call to survival::pyears?
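For a sense of scale, the size in the error message can be converted into a number of array cells. This is only back-of-the-envelope arithmetic, assuming the failing double(osize) call allocates one 8-byte double per cell and that R reports Gb as 1024^3 bytes; the variable names are illustrative.

```r
# Convert the reported allocation size into a count of doubles.
bytes_requested  <- 14.5 * 1024^3  # "14.5 Gb" from the error message (rounded)
bytes_per_double <- 8              # a double takes 8 bytes
cells <- bytes_requested / bytes_per_double
cells  # roughly 1.9e9 cells requested in a single vector
```

A single vector of that length is far larger than the 100 MB input, which suggests the size is driven by the dimensions of the output table rather than by the number of input rows.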

liuyanguu commented 2 years ago

Indeed, the error is raised by pyears:

>   pyears(formula, data, scale=scale, data.frame=TRUE, weights=weights)
Error in rowSums(is.na(unclass(x))) : 
  'Calloc' could not allocate memory (1274250 of 16 bytes)
> traceback()
11: rowSums(is.na(unclass(x)))
10: as.vector(rowSums(is.na(unclass(x))) > 0)
9: is.na.Surv(x)
8: is.na(x)
7: na.omit.data.frame(structure(list(`Surv(`(tstop)` - `(dob)`, `(death)`)` = structure(c(106, 
   135, 173, 39, 112, 175, 190, 6, 6, 0.690000000000055, 0.650000000000091, 
   0.619999999999891, 53, 15, 111, 138, 164, 18, 77, 101, 127, 78, 
   98, 218, 279, 306, 325, 358, 27, 76, 0.5, 28, 40, 68, 90, 65, 
   80, 102, 79, 85, 124, 147, 273, 2, 139, 175, 208, 238, 194, 226, 
   237, 271, 130, 164, 209, 237, 47, 63, 6, 43, 43, 118, 149, 210, 
   234, 269, 309, 8, 76, 105, 130, 23, 45, 111, 160, 187, 159, 199, 
   220, 0.539999999999964, 17, 43, 75, 126, 2.5, 49, 89, 23, 63, 
   72.5, 193, 222, 122, 168, 210, 245, 101, 146, 174, 216.5, 91, 
   113, 174, 236, 269, 279, 307, 124, 242, 269, 319, 78, 103, 125, 
   52, 92, 139, 79, 98, 125, 21, 43, 77, 103, 124, 150, 140, 172, 
   199, 221, 231, 259, 272, 100, 117, 135, 163, 192, 157, 208, 249, 
   84, 141, 186, 226, 249, 29, 56, 172, 206, 250, 270, 7, 77, 106, 
   150, 21, 43, 163, 214, 223, 250, 18, 256, 293, 115, 161, 262, 
   262, 283, 308, 45, 117, 1, 152, 38, 66, 1, 133, 176, 209, 68, 
   100, 137, 112, 138, 157, 188, 29, 87, 114, 69, 116, 106, 133, 
   187, 44, 66, 53, 93, 32, 233, 0.740000000000009, 298, 319, 245, 
   266, 296, 320, 141, 169, 141, 162, 211, 218, 28, 230, 256, 293, 
   332, 32, 62, 102, 136, 52, 85, 110, 204, 171, 212, 75, 101, 41, 
   81, 103, 34, 65, 92, 143, 214, 57, 90, 283, 344, 134, 34, 80, 
    ...
6: na.omit(structure(list(`Surv(`(tstop)` - `(dob)`, `(death)`)` = structure(c(106, 
   135, 173, 39, 112, 175, 190, 6, 6, 0.690000000000055, 0.650000000000091, 
   0.619999999999891, 53, 15, 111, 138, 164, 18, 77, 101, 127, 78, 
   98, 218, 279, 306, 325, 358, 27, 76, 0.5, 28, 40, 68, 90, 65, 
   80, 102, 79, 85, 124, 147, 273, 2, 139, 175, 208, 238, 194, 226, 
   237, 271, 130, 164, 209, 237, 47, 63, 6, 43, 43, 118, 149, 210, 
   234, 269, 309, 8, 76, 105, 130, 23, 45, 111, 160, 187, 159, 199, 
   220, 0.539999999999964, 17, 43, 75, 126, 2.5, 49, 89, 23, 63, 
   72.5, 193, 222, 122, 168, 210, 245, 101, 146, 174, 216.5, 91, 
   113, 174, 236, 269, 279, 307, 124, 242, 269, 319, 78, 103, 125, 
   52, 92, 139, 79, 98, 125, 21, 43, 77, 103, 124, 150, 140, 172, 
   199, 221, 231, 259, 272, 100, 117, 135, 163, 192, 157, 208, 249, 
   84, 141, 186, 226, 249, 29, 56, 172, 206, 250, 270, 7, 77, 106, 
   150, 21, 43, 163, 214, 223, 250, 18, 256, 293, 115, 161, 262, 
   262, 283, 308, 45, 117, 1, 152, 38, 66, 1, 133, 176, 209, 68, 
   100, 137, 112, 138, 157, 188, 29, 87, 114, 69, 116, 106, 133, 
   187, 44, 66, 53, 93, 32, 233, 0.740000000000009, 298, 319, 245, 
   266, 296, 320, 141, 169, 141, 162, 211, 218, 28, 230, 256, 293, 
   332, 32, 62, 102, 136, 52, 85, 110, 204, 171, 212, 75, 101, 41, 
   81, 103, 34, 65, 92, 143, 214, 57, 90, 283, 344, 134, 34, 80, 
    ...
5: model.frame.default(formula = formula, data = data, weights = weights)
4: stats::model.frame(formula = formula, data = data, weights = weights)
3: eval(tform, parent.frame())
2: eval(tform, parent.frame())
1: pyears(formula, data, scale = scale, data.frame = TRUE, weights = weights)

jeffeaton commented 2 years ago

Hi @liuyanguu,

Thanks for this, very helpful. In the branch issue-15 I have changed calc_nqx() to process the data through demog_pyears() in batches to avoid the memory allocation error (the default is batch_size = 100000).
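The general idea of batching can be sketched in base R as follows. This is not the demogsurv implementation: batched_rowsum() is a hypothetical helper written for illustration, and the data are simulated. It aggregates one chunk of rows at a time and then sums the partial results, so only one batch is being worked on at any moment.

```r
# Hypothetical helper (not part of demogsurv): total `values` by `groups`,
# processing at most `batch_size` rows at a time.
batched_rowsum <- function(values, groups, batch_size = 100000) {
  n <- length(values)
  # Split row indices 1..n into consecutive chunks of at most batch_size
  batches <- split(seq_len(n), ceiling(seq_len(n) / batch_size))
  partial <- lapply(batches, function(idx) {
    rs <- rowsum(values[idx], groups[idx])  # group totals within this batch
    data.frame(group = rownames(rs), total = unname(rs[, 1]),
               row.names = NULL, stringsAsFactors = FALSE)
  })
  combined <- do.call(rbind, partial)
  # Add up the per-batch totals for each group
  tapply(combined$total, combined$group, sum)
}

set.seed(1)
pys    <- runif(250000)  # simulated person-years
period <- sample(c("2019", "2020", "2021"), 250000, replace = TRUE)

batched   <- batched_rowsum(pys, period, batch_size = 100000)
unbatched <- tapply(pys, period, sum)
all.equal(as.numeric(batched), as.numeric(unbatched))
```

Because totals are additive across rows, the batched and unbatched results agree up to floating-point rounding, which is why all.equal() is used for the comparison rather than identical().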

Could you try installing that branch and testing again with your India example? devtools::install_github("mrc-ide/demogsurv@issue-15")

Do you need to do any of the other calculations (e.g. fertility) on the India data set? It might be an issue for those as well.

Thanks, Jeff

liuyanguu commented 2 years ago

Wonderful! Thank you so much for such a prompt reply! It works. I have one more question: I see the latest available period is 2021. What reference period does it refer to? Does it cover all the deaths that happened in calendar year 2021 (Jan.-Dec.)?

jeffeaton commented 2 years ago

Yes, that refers to calendar year 2021. You can adjust the time splits using the period argument.
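As a rough illustration of how break points split calendar time, here is a base-R sketch. It assumes the breaks define left-closed intervals labelled by their start, and the break values and event times are made up. The commented-out calc_nqx() call shows an assumed form of the period argument (a numeric vector of break points); check ?calc_nqx for the exact usage.

```r
# Made-up break points and decimal event times, for illustration only.
breaks     <- c(2015, 2018, 2021, 2022)
event_time <- c(2015.0, 2017.99, 2020.5, 2021.45)

# Left-closed intervals: an event at 2021.45 falls in [2021,2022).
interval <- cut(event_time, breaks = breaks, right = FALSE, dig.lab = 4)
data.frame(event_time, interval)

# Assumed form of the call, not verified here:
# u5mr_by_period <- calc_nqx(dt1, strata = ~v022, period = breaks)
```

Under that assumption, the interval starting at 2021 covers events from January through December 2021, which matches the calendar-year reading above.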

Thanks, Jeff

mrc-ide/demogsurv, issue #15