mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

system2: Cannot allocate memory #259

Closed: nick-youngblut closed this issue 4 years ago

nick-youngblut commented 4 years ago

I'm getting the following error, but only when running many (n=100) essentially identical jobs (random permutations of the same analysis); running just a few of these permutation jobs produces no such error:

Error in system2(command = sys.cmd, args = sys.args, stdin = stdin, stdout = TRUE, : cannot popen ''qstat' -u $USER -s rs 2>&1', probable reason 'Cannot allocate memory'
Traceback:

1. nperm %>% seq %>% as.list() %>% future_map(procrustes_perm, phy1 = gen_phy, 
 .     phy2 = trt_jac, ntaxa = 10000)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(quote(`_fseq`(`_lhs`)), env, env)
5. `_fseq`(`_lhs`)
6. freduce(value, `_function_list`)
7. withVisible(function_list[[k]](value))
8. function_list[[k]](value)
9. future_map(., procrustes_perm, phy1 = gen_phy, phy2 = trt_jac, 
 .     ntaxa = 10000)
10. future_map_template(purrr::map, "list", .x, .f, ..., .progress = .progress, 
  .     .options = .options)
11. multi_resolve(fs, names(.x))
12. values(fs)
13. values.list(fs)
14. resolve(y, result = TRUE, stdout = stdout, signal = signal, force = TRUE)
15. resolve.list(y, result = TRUE, stdout = stdout, signal = signal, 
  .     force = TRUE)
16. value(obj, stdout = FALSE, signal = FALSE)
17. value.Future(obj, stdout = FALSE, signal = FALSE)
18. result(future)
19. result.BatchtoolsFuture(future)
20. await(future, cleanup = FALSE)
21. await.BatchtoolsFuture(future, cleanup = FALSE)
22. status(future)
23. status.BatchtoolsFuture(future)
24. get_status(reg = reg, ids = jobid)
25. batchtools::getStatus(...)
26. getStatusTable(convertIds(reg, ids), reg = reg)
27. merge(filter(reg$status, ids), batch.ids, by = "batch.id", all.x = TRUE, 
  .     all.y = FALSE, sort = FALSE)
28. merge.data.table(filter(reg$status, ids), batch.ids, by = "batch.id", 
  .     all.x = TRUE, all.y = FALSE, sort = FALSE)
29. is.data.table(y)
30. getBatchIds(reg = reg)
31. unique(cf$listJobsRunning(reg))
32. cf$listJobsRunning(reg)
33. listJobs(reg, c("-u $USER", "-s rs"))
34. runOSCommand("qstat", args, nodename = nodename)
35. suppressWarnings(system2(command = sys.cmd, args = sys.args, 
  .     stdin = stdin, stdout = TRUE, stderr = TRUE, wait = TRUE))
36. withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning"))
37. system2(command = sys.cmd, args = sys.args, stdin = stdin, stdout = TRUE, 
  .     stderr = TRUE, wait = TRUE)
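
For reference, the futures are set up roughly like this (a sketch only: the SGE template file and resource list are placeholders, and procrustes_perm() is my own permutation function):

library(dplyr)               # provides %>%
library(future)
library(future.batchtools)
library(furrr)

## placeholder template + resources; real values depend on the cluster setup
plan(batchtools_sge,
     template  = "batchtools.sge.tmpl",
     resources = list(h_vmem = "80G"))

nperm <- 100
perm_res <- nperm %>% seq %>% as.list() %>%
    future_map(procrustes_perm, phy1 = gen_phy, phy2 = trt_jac, ntaxa = 10000)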

The fact that the error occurs only when a large number of jobs are run makes me think that the memory error concerns the final aggregation of the data, but the R objects that are returned from each job should be relatively small.

Also, the error doesn't make much sense to me, given that a qstat call shouldn't require much memory, especially on a server with 1 TB of RAM. To be clear, each SGE job I submit has 80 GB allocated and needs less than half of that (the allocation is just a safety margin), and submitting only a few of the exact same jobs doesn't trigger this "out of memory" error.
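
For comparison, this is the qstat invocation that batchtools builds in the traceback above; run on its own it is just a small external process:

## same command/args as runOSCommand() -> system2() in the traceback above;
## system2() runs it through a shell, so $USER is expanded there
out <- suppressWarnings(
    system2("qstat", args = c("-u $USER", "-s rs"), stdout = TRUE, stderr = TRUE)
)
head(out)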

For now, I'm splitting my future_map() jobs into batches of 10. These batches finish without any errors, unlike a single run of all 100 jobs together.
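
Roughly, the batching looks like this (a sketch; the batch size of 10 is arbitrary, and procrustes_perm() and the input objects are from my own analysis):

library(furrr)

nperm    <- 100
perm_ids <- as.list(seq(nperm))                                # 100 permutation ids
batches  <- split(perm_ids, ceiling(seq_along(perm_ids) / 10)) # 10 batches of 10

perm_res <- lapply(batches, function(batch) {
    future_map(batch, procrustes_perm,
               phy1 = gen_phy, phy2 = trt_jac, ntaxa = 10000)
})
perm_res <- unlist(perm_res, recursive = FALSE)                # flatten back to one list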

sessionInfo:

R version 3.6.2 (2019-12-12)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS

Matrix products: default
BLAS/LAPACK: /ebio/abt3_projects/Georg_animal_feces/envs/phyloseq-phy/lib/libopenblasp-r0.3.7.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] furrr_0.1.0             purrr_0.3.3             future.batchtools_0.8.1
 [4] future_1.15.1           openblasctl_0.1-0       paco_0.4.1             
 [7] LeyLabRMisc_0.1.3       ape_5.3                 tidytable_0.3.2        
[10] data.table_1.12.8       ggplot2_3.2.1           tidyr_1.0.0            
[13] dplyr_0.8.3            

loaded via a namespace (and not attached):
 [1] maps_3.3.0              jsonlite_1.6            splines_3.6.2          
 [4] foreach_1.4.7           gtools_3.8.1            assertthat_0.2.1       
 [7] expm_0.999-4            animation_2.6           base64url_1.4          
[10] progress_1.2.2          globals_0.12.5          numDeriv_2016.8-1.1    
[13] pillar_1.4.3            backports_1.1.5         lattice_0.20-38        
[16] glue_1.3.1              quadprog_1.5-8          phangorn_2.5.5         
[19] uuid_0.1-2              digest_0.6.23           checkmate_1.9.4        
[22] colorspace_1.4-1        htmltools_0.4.0         Matrix_1.2-18          
[25] plyr_1.8.5              pkgconfig_2.0.3         listenv_0.8.0          
[28] scales_1.1.0            brew_1.0-6              tibble_2.1.3           
[31] combinat_0.0-8          mgcv_1.8-31             farver_2.0.2           
[34] withr_2.1.2             repr_1.0.2              lazyeval_0.2.2         
[37] mnormt_1.5-6            magrittr_1.5            crayon_1.3.4           
[40] evaluate_0.14           fs_1.3.1                doParallel_1.0.15      
[43] nlme_3.1-143            MASS_7.3-51.5           vegan_2.5-6            
[46] prettyunits_1.1.0       tools_3.6.2             hms_0.5.3              
[49] lifecycle_0.1.0         phytools_0.6-99         munsell_0.5.0          
[52] cluster_2.1.0           plotrix_3.7-7           compiler_3.6.2         
[55] clusterGeneration_1.3.4 rlang_0.4.2             grid_3.6.2             
[58] pbdZMQ_0.3-3            iterators_1.0.12        IRkernel_1.1           
[61] rappdirs_0.3.1          igraph_1.2.4.2          base64enc_0.1-3        
[64] labeling_0.3            gtable_0.3.0            codetools_0.2-16       
[67] R6_2.4.1                zeallot_0.1.0           fastmatch_1.1-0        
[70] permute_0.9-5           stringi_1.4.5           parallel_3.6.2         
[73] IRdisplay_0.7.0         Rcpp_1.0.3              vctrs_0.2.1            
[76] scatterplot3d_0.3-41    batchtools_0.9.12       tidyselect_0.2.5       
[79] coda_0.19-3 
nick-youngblut commented 4 years ago

After more testing, it seems that the issue was that the total size of the objects returned by each job was too large for the memory on my machine.
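
One way to shrink the per-job payload would be to return only the summary needed downstream rather than the full result object. A hypothetical sketch (procrustes_perm() is my own function; the $ss component assumes it returns a vegan-style Procrustes result):

## hypothetical slimmed-down worker: return only the statistic used downstream
## instead of the full result object
procrustes_perm_slim <- function(i, phy1, phy2, ntaxa) {
    res <- procrustes_perm(i, phy1 = phy1, phy2 = phy2, ntaxa = ntaxa)
    list(perm = i, ss = res$ss)   # assumes res carries a Procrustes sum of squares
}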