mllg / batchtools

Tools for computation on batch systems
https://mllg.github.io/batchtools/
GNU Lesser General Public License v3.0

grepLogs() extremely slow #255

Closed: fweber144 closed this issue 4 years ago

fweber144 commented 4 years ago

When I use batchtools::grepLogs() to check the log files for a specific text pattern, it runs for hours and I eventually have to cancel it. I have a registry with about 2.5 million jobs, so that's quite a lot, but if I check the log files manually using a conventional text editor, it takes just a few seconds. Is there a way batchtools::grepLogs() might be improved to run faster?

My sessionInfo():

R version 3.6.2 (2019-12-12)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252    LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Germany.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] batchtools_0.9.12

loaded via a namespace (and not attached):
 [1] prettyunits_1.1.1 withr_2.1.2       digest_0.6.23     crayon_1.3.4      rappdirs_0.3.1    R6_2.4.1         
 [7] backports_1.1.5   rlang_0.4.4       progress_1.2.2    stringi_1.4.5     data.table_1.12.8 rstudioapi_0.10  
[13] brew_1.0-6        checkmate_1.9.4   vctrs_0.2.2       tools_3.6.2       hms_0.5.3         compiler_3.6.2   
[19] pkgconfig_2.0.3   base64url_1.4

mllg commented 4 years ago

I'm not sure what you mean by opening log files with a conventional text editor. Do you mean that grepping over all log files from an editor is faster than using grepLogs()?

To speed things up, there are two options:

  1. Move the registry to a faster file system, such as a local SSD.
  2. The implementation in grepLogs() is naive, but as good as it gets with base R. You can try external, highly optimized tools such as ripgrep (https://github.com/BurntSushi/ripgrep) to grep for strings from the command line.
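
As a rough sketch of the second option, ripgrep can search the whole logs directory in a single call from R instead of starting one process per file. This assumes the `rg` binary is installed and on the PATH; `grep_logs_rg()` and `parse_rg_lines()` are hypothetical helper names and not part of batchtools.

```r
# Split ripgrep output lines of the form "<path>:<line number>:<text>".
# The lazy first group keeps Windows drive letters such as "C:" inside the path.
parse_rg_lines <- function(lines) {
  rx <- "^(.+?):([0-9]+):(.*)$"
  lines <- lines[grepl(rx, lines, perl = TRUE)]
  data.frame(
    file = sub(rx, "\\1", lines, perl = TRUE),
    line = as.integer(sub(rx, "\\2", lines, perl = TRUE)),
    text = sub(rx, "\\3", lines, perl = TRUE),
    stringsAsFactors = FALSE
  )
}

# Search all *.log files below `log_dir` with one ripgrep call (hypothetical helper).
grep_logs_rg <- function(log_dir, pattern) {
  if (!nzchar(Sys.which("rg"))) stop("ripgrep ('rg') was not found on the PATH")
  args <- c(
    "--no-heading", "--with-filename", "--line-number",
    "--glob", shQuote("*.log"),
    "-e", shQuote(pattern),
    shQuote(log_dir)
  )
  # rg exits with status 1 when nothing matches, which R reports as a warning
  out <- suppressWarnings(system2("rg", args, stdout = TRUE))
  parse_rg_lines(out)
}
```

A call such as `grep_logs_rg(file.path("<path_to_registry>", "logs"), "^Warning")` would then return one row per matching line, with the file path kept for later lookup.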

fweber144 commented 4 years ago

Thanks a lot for your reply. Yes, I meant opening each log file in a text editor such as Notepad++ and then using the editor's search function to search for the string (more specifically, the regular expression) that I would like to find using grepLogs().

Concerning your suggestions:

  1. The registry is already stored on a local SSD.
  2. Thanks for the suggestion. I'll have a look at alternative tools if there is no way to make grepLogs() run faster.

mllg commented 4 years ago

> Yes, I meant opening each log file in a text editor such as Notepad++ and then using the editor's search function to search for the string (more specifically, the regular expression) that I would like to find using grepLogs().

In case you missed it in the docs: you can restrict which log files are searched by passing a set of job IDs, and you can open a single log file with showLog().

fweber144 commented 4 years ago

Yes, I was aware of that feature. Perhaps I should have given the reason why I want to use grepLogs(): After running all my jobs, I want to retrieve any warning messages. The only batchtools way I found to retrieve warnings was grepLogs(). So I really have to grep through all my log files searching for the pattern "^Warning". I found the following workaround using data.table::fread():

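library(data.table)

# Collect all log files of the registry
log_files <- list.files(file.path("<path_to_registry>", "logs"), full.names = TRUE)
# Run an external grep on each log file and read the matching lines via fread()
warn_any <- lapply(log_files, function(file_name_i){
  # Files without a match yield empty output and warnings, which are suppressed here
  suppressWarnings({
    grep_warn <- fread(
      # shQuote() protects paths that contain spaces
      cmd = paste("grep \"^Warning\"", shQuote(file_name_i)),
      sep = NULL,
      header = FALSE,
      col.names = "warn_message"
    )
  })
  # Remember which log file the warnings came from
  grep_warn[, file_name := file_name_i]
  return(grep_warn)
})
# Combine the per-file results into one table
warn_any <- rbindlist(warn_any, fill = TRUE)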

For me, this is a lot faster than grepLogs(pattern = "^Warning"). Similar speed to the data.table::fread() solution is obtained using base::system() in combination with grep (see this thread on SO):

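# Locate the grep executable that ships with Rtools (Windows)
rtools_path <- pkgbuild::rtools_path()
grep_path <- file.path(rtools_path, "grep.exe")
log_files <- list.files(file.path("<path_to_registry>", "logs"), full.names = TRUE)
# Returns one character vector of matching lines per log file
warn_any <- lapply(log_files, function(file_name_i){
  # shQuote() protects paths that contain spaces
  sys_command <- paste(shQuote(grep_path), "\"^Warning\"", shQuote(file_name_i))
  # grep exits with status 1 if nothing matches, which system() reports as a warning
  suppressWarnings({
    system(sys_command, intern = TRUE)
  })
})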

A downside of these two workarounds is that it's not so easy to get the job IDs corresponding to the retrieved warnings.
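
One possible way around this, sketched under assumptions: if each log file is named `<job.hash>.log` and a table with `job.id` and `job.hash` columns is available (in batchtools this information appears to be held in the registry's status table, but that should be verified for the version in use), the file names can be joined back to job IDs. `attach_job_ids()` is a hypothetical helper, and the hashes and paths below are made-up toy data.

```r
# Join warnings (one row per matching line, with a `file_name` column as in
# the fread() workaround) to job IDs via the hash encoded in the file name.
attach_job_ids <- function(warn, status) {
  warn <- as.data.frame(warn, stringsAsFactors = FALSE)
  status <- as.data.frame(status, stringsAsFactors = FALSE)
  # ".../logs/<job.hash>.log" -> "<job.hash>" (naming scheme is an assumption)
  warn$job.hash <- sub("\\.log$", "", basename(warn$file_name))
  merge(status[, c("job.id", "job.hash")], warn, by = "job.hash")
}

# Toy stand-ins for the real tables
status <- data.frame(
  job.id = 1:3,
  job.hash = c("job1a", "job2b", "job3c"),
  stringsAsFactors = FALSE
)
warn_any <- data.frame(
  warn_message = c("Warning message:", "Warning in log(-1) : NaNs produced"),
  file_name = c("registry/logs/job1a.log", "registry/logs/job3c.log"),
  stringsAsFactors = FALSE
)
res <- attach_job_ids(warn_any, status)
```

If several jobs were submitted as one chunk they may share a log file, in which case a warning would be attributed to every job ID with that hash rather than to a single job.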