tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

Segfault when calling `read_tsv()` on an HPC cluster #533

Open smped opened 6 months ago

smped commented 6 months ago

Hi,

I'm having an issue with `read_tsv()` that appears to be the segfault reported here: https://github.com/tidyverse/vroom/issues/510

I'm calling the function inside a conda environment on an HPC cluster. Reading the file interactively in that conda environment on the head node works fine, but when the same code runs as a job on the cluster I get a segfault every time. Debugging this is well beyond my skill level.

The error I see in my log files is:

```
*** caught segfault ***
address (nil), cause 'memory not mapped'

Traceback:
 1: vroom_(file, delim = delim %||% col_types$delim, col_names = col_names,     col_types = col_types, id = id, skip = skip, col_select = col_select,     name_repair = .name_repair, na = na, quote = quote, trim_ws = trim_ws,     escape_double = escape_double, escape_backslash = escape_backslash,     comment = comment, skip_empty_rows = skip_empty_rows, locale = locale,     guess_max = guess_max, n_max = n_max, altrep = vroom_altrep(altrep),     num_threads = num_threads, progress = progress)
 2: vroom::vroom(file, delim = "\t", col_names = col_names, col_types = col_types,     col_select = {        {            col_select        }    }, id = id, .name_repair = name_repair, skip = skip, n_max = n_max,     na = na, quote = quote, comment = comment, skip_empty_rows = skip_empty_rows,     trim_ws = trim_ws, escape_double = TRUE, escape_backslash = FALSE,     locale = locale, guess_max = guess_max, show_col_types = show_col_types,     progress = progress, altrep = lazy, num_threads = num_threads)
 3: fn(x)
 4: FUN(X[[i]], ...)
 5: lapply(rna_files, function(x) {    ln <- readLines(x, 1)    fn <- paste0("read_", ifelse(grepl("\\t", ln), "tsv", "csv"))    fn <- match.fun(fn)    df <- fn(x)    gn_col <- intersect(c("gene_id", "Geneid"), names(df))[[1]]    fc_col <- intersect(c("logFC", "logfc"), names(df))[[1]]    fdr_col <- intersect(c("fdr", "FDR", "adjP", "adj_p"), names(df))[[1]]    dplyr::select(df, gene_id = !!sym(gn_col), logFC = !!sym(fc_col),         FDR = !!sym(fdr_col))})
 6: lapply(rna_files, function(x) {    ln <- readLines(x, 1)    fn <- paste0("read_", ifelse(grepl("\\t", ln), "tsv", "csv"))    fn <- match.fun(fn)    df <- fn(x)    gn_col <- intersect(c("gene_id", "Geneid"), names(df))[[1]]    fc_col <- intersect(c("logFC", "logfc"), names(df))[[1]]    fdr_col <- intersect(c("fdr", "FDR", "adjP", "adj_p"), names(df))[[1]]    dplyr::select(df, gene_id = !!sym(gn_col), logFC = !!sym(fc_col),         FDR = !!sym(fdr_col))})
```
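
One way to check whether the crash is specific to vroom might be to pull the anonymous function from the traceback into a named helper that logs each path before reading it and lets the reader be swapped. The sketch below is only an illustration: `read_rna_file` and its `readers` argument are hypothetical names, and it deliberately substitutes base R's `read.delim()`/`read.csv()` for readr, so it is a diagnostic and not a fix for the segfault itself.

```r
# Hypothetical helper (not part of readr or vroom): mirrors the column-matching
# logic of the anonymous function in the traceback, with swappable readers.
read_rna_file <- function(path,
                          readers = list(tsv = utils::read.delim,
                                         csv = utils::read.csv)) {
  # Detect the delimiter from the first line, as in the original code
  first_line <- readLines(path, n = 1)
  type <- if (grepl("\t", first_line, fixed = TRUE)) "tsv" else "csv"

  # message() writes to stderr, so the path should reach the job log
  # before any crash inside the reader
  message("Reading ", path, " as ", type)
  df <- readers[[type]](path)

  # Return the first candidate column name that is present in the file
  pick <- function(candidates) {
    hit <- intersect(candidates, names(df))
    if (length(hit) == 0) {
      stop("None of ", paste(candidates, collapse = ", "), " found in ", path)
    }
    hit[[1]]
  }

  data.frame(
    gene_id = df[[pick(c("gene_id", "Geneid"))]],
    logFC   = df[[pick(c("logFC", "logfc"))]],
    FDR     = df[[pick(c("fdr", "FDR", "adjP", "adj_p"))]]
  )
}
```

Running `lapply(rna_files, read_rna_file)` inside the job would exercise the base readers. Passing `readers = list(tsv = readr::read_tsv, csv = readr::read_csv)` switches back to readr, so the last path logged before a crash would point to the file involved.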

Is the vroom release mentioned in that issue likely to be published soon? I notice the package is still at v1.6.5.***.

The relevant package versions and the HPC OS are below, but note that this output is from the head node. In other log files where I printed `sessionInfo()` while running on the cluster, the `Running under: Red Hat Enterprise Linux 8.4 (Ootpa)` line and the `Matrix products: default` and `BLAS/LAPACK: /hpcfs/users/******/envs/f4994948c5b33369acc304940a5fa825_/lib/libopenblasp-r0.3.26.so; LAPACK version 3.12.0` lines don't appear. I'm not sure whether that's useful information.

```
sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux 8.4 (Ootpa)

Matrix products: default
BLAS/LAPACK: /hpcfs/users/******/envs/f4994948c5b33369acc304940a5fa825_/lib/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8    
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

time zone: Australia/Adelaide
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] vroom_1.6.5 readr_2.1.5

loaded via a namespace (and not attached):
 [1] utf8_1.2.4       R6_2.5.1         tidyselect_1.2.0 bit_4.0.5       
 [5] tzdb_0.4.0       magrittr_2.0.3   glue_1.7.0       tibble_3.2.1    
 [9] pkgconfig_2.0.3  bit64_4.0.5      lifecycle_1.0.4  cli_3.6.2       
[13] fansi_1.0.6      vctrs_0.6.5      compiler_4.3.3   hms_1.1.3       
[17] pillar_1.9.0     crayon_1.5.2     rlang_1.1.3
```
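
Since the session info in this report was collected on the head node, it may help to capture the same details from inside the cluster job and compare the two. A minimal sketch, assuming the job can write to its working directory; `log_session_info` and the default file name are hypothetical, and the node name and core count are extra context I've added, not something vroom itself reports.

```r
# Hypothetical helper: record where and how R is running from inside the job
log_session_info <- function(path = "session_info_job.txt") {
  info <- c(
    paste("Node:", Sys.info()[["nodename"]]),
    # Cores visible to R; this can differ between head node and compute nodes
    paste("Cores detected:", parallel::detectCores()),
    "",
    # Same text that sessionInfo() prints at the console
    utils::capture.output(utils::sessionInfo())
  )
  writeLines(info, path)
  invisible(path)
}
```

Calling `log_session_info()` at the top of the job script, before any call to `read_tsv()`, would leave a file that can be compared line by line with the head-node output.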