Datetime parsing randomly (but rarely) fails when using multiple threads and latin1 encoding

tidyverse / vroom

Thank you for the lovely package. When vroom parses a file containing datetime values using the latin1 encoding and more than one thread, it occasionally, and nondeterministically, reports that some values do not match the expected format.

I have tried to keep this example minimal, but because the failure is not deterministic, I had to guess at the data size and number of replications needed to reliably produce at least one error. The reproduction code is below.

# Create test file.
times <- 
  c("31JAN2015:18:47:49", "31JAN2015:19:35:09", "31JAN2015:21:10:28", 
    "31JAN2015:20:02:19", "31JAN2015:18:04:39", "31JAN2015:19:58:32", 
    "31JAN2015:18:07:25", "31JAN2015:18:30:29", "31JAN2015:19:54:57", 
    "31JAN2015:20:17:13", "31JAN2015:19:44:46", "31JAN2015:20:30:18", 
    "31JAN2015:20:01:47", "31JAN2015:20:35:36", "31JAN2015:20:21:47", 
    "31JAN2015:18:39:52", "31JAN2015:20:51:51", "31JAN2015:21:26:30", 
    "31JAN2015:21:27:06", "31JAN2015:20:07:45", "31JAN2015:22:02:21", 
    "31JAN2015:20:35:48", "31JAN2015:20:23:30", "31JAN2015:21:10:12", 
    "31JAN2015:22:05:21", "31JAN2015:20:26:31", "31JAN2015:22:16:10", 
    "31JAN2015:22:11:14", "01FEB2015:01:08:45")

file <- tempfile()
write.csv(data.frame(a = times), 
          file, 
          row.names = FALSE,
          fileEncoding = "latin1")

library(vroom)

probs <- function(){
  test <-
    vroom::vroom(file,
                 delim = ";", # Can be anything not in times.
                 progress = FALSE, 
                 num_threads = 2, # Anything greater than 1
                 locale = locale(
                   encoding = "latin1" # Necessary for bug repro
                 ),
                 col_types = cols(
                   a = col_datetime(format = "%d%b%Y:%H:%M:%OS")
                 )
    )
  problems(test)
}

# Read test file 5000 times.
first <- replicate(5000, probs(), simplify = FALSE)
# Display all reads with problems.
first[sapply(first, nrow) > 0]

I would expect this code never to fail on any read. Even if there were an error, I would expect it to be the same error every time. Instead, on every machine I have tested, some reads fail on seemingly random rows, for example:

[[1]]
# A tibble: 1 x 5
    row   col expected                   actual             file                                                  
  <int> <int> <chr>                      <chr>              <chr>                                                 
1     6     1 date like %d%b%Y:%H:%M:%OS 31JAN2015:18:04:39 -

[[2]]
# A tibble: 1 x 5
    row   col expected                   actual             file                                                  
  <int> <int> <chr>                      <chr>              <chr>                                                 
1     8     1 date like %d%b%Y:%H:%M:%OS 31JAN2015:18:07:25 -

[[3]]
# A tibble: 1 x 5
    row   col expected                   actual             file                                                  
  <int> <int> <chr>                      <chr>              <chr>                                                 
1     9     1 date like %d%b%Y:%H:%M:%OS 31JAN2015:18:30:29 -
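
Because the failing row changes from run to run, it can help to tally how often each row is reported across all replicates. The following is a minimal base R sketch; `count_problem_rows()` is a hypothetical helper (not part of vroom), and the `example` list is a stand-in built from the three failures shown above rather than real output.

```r
# Hypothetical helper (not part of vroom): count how often each row number
# appears across a list of problems() results.
count_problem_rows <- function(results) {
  # Collect the `row` column from every result; empty results contribute nothing.
  rows <- unlist(lapply(results, function(p) p$row))
  if (length(rows) == 0) {
    return(data.frame(row = integer(0), n = integer(0)))
  }
  counts <- table(rows)
  data.frame(
    row = as.integer(names(counts)),
    n = as.integer(counts)
  )
}

# Stand-in input mirroring the failures shown above (rows 6, 8 and 9),
# plus one clean read with no problems.
example <- list(
  data.frame(row = 6L, col = 1L),
  data.frame(row = integer(0), col = integer(0)),
  data.frame(row = 8L, col = 1L),
  data.frame(row = 9L, col = 1L)
)
count_problem_rows(example)
```

In a real session, `count_problem_rows(first)` would take the list produced by the `replicate()` call, since each element returned by `problems()` has a `row` column.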

I have reproduced this issue on Windows and Linux with vroom 1.5.7 and R 4.1.3, and with the development version of vroom (1.6.0.9000). I also tested on R 3.6.3 on Linux.

library(vroom)

times <- c(
  "31JAN2015:18:47:49", "31JAN2015:19:35:09", "31JAN2015:21:10:28",
  "31JAN2015:20:02:19", "31JAN2015:18:04:39", "31JAN2015:19:58:32",
  "31JAN2015:18:07:25", "31JAN2015:18:30:29", "31JAN2015:19:54:57",
  "31JAN2015:20:17:13", "31JAN2015:19:44:46", "31JAN2015:20:30:18",
  "31JAN2015:20:01:47", "31JAN2015:20:35:36", "31JAN2015:20:21:47",
  "31JAN2015:18:39:52", "31JAN2015:20:51:51", "31JAN2015:21:26:30",
  "31JAN2015:21:27:06", "31JAN2015:20:07:45", "31JAN2015:22:02:21",
  "31JAN2015:20:35:48", "31JAN2015:20:23:30", "31JAN2015:21:10:12",
  "31JAN2015:22:05:21", "31JAN2015:20:26:31", "31JAN2015:22:16:10",
  "31JAN2015:22:11:14", "01FEB2015:01:08:45"
)

file <- tempfile()
write.csv(data.frame(a = times), file, row.names = FALSE, fileEncoding = "latin1")

probs <- function() {
  test <- vroom(
    file,
    delim = ",",
    progress = FALSE,
    num_threads = 2,
    locale = locale(encoding = "latin1"),
    col_types = cols(a = col_datetime(format = "%d%b%Y:%H:%M:%OS"))
  )
  problems(test)
}

first <- suppressWarnings(replicate(1000, probs(), simplify = FALSE))
dplyr::bind_rows(first, .id = "id")
#> # A tibble: 14 × 6
#>    id      row   col expected                   actual             file
#>    <chr> <int> <int> <chr>                      <chr>              <chr>
#>  1 79       30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  2 85       30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  3 132      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  4 133      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  5 243      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  6 459      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  7 470      13     1 date like %d%b%Y:%H:%M:%OS 31JAN2015:20:30:18 /private/tmp…
#>  8 552      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  9 592      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 10 680      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 11 706      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 12 747      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 13 866      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 14 881      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…

Datetime parsing randomly (but rarely) fails when using multiple threads and latin1 encoding #473