tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org

Datetime parsing randomly (but rarely) fails when using multiple threads and latin1 encoding #473

Open nograpes opened 2 years ago

nograpes commented 2 years ago

Thank you for the lovely package. When I use vroom to parse a file of datetime values with the latin1 encoding and more than one thread, it occasionally (randomly, but very rarely) reports that certain times are not formatted as expected.

I have tried to make this example minimal, but because it isn't deterministic, I have had to guess at the size of data and number of replications needed to consistently generate at least one error. Below is code for the bug reproduction.

# Create test file.
times <- 
  c("31JAN2015:18:47:49", "31JAN2015:19:35:09", "31JAN2015:21:10:28", 
    "31JAN2015:20:02:19", "31JAN2015:18:04:39", "31JAN2015:19:58:32", 
    "31JAN2015:18:07:25", "31JAN2015:18:30:29", "31JAN2015:19:54:57", 
    "31JAN2015:20:17:13", "31JAN2015:19:44:46", "31JAN2015:20:30:18", 
    "31JAN2015:20:01:47", "31JAN2015:20:35:36", "31JAN2015:20:21:47", 
    "31JAN2015:18:39:52", "31JAN2015:20:51:51", "31JAN2015:21:26:30", 
    "31JAN2015:21:27:06", "31JAN2015:20:07:45", "31JAN2015:22:02:21", 
    "31JAN2015:20:35:48", "31JAN2015:20:23:30", "31JAN2015:21:10:12", 
    "31JAN2015:22:05:21", "31JAN2015:20:26:31", "31JAN2015:22:16:10", 
    "31JAN2015:22:11:14", "01FEB2015:01:08:45")

file <- tempfile()
write.csv(data.frame(a = times), 
          file, 
          row.names = FALSE,
          fileEncoding = "latin1")

library(vroom)

probs <- function(){
  test <-
    vroom::vroom(file,
                 delim = ";", # Can be anything not in times.
                 progress = FALSE, 
                 num_threads = 2, # Anything greater than 1
                 locale = locale(
                   encoding = "latin1" # Necessary for bug repro
                 ),
                 col_types = cols(
                   a = col_datetime(format = "%d%b%Y:%H:%M:%OS")
                 )
    )
  problems(test)
}

# Read test file 5000 times.
first <- replicate(5000, probs(), simplify = FALSE)
# Display all reads with problems.
first[sapply(first, nrow) > 0]

I would expect that code not to fail on any read. Even if there were an error, I would expect it to be the same error every time. But on every machine I have tested, some reads fail on random rows, for example:

[[1]]
# A tibble: 1 x 5
    row   col expected                   actual             file                                                  
  <int> <int> <chr>                      <chr>              <chr>                                                 
1     6     1 date like %d%b%Y:%H:%M:%OS 31JAN2015:18:04:39 -

[[2]]
# A tibble: 1 x 5
    row   col expected                   actual             file                                                  
  <int> <int> <chr>                      <chr>              <chr>                                                 
1     8     1 date like %d%b%Y:%H:%M:%OS 31JAN2015:18:07:25 -

[[3]]
# A tibble: 1 x 5
    row   col expected                   actual             file                                                  
  <int> <int> <chr>                      <chr>              <chr>                                                 
1     9     1 date like %d%b%Y:%H:%M:%OS 31JAN2015:18:30:29 -

I have recreated this issue on Windows and Linux with vroom 1.5.7, with R version 4.1.3. I have also recreated this issue with the development version of vroom (1.6.0.9000). I also tested on R 3.6.3 on Linux.
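
As a sanity check that the flagged values themselves are well-formed, here is a minimal base-R sketch that does not involve vroom. It takes the three strings from the `actual` column of the output above and tests them against the shape implied by the format string. The names `flagged`, `shape_ok`, `month_ok` and `well_formed` are illustrative only, and the regular expression is my reading of "%d%b%Y:%H:%M:%OS" for whole seconds, not vroom's own parsing logic.

```r
# Values copied from the `actual` column of the problems() output above.
flagged <- c("31JAN2015:18:04:39", "31JAN2015:18:07:25", "31JAN2015:18:30:29")

# Shape implied by "%d%b%Y:%H:%M:%OS" (whole seconds only):
# 2-digit day, 3-letter month, 4-digit year, then :HH:MM:SS.
shape_ok <- grepl("^[0-9]{2}[A-Z]{3}[0-9]{4}(:[0-9]{2}){3}$", flagged)

# Compare the month token with the built-in English abbreviations,
# so the check does not depend on the session locale.
month_ok <- substr(flagged, 3, 5) %in% toupper(month.abb)

# TRUE only if every flagged value passes both checks.
well_formed <- all(shape_ok & month_ok)
well_formed
```

If this check passes, the reported problems cannot be explained by malformed input, which is consistent with the failures moving between rows from run to run.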

hadley commented 1 year ago

I see this too. Slightly tweaked reprex below:

library(vroom)

times <- c("31JAN2015:18:47:49", "31JAN2015:19:35:09", "31JAN2015:21:10:28", "31JAN2015:20:02:19", "31JAN2015:18:04:39", "31JAN2015:19:58:32", "31JAN2015:18:07:25", "31JAN2015:18:30:29", "31JAN2015:19:54:57", "31JAN2015:20:17:13", "31JAN2015:19:44:46", "31JAN2015:20:30:18", "31JAN2015:20:01:47", "31JAN2015:20:35:36", "31JAN2015:20:21:47", "31JAN2015:18:39:52", "31JAN2015:20:51:51", "31JAN2015:21:26:30", "31JAN2015:21:27:06", "31JAN2015:20:07:45", "31JAN2015:22:02:21", "31JAN2015:20:35:48", "31JAN2015:20:23:30", "31JAN2015:21:10:12", "31JAN2015:22:05:21", "31JAN2015:20:26:31", "31JAN2015:22:16:10", "31JAN2015:22:11:14", "01FEB2015:01:08:45")

file <- tempfile()
write.csv(data.frame(a = times), file, row.names = FALSE, fileEncoding = "latin1")

probs <- function() {
  test <- vroom(
    file,
    delim = ",",
    progress = FALSE,
    num_threads = 2,
    locale = locale(encoding = "latin1"),
    col_types = cols(a = col_datetime(format = "%d%b%Y:%H:%M:%OS"))
  )
  problems(test)
}

first <- suppressWarnings(replicate(1000, probs(), simplify = FALSE))
dplyr::bind_rows(first, .id = "id")
#> # A tibble: 14 × 6
#>    id      row   col expected                   actual             file         
#>    <chr> <int> <int> <chr>                      <chr>              <chr>        
#>  1 79       30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  2 85       30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  3 132      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  4 133      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  5 243      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  6 459      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  7 470      13     1 date like %d%b%Y:%H:%M:%OS 31JAN2015:20:30:18 /private/tmp…
#>  8 552      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#>  9 592      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 10 680      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 11 706      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 12 747      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 13 866      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…
#> 14 881      30     1 date like %d%b%Y:%H:%M:%OS 01FEB2015:01:08:45 /private/tmp…

Created on 2023-08-01 with reprex v2.0.2

It's weird that the encoding is important for the reprex, given that it's a pure ASCII file.
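
The pure-ASCII observation can be checked directly by inspecting the bytes on disk. The sketch below is a base-R illustration under a couple of assumptions: it writes a three-value subset of `times` (to keep it short) in the same way as the reprex, and the names `path`, `bytes` and `is_ascii` are mine rather than anything from vroom.

```r
# A subset of the values from the reprex, to keep the sketch short.
times <- c("31JAN2015:18:47:49", "31JAN2015:20:30:18", "01FEB2015:01:08:45")

# Write the file the same way the reprex does.
path <- tempfile()
write.csv(data.frame(a = times), path, row.names = FALSE, fileEncoding = "latin1")

# Read the file back as raw bytes.
bytes <- readBin(path, what = "raw", n = file.size(path))

# ASCII bytes are 0x00-0x7F. Only bytes above that range would be
# represented differently in latin1 and UTF-8.
is_ascii <- all(as.integer(bytes) < 128L)
is_ascii
```

If every byte is below 128, re-encoding from latin1 cannot change the content, so the `encoding = "latin1"` setting presumably matters only because it sends the read through a different code path.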