tidyverse / vroom

Fast reading of delimited files
https://vroom.r-lib.org
Other
620 stars 60 forks source link

locking file when saving #395

Closed nbbn closed 2 years ago

nbbn commented 2 years ago

Hi! I've got problem of empty csv file while using readr package, that uses vroom. At some moment I've got absolutely empty file, without even header. File is used in Shiny app and has multiple read and write operations. I guess, save operation should be atomic, please find some example below:

# needed packages: future, promises, readr, vroom, data.atble
future::plan(future::multisession(workers = 4L))
# init csv file
data.table::fwrite(mtcars, "test.csv")

promises::future_promise({
  for (i in 1:100) {
    if (file.info("test.csv")$size == 0) {
      message("empty file!")
    }
    tryCatch({readr::read_csv(mtcars, "test.csv", show_col_types = FALSE)}, error=function(e) {NULL})
  }
})
promises::future_promise({
  for (i in 1:20) {
  tryCatch({vroom::vroom_write(mtcars, "test.csv", progress = FALSE)}, error=function(e) {NULL})
  #tryCatch({data.table::fwrite(mtcars, "test.csv", showProgress = FALSE)}, error=function(e) {NULL})
  }
})

"empty file!" message shows that for some time, csv file can be read and is empty. If you will switch in this example to data.table variant, you will see no messages.

jennybc commented 2 years ago

Can you provide an example that does not involve future and promises?

What versions of readr and vroom do you have?

nbbn commented 2 years ago

Not really, the problem is an example of some race condition. You have to read file while other code is saving it. Another example would require two separate scripts running at the same time.

> osVersion
[1] "Ubuntu 20.04.3 LTS"
> packageVersion("vroom")
[1] ‘1.5.7’
> packageVersion("readr")
[1] ‘2.1.1’
> R.version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          1.2                         
year           2021                        
month          11                          
day            01                          
svn rev        81115                       
language       R                           
version.string R version 4.1.2 (2021-11-01)
nickname       Bird Hippie  
nbbn commented 2 years ago

@sbearrows can you tell me why this issue is closed? Was it fixed?

jennybc commented 2 years ago

@nbbn We're just concluding a wholesale triage of all readr and vroom issues. While we don't deny what you're seeing, your problem seems to only play out in a rather unusual situation. We'll keep it in mind, in case it ends up being connected to something that comes up in more common usage. But we're simply making some realistic decisions about what our near term priorities are.

I'll also tag @DavisVaughan who may have more previous experience with the promises package, in case he has a quick observation about your situation.

DavisVaughan commented 2 years ago

I've managed to make an example with just callr. I'm not sure what readr/vroom can improve on beyond trying not to crash here. I think there is just no data to read at some points due to the race condition.

Select the entire code chunk and try and run it all at once. It should do something like:

library(callr)

path <- tempfile(fileext = ".csv")
vroom::vroom_write(mtcars, path, progress = FALSE)

read_it <- function(path) {
  for (i in 1:100) {
    vroom::vroom(path)
  }
  NULL
}
write_it <- function(path) {
  for (i in 1:100) {
    vroom::vroom_write(mtcars, path, progress = FALSE)
  }
  NULL
}

bg_read <- r_bg(func = read_it, args = list(path = path))
bg_write <- r_bg(func = write_it, args = list(path = path))

bg_read$wait()
bg_write$wait()

bg_read$get_result()
# > Error in get_result(out, private$options) : 
#>  callr subprocess failed: could not start R, exited with non-zero status, has crashed or was killed
#>  Type .Last.error.trace to see where the error occurred
bg_write$get_result()

Note that it wasn't working in a reprex for whatever reason. The reprex just ran infinitely.

So this implies that the background session that is reading the file is crashing. You can reproduce that in your master session with:

library(callr)

path <- tempfile(fileext = ".csv")
vroom::vroom_write(mtcars, path, progress = FALSE)

read_it <- function(path) {
  for (i in 1:100) {
    vroom::vroom(path)
  }
  NULL
}
write_it <- function(path) {
  for (i in 1:100) {
    vroom::vroom_write(mtcars, path, progress = FALSE)
  }
  NULL
}

bg_write <- r_bg(func = write_it, args = list(path = path))

read_it(path)

This should crash your R session.

I can make this occur with data.table too, that doesn't crash but it errors because it is trying to read an empty file:

library(callr)

path <- tempfile(fileext = ".csv")
vroom::vroom_write(mtcars, path, progress = FALSE)

read_it <- function(path) {
  for (i in 1:100) {
    data.table::fread(file = path)
  }
  NULL
}
write_it <- function(path) {
  for (i in 1:100) {
    data.table::fwrite(mtcars, path, showProgress = FALSE)
  }
  NULL
}

bg_read <- r_bg(func = read_it, args = list(path = path))
bg_write <- r_bg(func = write_it, args = list(path = path))

bg_read$wait()
bg_write$wait()

bg_read$get_result()
#> Error: callr subprocess failed: File is empty: /var/folders/41/qx_9ygp112nfysdfgxcssgwc0000gn/T//RtmpH4y2Hl/file16dc230ebf10.csv
bg_write$get_result()
#> NULL