tidyverse / duckplyr

A drop-in replacement for dplyr, powered by DuckDB for performance.
https://duckplyr.tidyverse.org/
Other
285 stars 19 forks source link

Cannot remove local file after operation with duckplyr #257

Open rafapereirabr opened 2 months ago

rafapereirabr commented 2 months ago

hi all, I'm tryin to use {duckplyr} as a dependency in one of my packages but I'm facing a strange problem. Basically, my package downloads and caches locally a few parquet files that it then uses to do a few basic operations. Due to CRAN policies , users should be able to manage / delete all files from the cache. However, for some reason, I cannot remove a local file after processing it with an operation with {duckplyr}. See reproducible example below.

obs. {duckplyr} is an amazing package, so thank you all for such a great work !

reprex

library(arrow)
library(duckdb)
library(duckplyr)

# download parquet files
url1 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_emigration_v0.2.0.parquet'
url2 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_households_v0.2.0.parquet'
curl::curl_download(url = url1, destfile = 'temp_parquet1.parquet')
curl::curl_download(url = url2, destfile = 'temp_parquet2.parquet')

# open data sets
arrow1 <- arrow::open_dataset('temp_parquet1.parquet')
arrow2 <- arrow::open_dataset('temp_parquet2.parquet')

# simple function that emulates part of what I do in my package
merge_fun <- function(arrow1, arrow2){

  key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
                'code_region', 'name_region', 'code_weighting', 'V0300')

  # drop repeated vars
  all_common_vars <- names(arrow1)[names(arrow1) %in% names(arrow2)]
  vars_to_drop <- setdiff(all_common_vars, key_vars)
  arrow2 <- dplyr::select(arrow2, -all_of(vars_to_drop)) |>
    dplyr::compute()

  # convert to duckdb
  arrow1 <- arrow::to_duckdb(arrow1)
  arrow2 <- arrow::to_duckdb(arrow2)

  # merge
  df_merge <- duckplyr::left_join(arrow1, arrow2) |>
            dplyr::compute()

  df_merge <- arrow::to_arrow(df_merge)
  return(df_merge)
}

# run function
test <- merge_fun( arrow1, arrow2)

Now here's when the error happens. After running the code above, I cannot remove the local file "temp_parquet1.parquet".

# try to remove file 1. 
file.exists('temp_parquet1.parquet')
#> TRUE
file.remove('temp_parquet1.parquet')

Warning message: In file.remove("temp_parquet1.parquet") : cannot remove file 'temp_parquet1.parquet', reason 'Permission denied'

Surprisingly, there is no problem when I try to delete "temp_parquet2.parquet".

# remove file 2
file.exists('temp_parquet2.parquet')
#> TRUE
file.remove('temp_parquet2.parquet')
#> TRUE
vcardonas commented 2 months ago

I replicated the code and it worked fine for me.

> # try to remove file 1. 
> file.exists('temp_parquet1.parquet')
[1] TRUE
> file.remove('temp_parquet1.parquet')
[1] TRUE
> # remove file 2
> file.exists('temp_parquet2.parquet')
[1] TRUE
> file.remove('temp_parquet2.parquet')
[1] TRUE
eitsupi commented 2 months ago

Is it possible that this is a Windows issue? As I recall that readr 2.0.0 also had the lazy loading problem on Windows. https://www.tidyverse.org/blog/2021/07/readr-2-0-0/#lazy-reading

@rafapereirabr Are you using Windows?

rafapereirabr commented 2 months ago

Hi all. Yes, @eitsupi , I'm using windows and this is very much the problem. I thought that perhaps there was an issue with closing the database connection, though. So, I've tested a sliglty different version of my function and it seems to work.

The difference is that in my previous code (above in this issue) I used arrow::to_duckdb(arrow1) to create a table that is passed to duckplyr::left_join(). In the new code (see below), I explicility create a duckdb connection, register the arrow tables, follow to perform the duckplyr::left_join(), and then explicitly close the connection before returning the function result.

I think this could solve my problem, but I thought perhaps there is something here that could be incorporated into {duckplyr} to avoid the same issue for future Windows users.

library(arrow)
library(duckdb)
library(duckplyr)

# download parquet files
url1 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_emigration_v0.2.0.parquet'
url2 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_households_v0.2.0.parquet'
curl::curl_download(url = url1, destfile = 'temp_parquet1.parquet', quiet = F)
curl::curl_download(url = url2, destfile = 'temp_parquet2.parquet', quiet = F)

# open data sets
arrow1 <- arrow::open_dataset('temp_parquet1.parquet')
arrow2 <- arrow::open_dataset('temp_parquet2.parquet')

# simple function that emulates part of what I do in my package
merge_fun <- function(arrow1, arrow2){

  key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
                'code_region', 'name_region', 'code_weighting', 'V0300')

  # drop repeated vars
  all_common_vars <- names(arrow1)[names(arrow1) %in% names(arrow2)]
  vars_to_drop <- setdiff(all_common_vars, key_vars)
  arrow2 <- dplyr::select(arrow2, -all_of(vars_to_drop)) |>
    dplyr::compute()

  # OLD code # convert to duckdb
  # OLD code # arrow1 <- arrow::to_duckdb(arrow1)
  # OLD code # arrow2 <- arrow::to_duckdb(arrow2)

  # con <- DBI::dbConnect(duckdb::duckdb(), read_only = FALSE)
  con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)
  duckdb::duckdb_register_arrow(con, 'arrow1', arrow1)
  duckdb::duckdb_register_arrow(con, 'arrow2', arrow2)

  # merge
  df_merge <- duckplyr::left_join(dplyr::tbl(con, "arrow1"),
                                  dplyr::tbl(con, "arrow2"))

  df_merge <- dplyr::compute(df_merge)

  df_merge <- arrow::to_arrow(df_merge)  |>
              dplyr::compute()

  # remove duckdb instance
  duckdb::duckdb_unregister_arrow(con, 'arrow1')
  duckdb::duckdb_unregister_arrow(con, 'arrow2')
  duckdb::dbDisconnect(con, shutdown = TRUE)
  rm(con)
  gc()
  return(df_merge)
}

# run function
test <- merge_fun( arrow1, arrow2)

# try to remove file 1.
file.exists('temp_parquet1.parquet')
file.remove('temp_parquet1.parquet')

# remove file 2
file.exists('temp_parquet2.parquet')
file.remove('temp_parquet2.parquet')
eitsupi commented 2 months ago

In your example, I felt that you may have misunderstood the function of duckplyr. arrow::to_duckdb() is a function for working with dbplyr, not duckdplyr, and should not be intended to be combined with duckplyr. How about specifying dplyr::left_join() instead of duckplyr::left_join()?

Also, since duckdb is faster than arrow in my experience, it may be faster to use dbplyr (or duckplyr) completely instead of using arrow::to_duckdb().

rafapereirabr commented 2 months ago

Hi @eitsupi . For the moment, using directly dplyr::left_join() on arrow tables is not an option in my case. The arrow package in cannot cannot cope with merging large data sets (see this issue)

This is what motivated me to use duckplyr::left_join(), because this way I would get best performance with duckdb but would still be able to work with more tidyverse friendly code

eitsupi commented 2 months ago

In that case, why would you use the arrow package? Simply duckdb and dbplyr (or duckplyr) may be sufficient.

rafapereirabr commented 2 months ago

I only knew {arrow} when I started building the {censobr} package, it was a perfect choice because of the the really simple integration with {dplyr} . The only bottleneck I've found with {arrow} so far is merging really large data sets. I'm now considering importing {duckdb} or {duckplyr} to improve the performance of {censobr} when performing out-of-memory joins. Now I see {duckplyr} doesn't have integration with arrow. right?