Open rafapereirabr opened 2 months ago
I replicated the code and it worked fine for me.
> # try to remove file 1.
> file.exists('temp_parquet1.parquet')
[1] TRUE
> file.remove('temp_parquet1.parquet')
[1] TRUE
> # remove file 2
> file.exists('temp_parquet2.parquet')
[1] TRUE
> file.remove('temp_parquet2.parquet')
[1] TRUE
Is it possible that this is a Windows issue? As I recall that readr 2.0.0 also had the lazy loading problem on Windows. https://www.tidyverse.org/blog/2021/07/readr-2-0-0/#lazy-reading
@rafapereirabr Are you using Windows?
Hi all. Yes, @eitsupi , I'm using windows and this is very much the problem. I thought that perhaps there was an issue with closing the database connection, though. So, I've tested a sliglty different version of my function and it seems to work.
The difference is that in my previous code (above in this issue) I used arrow::to_duckdb(arrow1)
to create a table that is passed to duckplyr::left_join()
. In the new code (see below), I explicility create a duckdb connection, register the arrow tables, follow to perform the duckplyr::left_join()
, and then explicitly close the connection before returning the function result.
I think this could solve my problem, but I thought perhaps there is something here that could be incorporated into {duckplyr} to avoid the same issue for future Windows users.
library(arrow)
library(duckdb)
library(duckplyr)
# download parquet files
url1 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_emigration_v0.2.0.parquet'
url2 <- 'https://github.com/ipeaGIT/censobr/releases/download/v0.2.0/2010_households_v0.2.0.parquet'
curl::curl_download(url = url1, destfile = 'temp_parquet1.parquet', quiet = F)
curl::curl_download(url = url2, destfile = 'temp_parquet2.parquet', quiet = F)
# open data sets
arrow1 <- arrow::open_dataset('temp_parquet1.parquet')
arrow2 <- arrow::open_dataset('temp_parquet2.parquet')
# simple function that emulates part of what I do in my package
merge_fun <- function(arrow1, arrow2){
key_vars <- c('code_muni', 'code_state', 'abbrev_state','name_state',
'code_region', 'name_region', 'code_weighting', 'V0300')
# drop repeated vars
all_common_vars <- names(arrow1)[names(arrow1) %in% names(arrow2)]
vars_to_drop <- setdiff(all_common_vars, key_vars)
arrow2 <- dplyr::select(arrow2, -all_of(vars_to_drop)) |>
dplyr::compute()
# OLD code # convert to duckdb
# OLD code # arrow1 <- arrow::to_duckdb(arrow1)
# OLD code # arrow2 <- arrow::to_duckdb(arrow2)
# con <- DBI::dbConnect(duckdb::duckdb(), read_only = FALSE)
con <- duckdb::dbConnect(duckdb::duckdb(), read_only = FALSE)
duckdb::duckdb_register_arrow(con, 'arrow1', arrow1)
duckdb::duckdb_register_arrow(con, 'arrow2', arrow2)
# merge
df_merge <- duckplyr::left_join(dplyr::tbl(con, "arrow1"),
dplyr::tbl(con, "arrow2"))
df_merge <- dplyr::compute(df_merge)
df_merge <- arrow::to_arrow(df_merge) |>
dplyr::compute()
# remove duckdb instance
duckdb::duckdb_unregister_arrow(con, 'arrow1')
duckdb::duckdb_unregister_arrow(con, 'arrow2')
duckdb::dbDisconnect(con, shutdown = TRUE)
rm(con)
gc()
return(df_merge)
}
# run function
test <- merge_fun( arrow1, arrow2)
# try to remove file 1.
file.exists('temp_parquet1.parquet')
file.remove('temp_parquet1.parquet')
# remove file 2
file.exists('temp_parquet2.parquet')
file.remove('temp_parquet2.parquet')
In your example, I felt that you may have misunderstood the function of duckplyr.
arrow::to_duckdb()
is a function for working with dbplyr
, not duckdplyr
, and should not be intended to be combined with duckplyr
.
How about specifying dplyr::left_join()
instead of duckplyr::left_join()
?
Also, since duckdb
is faster than arrow
in my experience, it may be faster to use dbplyr
(or duckplyr
) completely instead of using arrow::to_duckdb()
.
Hi @eitsupi . For the moment, using directly dplyr::left_join()
on arrow tables is not an option in my case. The arrow package in cannot cannot cope with merging large data sets (see this issue)
This is what motivated me to use duckplyr::left_join()
, because this way I would get best performance with duckdb but would still be able to work with more tidyverse friendly code
In that case, why would you use the arrow
package? Simply duckdb
and dbplyr
(or duckplyr
) may be sufficient.
I only knew {arrow} when I started building the {censobr} package, it was a perfect choice because of the the really simple integration with {dplyr} . The only bottleneck I've found with {arrow} so far is merging really large data sets. I'm now considering importing {duckdb} or {duckplyr} to improve the performance of {censobr} when performing out-of-memory joins. Now I see {duckplyr} doesn't have integration with arrow. right?
hi all, I'm tryin to use {duckplyr} as a dependency in one of my packages but I'm facing a strange problem. Basically, my package downloads and caches locally a few parquet files that it then uses to do a few basic operations. Due to CRAN policies , users should be able to manage / delete all files from the cache. However, for some reason, I cannot remove a local file after processing it with an operation with {duckplyr}. See reproducible example below.
obs. {duckplyr} is an amazing package, so thank you all for such a great work !
reprex
Now here's when the error happens. After running the code above, I cannot remove the local file
"temp_parquet1.parquet"
.Surprisingly, there is no problem when I try to delete
"temp_parquet2.parquet"
.