Closed DyfanJones closed 3 weeks ago
It looks like the issue is coming from:
tmp_file = paste0(tempdir(), "/", make.names(actual_url))
download.file(url = actual_url, destfile = tmp_file)
This causes download.file to fail, presumably because a long URL yields a file name that is too long. A possible solution is to either hash actual_url or use an environment to cache the temp file name. For example:
cache_temp_file <- new.env(parent = new.env())
check_is_link = function(path, reuse_downloaded, raise_error = FALSE) {
  # do nothing let path fail on rust side
  if (is.na(path)) {
    return(NULL)
  }
  if (!file.exists(path)) {
    con = NULL
    # check if possible to open url connection
    assumed_schemas = c("", "https://", "http://", "ftp://")
    for (i_schema in assumed_schemas) {
      if (!is.null(con)) break
      actual_url = paste0(i_schema, path)
      suppressWarnings(
        tryCatch(
          {
            con = url(actual_url, open = "rt")
          },
          error = function(e) {}
        )
      )
    }
    # try download file if valid url
    if (!is.null(con)) {
      close(con)
      # reserve one temp file name per url and remember it
      if (is.null(cache_temp_file[[actual_url]])) {
        cache_temp_file[[actual_url]] <- tempfile()
      }
      if (isFALSE(reuse_downloaded) || isFALSE(file.exists(cache_temp_file[[actual_url]]))) {
        download.file(url = actual_url, destfile = cache_temp_file[[actual_url]])
        message(paste("tmp file placed in \n", cache_temp_file[[actual_url]]))
      }
      path = cache_temp_file[[actual_url]] # redirect path to tmp downloaded file
    } else {
      if (raise_error) {
        stop("failed to locate file at path/url: ", path)
      }
      # do nothing let path fail on rust side
      path = NULL
    }
  }
  path
}
Happy to raise a PR if this approach seems OK with you guys :)
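To illustrate why deriving the file name from the URL is fragile, here is a minimal sketch. The URL is hypothetical, and the 255-byte limit on a single file name is an assumption about common file systems, not something confirmed in this thread:

```r
# Hypothetical long URL, e.g. a pre-signed link with a long query string.
long_url <- paste0(
  "https://example.com/data.csv?token=",
  strrep("a", 300)
)

# Original scheme: the file name is built from the whole URL,
# so it grows with the URL.
derived_name <- make.names(long_url)
nchar(derived_name) > 255

# tempfile() picks a name that does not depend on the URL,
# so it stays short however long the URL is.
short_name <- basename(tempfile())
nchar(short_name) < 255
```

Both comparisons should come out TRUE under these assumptions, which is why caching a tempfile() name per URL avoids the problem.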
Hi @DyfanJones, thanks for the report and the proposed fix.
This looks good to me but I'm wondering if using a simple list instead of an environment would do the trick? I can't try myself because your reprex requires some S3 credentials that I don't have. Also, I'd like to have a test for this, do you think it would be possible to add one without any dependencies?
Sadly a simple list won't work: R copies a list when a function modifies it, so the change never reaches the original. Environments are the exception, since they are modified by reference. Here is a small example:
demo_list <- list()
demo_list_fn <- function(x) {
  if (is.null(demo_list[[x]])) demo_list[[x]] <- tempfile()
}
demo_env <- new.env(parent = new.env())
demo_env_fn <- function(x) {
  if (is.null(demo_env[[x]])) demo_env[[x]] <- tempfile()
}
x <- "helloworld"
demo_list_fn(x)
demo_env_fn(x)
demo_list
#> list()
demo_env[[x]]
#> [1] "/var/folders/sp/scbzkbwx6hbchmylsx0y52k80000gn/T//RtmpYkGFjV/file2e7042d07df"
Created on 2024-04-17 with reprex v2.1.0
For unit testing we can just do a mock test as the only thing we want to test is the caching of tempfile locations :)
That said, there are ways to make a plain list work:
demo_list <- list()
demo_list_fn <- function(x) {
  if (is.null(demo_list[[x]])) demo_list[[x]] <- tempfile()
}
demo_env <- new.env(parent = new.env())
demo_env_fn <- function(x) {
  if (is.null(demo_env[[x]])) demo_env[[x]] <- tempfile()
}
demo_list_fn_2 <- function(x) {
  if (is.null(demo_list[[x]])) demo_list[[x]] <- tempfile()
  assign("demo_list", demo_list, envir = parent.frame())
}
demo_list_fn_3 <- function(x) {
  if (is.null(demo_list[[x]])) demo_list[[x]] <<- tempfile()
}
x <- "helloworld"
demo_list_fn(x)
demo_env_fn(x)
demo_list
#> list()
demo_env[[x]]
#> [1] "/var/folders/sp/scbzkbwx6hbchmylsx0y52k80000gn/T//Rtmp1J1nrV/file45ec63b5092f"
demo_list_fn_2(x)
demo_list
#> $helloworld
#> [1] "/var/folders/sp/scbzkbwx6hbchmylsx0y52k80000gn/T//Rtmp1J1nrV/file45ec26e1fcd"
demo_list_fn_3("Goodbye world")
demo_list
#> $helloworld
#> [1] "/var/folders/sp/scbzkbwx6hbchmylsx0y52k80000gn/T//Rtmp1J1nrV/file45ec26e1fcd"
#>
#> $`Goodbye world`
#> [1] "/var/folders/sp/scbzkbwx6hbchmylsx0y52k80000gn/T//Rtmp1J1nrV/file45ecbe9ccfa"
Created on 2024-04-17 with reprex v2.1.0
The problem with <<- is that the package namespace is locked once the package is loaded, so assigning to a package-level binding will error. Another option is to use a function environment (a closure), for example:
cache_file <- function() {
  cache <- list()
  temp_file <- function(x) {
    if (is.null(cache[[x]])) cache[[x]] <<- tempfile()
    return(cache[[x]])
  }
  return(temp_file)
}
## initialise in package `.onLoad`
cache <- cache_file()
cache("helloworld")
#> [1] "/var/folders/sp/scbzkbwx6hbchmylsx0y52k80000gn/T//RtmpXgXxke/file16c13e313726"
cache("helloworld")
#> [1] "/var/folders/sp/scbzkbwx6hbchmylsx0y52k80000gn/T//RtmpXgXxke/file16c13e313726"
Created on 2024-04-17 with reprex v2.1.0
So it will look something like:
check_is_link = function(path, reuse_downloaded, raise_error = FALSE) {
  # do nothing let path fail on rust side
  if (is.na(path)) {
    return(NULL)
  }
  if (!file.exists(path)) {
    con = NULL
    # check if possible to open url connection
    assumed_schemas = c("", "https://", "http://", "ftp://")
    for (i_schema in assumed_schemas) {
      if (!is.null(con)) break
      actual_url = paste0(i_schema, path)
      suppressWarnings(
        tryCatch(
          {
            con = url(actual_url, open = "rt")
          },
          error = function(e) {}
        )
      )
    }
    # try download file if valid url
    if (!is.null(con)) {
      close(con)
      if (isFALSE(reuse_downloaded) || isFALSE(file.exists(cache(actual_url)))) {
        download.file(url = actual_url, destfile = cache(actual_url))
        message(paste("tmp file placed in \n", cache(actual_url)))
      }
      path = cache(actual_url) # redirect path to tmp downloaded file
    } else {
      if (raise_error) {
        stop("failed to locate file at path/url: ", path)
      }
      # do nothing let path fail on rust side
      path = NULL
    }
  }
  path
}
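A mock test for the caching of temp file locations, as mentioned earlier, could look roughly like this. It is only a sketch: it uses base R stopifnot() rather than the package's test framework, and the URLs are hypothetical and never contacted.

```r
# Closure-based cache with the same shape as cache_file() above.
cache_file <- function() {
  cache <- list()
  function(x) {
    if (is.null(cache[[x]])) cache[[x]] <<- tempfile()
    cache[[x]]
  }
}

cache <- cache_file()

# Hypothetical URLs: only the name bookkeeping is exercised.
url_a <- "https://example.com/a.csv"
url_b <- "https://example.com/b.csv"

first <- cache(url_a)
second <- cache(url_a)
other <- cache(url_b)

stopifnot(
  identical(first, second), # same URL gives the same temp file name
  !identical(first, other), # a different URL gives a different name
  !file.exists(first)       # tempfile() only reserves a name, no file yet
)
```

Because nothing is downloaded, a test like this needs no network access or credentials.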
I think it's fine to use an environment as in your first case. Sorry for the extra work, it was mostly out of curiosity. Do you want to make a PR?
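For reference, the difference between rebinding a list and mutating an environment under a locked namespace can be sketched as follows. A plain locked environment stands in for the package namespace here, which is an assumption made for illustration, and the names pkg_ns, cache_list and cache_env are hypothetical:

```r
# Stand-in for a package namespace (illustration only).
pkg_ns <- new.env()
assign("cache_list", list(), envir = pkg_ns)
assign("cache_env", new.env(parent = emptyenv()), envir = pkg_ns)

# Lock the environment and all of its bindings.
lockEnvironment(pkg_ns, bindings = TRUE)

# Replacing the list means changing a locked binding, which errors.
list_result <- tryCatch(
  {
    assign("cache_list", list(a = tempfile()), envir = pkg_ns)
    "ok"
  },
  error = function(e) "error"
)

# Adding an entry to the cached environment leaves the binding itself
# untouched, so it is allowed.
env_result <- tryCatch(
  {
    assign("a", tempfile(), envir = pkg_ns$cache_env)
    "ok"
  },
  error = function(e) "error"
)

list_result
env_result
```

Under these assumptions only the list update fails, which matches the reasoning for keeping the environment-based cache.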
No worries :) happy to give as many options to get the best option :) Happy to raise a PR.
Thanks for your contribution!
Tip: Do not use stop() inside this package; use something that includes Err_plain() and unwrap() instead.
Happy to fix stop() and replace it with Err_plain() and unwrap(). @eitsupi Do you want the PR to fix the stop() calls in this file: https://github.com/pola-rs/r-polars/blob/main/R/io_csv.R
Yes, stop() is still in various places and has not been completely replaced (#568). PRs are welcome!
Hi all,
It looks like there is an issue with reading from URLs that are too long.
Created on 2024-04-16 with reprex v2.1.0