Open mpjashby opened 1 year ago
Thanks for reporting and the excellent reprex! However, I'm not able to reproduce the issue with the current CRAN release of osfr (v0.2.8).
When I run
code_pkg_files <- osfr::osf_retrieve_node("https://osf.io/zyaqn") %>%
  osfr::osf_ls_files(path = "Data for R package", n_max = Inf)
the resulting `osf_tbl_file` includes 630 rows, which matches the number of files in "Data for R package", and includes the file `crime_open_database_sample_detroit_2020.Rds`.
I'm guessing this was caused by an intermittent server error, although I would have expected the function to return an error rather than a partial result. If it occurs again, you can enable verbose logging by defining the environment variable `OSF_LOG` to point to a logfile. For example:
OSF_LOG=osfr.log
This should give us more insight into what's happening at the request level.
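The variable can also be set from within R for the current session. A minimal sketch, assuming osfr reads `OSF_LOG` when the package is loaded (the `<Logging enabled: osfr.log>` startup message shown elsewhere in this thread suggests it does), so the variable has to be set before `library(osfr)` is called:

```r
# Assumption: osfr checks OSF_LOG at load time, so set it before library(osfr).
Sys.setenv(OSF_LOG = "osfr.log")

# Confirm the variable is visible to the current session.
Sys.getenv("OSF_LOG")
```

Putting the line `OSF_LOG=osfr.log` in an `.Renviron` file should have the same effect for new sessions.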
I've tried to investigate this in more detail, but haven't got much further. As far as I can tell, it is either a problem with `osf_ls_files()` or the OSF API (rather than a connection error or similar), since I've run the examples below over several hours (and multiple R sessions) and got the same results. Of note is that, as well as some files being missing, `osf_ls_files()` reports other files multiple times (38 times in the case shown below).
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(osfr)
#> <Logging enabled: osfr.log>
# Trying to download a file directly using `osf_retrieve_file()` works as
# expected:
osf_retrieve_file("https://osf.io/avj2k")
#> # A tibble: 1 × 3
#> name id meta
#> <chr> <chr> <list>
#> 1 crime_open_database_core_mesa_2022.Rds avj2k <named list [3]>
# But the same file (`crime_open_database_core_mesa_2022.Rds`) is missing when
# we try to retrieve a list of files with `osf_ls_files()`:
osf_retrieve_node("https://osf.io/zyaqn") |>
  osf_ls_files(path = "Data for R package", n_max = Inf) |>
  filter(stringr::str_detect(name, "core_mesa")) |>
  arrange(name)
#> # A tibble: 6 × 3
#> name id meta
#> <chr> <chr> <list>
#> 1 crime_open_database_core_mesa_2016.Rds 5c926f5d4712b400173cc144 <named list>
#> 2 crime_open_database_core_mesa_2017.Rds 5c926f5da743a900176287aa <named list>
#> 3 crime_open_database_core_mesa_2018.Rds 5c926f5d2286e80018c62e36 <named list>
#> 4 crime_open_database_core_mesa_2019.Rds 5f009012af1156016e3b61ca <named list>
#> 5 crime_open_database_core_mesa_2020.Rds 614da1c7ec7885002840fb0d <named list>
#> 6 crime_open_database_core_mesa_2021.Rds 6347e653ec7f3f2df2f5f524 <named list>
# Just to check, if we list the same files using the API directly, the 2022 file
# is there as expected:
jsonlite::fromJSON("https://api.osf.io/v2/nodes/zyaqn/files/osfstorage/5bbde32b7cb18100193c778a/?filter%5Bname%5D=core_mesa") |>
  purrr::pluck("data", "attributes", "name")
#> [1] "crime_open_database_core_mesa_2016.Rds"
#> [2] "crime_open_database_core_mesa_2017.Rds"
#> [3] "crime_open_database_core_mesa_2018.Rds"
#> [4] "crime_open_database_core_mesa_2019.Rds"
#> [5] "crime_open_database_core_mesa_2020.Rds"
#> [6] "crime_open_database_core_mesa_2021.Rds"
#> [7] "crime_open_database_core_mesa_2022.Rds"
# And an extra twist: in some cases the same file is reported by
# `osf_ls_files()` multiple times:
osfr::osf_retrieve_node("https://osf.io/zyaqn") |>
  osfr::osf_ls_files(path = "Data for R package", n_max = Inf) |>
  dplyr::filter(stringr::str_detect(name, "core_st_louis_2009")) |>
  dplyr::arrange(name)
#> # A tibble: 38 × 3
#> name id meta
#> <chr> <chr> <list>
#> 1 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> 2 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> 3 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> 4 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> 5 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> 6 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> 7 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> 8 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> 9 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> 10 crime_open_database_core_st_louis_2009.Rds 5f009039af1156016a3b… <named list>
#> # ℹ 28 more rows
Created on 2023-11-06 with reprex v2.0.2
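The two symptoms above (missing names and repeated names) can be checked in one step by comparing the vector of names from `osf_ls_files()` with a reference vector, such as the one from the direct API call. A minimal sketch in base R; `check_listing()` is a hypothetical helper, and the file names in the example are shortened placeholders rather than real output:

```r
# Hypothetical helper: compare the names returned by a listing call with a
# reference set of names (e.g. obtained from the API directly).
check_listing <- function(listed, expected) {
  counts <- table(listed)
  repeated <- counts[counts > 1]
  list(
    # Names that should be present but were not returned.
    missing = sort(setdiff(expected, listed)),
    # Names returned more than once, with the number of occurrences.
    duplicated = setNames(as.integer(repeated), names(repeated))
  )
}

# Placeholder data for illustration only.
listed   <- c("mesa_2016.Rds", "st_louis_2009.Rds", "st_louis_2009.Rds")
expected <- c("mesa_2016.Rds", "mesa_2022.Rds", "st_louis_2009.Rds")
result <- check_listing(listed, expected)
```

With real data, `listed` would be the `name` column of the `osf_tbl_file` and `expected` the names returned by the API.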
Nothing gets written to the file specified in `OSF_LOG` when creating a reprex (presumably because the reprex is run in a separate session), so the following is the contents of the log file produced when running the same code as above directly after running the reprex:
Two final things to note:
[…] OSF_PAT=osfr.log" but I suspect this should say "You can also enable logging by defining OSF_LOG to point to a logfile. For example: OSF_LOG=osfr.log".

Hi, we've recently been running into this same issue (missing files and folders in `osf_ls_files()`) in some projects where we host a large amount of raw data on OSF. Are there any updates on what might be causing the issue? Happy to provide details if this is actively being investigated. Thanks!
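While the cause is unclear, one way to cross-check a listing is to page through the API response directly, as in the `jsonlite::fromJSON()` call earlier in the thread. A minimal sketch, assuming each page of the response carries a `links$next` URL until the last page; `list_names_via_api()` and its `fetch` and `max_pages` arguments are hypothetical names, not part of osfr:

```r
# Hypothetical helper: follow the `links$next` field until it is absent,
# collecting file names from every page.
# `fetch` defaults to jsonlite::fromJSON (as used earlier in this thread); it is
# an argument so the function can be exercised without network access.
list_names_via_api <- function(url, fetch = jsonlite::fromJSON, max_pages = 1000) {
  file_names <- character()
  pages_read <- 0
  while (!is.null(url) && !is.na(url) && pages_read < max_pages) {
    page <- fetch(url)
    file_names <- c(file_names, page$data$attributes$name)
    # NULL on the last page, which ends the loop.
    url <- page$links[["next"]]
    pages_read <- pages_read + 1
  }
  file_names
}
```

For the folder discussed here, the starting URL would be `https://api.osf.io/v2/nodes/zyaqn/files/osfstorage/5bbde32b7cb18100193c778a/`. The `max_pages` cap is only a guard against an endless loop if the `next` links ever cycle.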
`osf_ls_files()` appears not to list some files in osfr version 0.2.8. The missing files are consistent (i.e. if I run the code again, the same files are missing), but otherwise I can't see any pattern in why some files are returned and not others. According to the osfr documentation, setting `n_max = Inf` should ensure all files are returned, but this does not appear to work.
Created on 2022-07-23 by the reprex package (v2.0.1)
Session info: