tidyverse / googledrive

Google Drive R API
https://googledrive.tidyverse.org/

as_dribble() and drive_ls() get stuck in folders 3 levels deep #446

Closed by gorkang 12 months ago

gorkang commented 1 year ago

Hi there!

When trying to use {pins} with Google Drive, I am running into some issues. The latest seems to be related to {googledrive} having trouble finding a folder three levels deep, as in Level1/Level2/Level3/. My Google Drive has about 400K files and >1TB.

As you can see in the reprex below, as_dribble() and drive_ls() have no issues with one and two level deep folders, but adding a third level makes them get stuck.


# 1 level folder
path = paste0("pins-testing/")
googledrive::as_dribble(path)
#> # A dribble: 1 × 4
#>   name         path          id                                drive_resource   
#>   <chr>        <chr>         <drv_id>                          <list>           
#> 1 pins-testing pins-testing/ 1MStG1e73DoRO8rxGG93uRBSQIBfUS6ai <named list [34]>
googledrive::drive_ls(path)
#> # A dribble: 3 × 3
#>   name     id                                drive_resource   
#>   <chr>    <drv_id>                          <list>           
#> 1 pid_X    1H0tSPVNNDeUnkjIEOAKQeQTSL7g_-Jea <named list [34]>
#> 2 pid_999x 1VZQA3fLm3Vsp00Q1e_FaayBXkSChmCFB <named list [33]>
#> 3 pid_999  1yG-KuNSikiACquLAvZbm24B30-2whzm- <named list [33]>

# 2 levels folder
path = paste0("pins-testing/pid_X/")
googledrive::as_dribble(path)
#> # A dribble: 1 × 4
#>   name  path                id                                drive_resource   
#>   <chr> <chr>               <drv_id>                          <list>           
#> 1 pid_X pins-testing/pid_X/ 1H0tSPVNNDeUnkjIEOAKQeQTSL7g_-Jea <named list [34]>
googledrive::drive_ls(path)
#> # A dribble: 3 × 3
#>   name                   id                                drive_resource   
#>   <chr>                  <drv_id>                          <list>           
#> 1 SimpleName             1X0Bt8TXaoczBsKvUleI0MDG0EKlZtc-w <named list [34]>
#> 2 -                      1mlkMFIpcb3Pe0AyWwWJsLT7DpdocC1ZY <named list [33]>
#> 3 20230908T062343Z-db9b5 1_1b8j5glMUMBtmsvseMsETeKAHd-yDOi <named list [34]>

# 3 levels folder
path = paste0("pins-testing/pid_X/20230908T062343Z-db9b5/")

# These two get stuck forever
# googledrive::drive_ls(path)
# httr::with_verbose(googledrive::as_dribble(path))

# MANUALLY PASTED THIS HERE
#>-> GET /drive/v3/files?orderBy=recency%20desc&q=%28trashed%20%3D%20false%29%20and%20%28mimeType%20%3D%20%27application%2Fvnd.google-apps.folder%27%29&supportsAllDrives=TRUE&fields=nextPageToken%2C%2A&pageToken=~%21%21~AI9FV7TmS1p5A_fnD_ADi00BMVvamke8nm9NmPnV1O9_k9OlCbRbYMQV-SR0Q7gXzFCEADgbCVf37JHJvP-_dSgcJwHAWQDflgmFECOZRgE4UujQEJvgyYEUF1aAL8ZOPqNKJF5smipiwCGMVpAs0W5CxDkfXxOApNuKj8m1IlGIK8XMNPxsvayoYa0Yf-MgfqUi0okfcb2OKy_WmTQrHSbp8E6380yR1JLTaCS7dxU1P41PbZCfpsqiwiilf018rNC31ySclHMptSYC1lyC6dJJFYR9eub0r9tr4UetbEMJr7t_AULQHi8FMa0sNQmmgb2qxt-wT7NX6YbFitdjnVujYG8uajjA5w%3D%3D HTTP/2
#>-> Host: www.googleapis.com
#>-> user-agent: googledrive/2.1.1 (GPN:RStudio; ) gargle/1.5.2 httr/1.4.7
#>-> accept-encoding: deflate, gzip, br, zstd
#>-> accept: application/json, text/xml, application/xml, */*
#>-> authorization: Bearer [EDITED]
#><- HTTP/2 200 
#><- vary: Origin, X-Origin
#><- pragma: no-cache
#><- expires: Mon, 01 Jan 1990 00:00:00 GMT
#><- cache-control: no-cache, no-store, max-age=0, must-revalidate
#><- date: Fri, 08 Sep 2023 06:46:37 GMT
#><- content-type: application/json; charset=UTF-8
#><- content-encoding: gzip
#><- server: ESF
#><- content-length: 10065
#><- x-xss-protection: 0
#><- x-frame-options: SAMEORIGIN
#><- x-content-type-options: nosniff
#><- alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
#><- 

Created on 2023-09-08 with reprex v2.0.2

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.1 (2023-06-16)
#>  os       Ubuntu 22.04.3 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Atlantic/Canary
#>  date     2023-09-08
#>  pandoc   3.1.1 @ /usr/lib/rstudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  askpass       1.2.0   2023-09-03 [1] RSPM (R 4.3.0)
#>  cli           3.6.1   2023-03-23 [1] RSPM
#>  curl          5.0.2   2023-08-14 [1] RSPM (R 4.3.0)
#>  digest        0.6.33  2023-07-07 [1] CRAN (R 4.3.1)
#>  dplyr         1.1.3   2023-09-03 [1] RSPM (R 4.3.0)
#>  evaluate      0.21    2023-05-05 [1] CRAN (R 4.3.0)
#>  fansi         1.0.4   2023-01-22 [1] RSPM
#>  fastmap       1.1.1   2023-02-24 [1] RSPM
#>  fs            1.6.3   2023-07-20 [1] RSPM (R 4.3.0)
#>  gargle        1.5.2   2023-07-20 [1] RSPM (R 4.3.0)
#>  generics      0.1.3   2022-07-05 [1] RSPM
#>  glue          1.6.2   2022-02-24 [1] RSPM
#>  googledrive   2.1.1   2023-06-11 [1] CRAN (R 4.3.0)
#>  htmltools     0.5.6   2023-08-10 [1] RSPM (R 4.3.0)
#>  httr          1.4.7   2023-08-15 [1] RSPM (R 4.3.0)
#>  jsonlite      1.8.7   2023-06-29 [1] RSPM (R 4.3.0)
#>  knitr         1.43    2023-05-25 [1] RSPM (R 4.3.0)
#>  lifecycle     1.0.3   2022-10-07 [1] RSPM
#>  magrittr      2.0.3   2022-03-30 [1] RSPM
#>  openssl       2.1.0   2023-07-15 [1] RSPM (R 4.3.0)
#>  pillar        1.9.0   2023-03-22 [1] RSPM
#>  pkgconfig     2.0.3   2019-09-22 [1] RSPM
#>  purrr         1.0.2   2023-08-10 [1] RSPM (R 4.3.0)
#>  R.cache       0.16.0  2022-07-21 [1] RSPM
#>  R.methodsS3   1.8.2   2022-06-13 [1] RSPM
#>  R.oo          1.25.0  2022-06-12 [1] RSPM
#>  R.utils       2.12.2  2022-11-11 [1] RSPM
#>  R6            2.5.1   2021-08-19 [1] RSPM
#>  rappdirs      0.3.3   2021-01-31 [1] RSPM
#>  reprex        2.0.2   2022-08-17 [1] RSPM
#>  rlang         1.1.1   2023-04-28 [1] CRAN (R 4.3.0)
#>  rmarkdown     2.24    2023-08-14 [1] RSPM (R 4.3.0)
#>  rstudioapi    0.15.0  2023-07-07 [1] CRAN (R 4.3.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] RSPM
#>  styler        1.10.2  2023-08-29 [1] RSPM (R 4.3.0)
#>  tibble        3.2.1   2023-03-20 [1] RSPM
#>  tidyselect    1.2.0   2022-10-10 [1] RSPM
#>  utf8          1.2.3   2023-01-31 [1] RSPM
#>  vctrs         0.6.3   2023-06-14 [1] RSPM (R 4.3.0)
#>  withr         2.5.0   2022-03-03 [1] RSPM
#>  xfun          0.40    2023-08-09 [1] RSPM (R 4.3.0)
#>  yaml          2.3.7   2023-01-23 [1] RSPM
#>
#>  [1] /home/emrys/R/x86_64-pc-linux-gnu-library/4.3
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```

jennybc commented 12 months ago

I'm not working on googledrive right now, so this is a rather superficial response.

But a function such as drive_ls() is entirely a googledrive creation. The Drive API does not actually offer support for listing everything in a folder. So we are doing lots of recursive work inside googledrive, with many API calls, which is undoubtedly very slow. I can easily imagine that whatever you are trying to do ("3-levels deep folder", "My Google Drive has about 400K files") is running up against practical performance constraints of the rather naive implementation we have here.

So my very high-level advice is to approach this from a different angle.

You'll have to play around a bit, but the idea is to not create a request that forces googledrive to range over all 400K of your files trying to resolve a filepath with many components (folders within folders within folders).

Here is a horrible sketch (I'm sure this code does not work, but it should convey the idea):

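library(googledrive)
library(dplyr)  # only filter() is needed here

# Resolve the top-level folder once; this is the only path-based lookup
d1 <- drive_get("pins-testing/")
d1_listing <- drive_ls(d1)

# From here on, pass dribbles (which carry file IDs) instead of paths
d2 <- filter(d1_listing, name == "pid_X")
d2_listing <- drive_ls(d2)
d3 <- filter(d2_listing, name == "20230908T062343Z-db9b5")
d3_listing <- drive_ls(d3)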

The key idea is to provide file IDs whenever possible instead of a filepath. This is what happens when you specify a target folder with a dribble instead of just its filepath. The approach above does this in a rather boneheaded stepwise way, but hopefully it makes things more clear. There's probably something less ugly that will work, but that should get you started.
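For readers who land here with a deeper folder tree, the stepwise descent described above could be wrapped in a small loop. This is only a sketch: `descend_by_id()` and its `list_fn` argument are hypothetical names, not part of googledrive, and it assumes each folder name is unique within its parent. The real lister is `googledrive::drive_ls()`; it is passed in as an argument so the logic can be tried against a stand-in without touching Drive.

```r
# Hypothetical helper (not part of googledrive): walk down one folder at a
# time so every listing is scoped to a known folder, never a full path.
descend_by_id <- function(root, folder_names, list_fn = googledrive::drive_ls) {
  current <- root
  for (nm in folder_names) {
    # List only the direct children of the folder we already hold
    children <- list_fn(current)
    idx <- which(children$name == nm)
    if (length(idx) != 1) {
      stop("Expected exactly one child named '", nm, "', found ", length(idx))
    }
    # Keep the matching row; for a dribble this row carries the file ID
    current <- children[idx, , drop = FALSE]
  }
  current
}

# Possible usage (needs Drive auth, so it is left commented out):
# root   <- googledrive::drive_get("pins-testing/")
# target <- descend_by_id(root, c("pid_X", "20230908T062343Z-db9b5"))
# googledrive::drive_ls(target)
```

If the folder ID is already known (it is shown in the `id` column of the listings earlier in this thread), wrapping it in `googledrive::as_id()` and passing that to `drive_ls()` should skip name-based lookup altogether.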

gorkang commented 12 months ago

Thanks @jennybc

In the end, my impression is that @juliasilge solved the issue by using essentially the technique you hinted at here.

Thanks again!