ropensci / osfr

R interface to the Open Science Framework (OSF)
https://docs.ropensci.org/osfr

batch uploads error using walk #99

Open CEBerndsen opened 5 years ago

CEBerndsen commented 5 years ago

When using the osf_upload function in combination with purrr::walk, I received an inconsistent error: I could upload ~50 files with no problems.

Later, when I tried the same basic code on a much larger directory (~1,000 files), only 700 files uploaded before I received this message:

Error: Encountered an unexpected error with the OSF API
Please report this at https://github.com/aaronwolen/osfr/issues

Code that failed with error above:

  # create the destination folder, then upload each file in `files` to it
  osf_retrieve_node("3r7nw") %>%
    osf_mkdir(path = "2019-3-4 tetramer with amylose in KCl") %>%
    walk(files, osf_upload, x = .)
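
Here files is a character vector of local file paths, built along these lines (the path shown is illustrative):

  # illustrative setup: `files` holds the local paths to upload
  files <- list.files("2019-3-4 tetramer with amylose in KCl",
                      full.names = TRUE)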

Figuring it was a timeout issue, I adjusted the code and tried to complete the upload with:

  # retrieve the existing folder and retry, overwriting already-uploaded files
  osf_retrieve_node("3r7nw") %>%
    osf_ls_files() %>%
    filter(name == "2019-3-4 tetramer with amylose in KCl") %>%
    walk(files, osf_upload, x = ., overwrite = TRUE)

and got this error:

Error: Internal Server Error
HTTP status code 500.

Note: overwrite = FALSE also failed, which is why overwrite is set to TRUE here.

As I stated originally, the same basic code worked for 50 files, but larger uploads failed to fully complete.

I've enjoyed using the package and this won't stop me from using it, but ~1,000 files is a standard project size for me, so batch uploads that don't require the web interface would be really useful.

aaronwolen commented 5 years ago

Thanks for reporting.

The latest version on GitHub (v0.2.3) will re-attempt the request a few times if the API throws a 500 error. Can you try updating and see if the problem persists?
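
If updating isn't an option, a manual retry along these lines achieves a similar effect (a rough user-side sketch, not osfr's internal implementation; safe_upload is a hypothetical helper and dest is the folder returned by osf_mkdir()):

  # hypothetical helper: retry an upload a few times with backoff
  safe_upload <- function(dest, path, tries = 3) {
    for (i in seq_len(tries)) {
      res <- tryCatch(osf_upload(dest, path), error = function(e) e)
      if (!inherits(res, "error")) return(invisible(res))
      Sys.sleep(2^i)  # back off before the next attempt
    }
    warning("giving up on ", path)
  }

  walk(files, ~ safe_upload(dest, .x))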

I haven't tried uploading that many files before so I'll do a little testing on my end as well.

CEBerndsen commented 5 years ago

Updated to v0.2.3 and tried uploading again in two ways. First, I simply reran the batch code on the existing folder and got this error:

  osf_retrieve_node("3r7nw") %>%
    osf_ls_files() %>%
    filter(name == "2019-3-4 tetramer with amylose in KCl") %>%
    walk(files, osf_upload, x = ., overwrite = TRUE)

Error in data.matrix(data) : (list) object cannot be coerced to type 'double'
In addition: Warning messages:
1: In data.matrix(data) : NAs introduced by coercion
2: In data.matrix(data) : NAs introduced by coercion

So I then deleted the folder via the web interface and tried the first code from above, which makes the directory and then uploads files to it. I got the 500 error again and only 400 files uploaded.

  osf_retrieve_node("3r7nw") %>%
    osf_mkdir(path = "2019-3-4 tetramer with amylose in KCl") %>%
    walk(files, osf_upload, x = .)

Error: Internal Server Error
HTTP status code 500.

Let me know if I can try other approaches and help. Thanks!

aaronwolen commented 5 years ago

Thanks. I ran a couple of tests that attempted to upload 1,500 files and was able to reproduce the same error; unfortunately, sometimes it worked and sometimes it failed. I'm going to leave this open for now. HTTP status codes in the 5xx range correspond to "unexpected errors" on the server, so we may need to loop in one of the OSF devs to ultimately solve it. In the meantime, this highlighted some inefficiencies in osf_upload(); fixing them may partially address the issue by reducing the number of API calls made.

brianjgeiger commented 5 years ago

Are many of these files relatively small? Like in the "seconds or less to upload" size?

CEBerndsen commented 5 years ago

They tend to be 10 kB to 5 MB, just lots of them.

aaronwolen commented 5 years ago

Hi @brianjgeiger, thanks for checking into this. I used hundreds of small text files in my testing. Are you thinking it's a rate-limiting issue?

brianjgeiger commented 5 years ago

Hi, @aaronwolen. No, I think it's because we have an inefficiency or two in capturing provenance data for file uploads, which causes the thread to eventually time out. It should be fixed in an upcoming version, but I don't have a date for that yet. In the meantime, slowing down the requests will definitely keep you from seeing the error.
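
For example, something like this client-side sketch would throttle the uploads (illustrative only, not an osfr feature; dest is the destination folder and files is the vector of local paths from the examples above):

  # pause after each upload so requests don't pile up on the server
  walk(files, function(path) {
    osf_upload(dest, path)
    Sys.sleep(1)  # increase the delay if 500s persist
  })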

aaronwolen commented 5 years ago

Thanks for the info.

It should be fixed in an upcoming version

Is there a relevant PR or Issue I can monitor to determine when it's fixed?

In the meantime, do you have any recommendations for parameters I should use to moderate requests (e.g., delay X seconds after every Y files)?