pachadotdev / analogsea

Digital Ocean R client
https://pacha.dev/analogsea/
Apache License 2.0

Inconsistent Behavior Creating + Destroying Droplets #190

Closed mscbuck closed 4 years ago

mscbuck commented 4 years ago

I've been seeing inconsistent behavior when creating and destroying a droplet. I have a nightly script that spins up a machine from a previously saved snapshot, runs a script on that machine, and then destroys the machine. Essentially the simplest possible script I can have. I'd say that 80% of the time this runs just fine, but the other 20% of the time I get some odd behavior. Below is my script. I added the Sys.sleep(120) calls in case I wasn't giving the droplet enough time to finish creating, but I'm not sure they are really needed with droplet_wait().

```r
library(plumber)
library(analogsea)
do_oauth()
d <- droplet_create(name = "rec-machine",
                    size = "g-16vcpu-64gb",
                    region = "nyc3",
                    image = xxxxxxx) %>%
  droplet_wait()
Sys.sleep(120)
droplet_ssh(d, "/var/scripts/rec_train_upload.sh")
Sys.sleep(120)
droplet_delete(d)
```

The errors I get are the following. They always happen after the machine has been created but before I make the droplet_ssh() call (otherwise I'd see output from my R script in the log file that I create):

Error: Failed to load data from database

Another one I will often see is:

The resource you were accessing could not be found

Lastly, on rare occasions the machine is created and the script runs (which I can verify from my logs and from the updated files in the repo where I deposit them), but the machine is never deleted, despite the explicit call. Any help or thoughts would be appreciated. It just seems like something in the script "loses track" of the droplet.

Session Info

```r
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plumber_0.4.6   analogsea_0.6.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1          later_0.8.0         digest_0.6.19
 [4] crayon_1.3.4        aws.signature_0.5.0 R6_2.4.0
 [7] jsonlite_1.6        magrittr_1.5        httr_1.4.0
[10] stringi_1.4.3       promises_1.0.1      xml2_1.2.0
[13] tools_3.6.0         aws.s3_0.3.12       httpuv_1.5.1
[16] yaml_2.2.0          compiler_3.6.0      base64enc_0.1-3
```

sckott commented 4 years ago

thanks for opening an issue @mscbuck

At first glance at those errors, I'd be surprised if they were analogsea related issues. Can you provide context around those errors? During what step do they happen?

Right, you shouldn't need Sys.sleep() if you use droplet_wait().

The problem with a droplet not being deleted probably is an analogsea problem. Can you replicate the problem of a droplet not being deleted with a minimal example? Is there an error message? warning? anything?


p.s. analogsea is at v0.7.2 https://cloud.r-project.org/web/packages/analogsea/ you may want to upgrade.

mscbuck commented 4 years ago

Thanks for responding! I would wager that you are correct in that it's not quite an analogsea issue, but just the way the DigitalOcean API is operating.

I put in some print statements to log the exact step where it's happening, but the only thing I can really point to is that the machine is always created, so the failure is either at the droplet_wait() step or somewhere between that and the droplet_ssh() command. One of the errors above occurs (it truly is random, I have not been able to replicate it), the script exits, and therefore it never gets to the droplet_delete() step. Even then, I had a handful of weird instances where neither error happened and the script on the remote machine ran without any errors, yet the droplet was still running instead of being deleted.
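
One way to keep a failed run from leaving the droplet behind is to register the delete as cleanup right after creation, so it runs even if a later step throws. Below is a minimal sketch, assuming a hypothetical helper `with_cleanup()` (not part of analogsea). The demo uses stand-in functions so it runs without touching the API, and the analogsea usage is shown only as a comment:

```r
# Hypothetical helper: `destroy(resource)` always runs once `create()` has
# succeeded, even when `work(resource)` signals an error.
with_cleanup <- function(create, work, destroy) {
  resource <- create()
  on.exit(destroy(resource), add = TRUE)
  work(resource)
}

# Stand-in demo (no network): the work step fails, cleanup still runs.
events <- character()
result <- tryCatch(
  with_cleanup(
    create  = function() { events <<- c(events, "create"); "droplet-1" },
    work    = function(d) stop("The resource you were accessing could not be found"),
    destroy = function(d) events <<- c(events, paste("delete", d))
  ),
  error = function(e) conditionMessage(e)
)
print(events)

# Possible use with analogsea (untested sketch; creation is kept separate
# from droplet_wait() so the droplet object exists before anything can fail):
# with_cleanup(
#   create  = function() droplet_create(name = "rec-machine", region = "nyc3"),
#   work    = function(d) {
#     d <- droplet_wait(d)
#     droplet_ssh(d, "/var/scripts/rec_train_upload.sh")
#   },
#   destroy = function(d) droplet_delete(d)
# )
```

Note that in the original script the result of `droplet_create() %>% droplet_wait()` is assigned in one step, so if `droplet_wait()` fails, `d` is never assigned and there is nothing left to pass to `droplet_delete()`.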

I will update this issue with more information as I get it. I did update analogsea so we will see if potentially that may fix something. I also had a thought that potentially something is happening with Rscript. I have never been able to replicate this just running the code manually, and it only seems to happen when it's a scheduled Rscript job.

sckott commented 4 years ago

The one possibility that comes to mind is that droplet_wait uses an R option do.wait_time (seconds to wait between pings to DO API). Its default value is 1 (1 second). So a request every second. It's possible you're reaching a rate limit. You could try setting that option to some larger integer.
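
Setting the option would look like this. A minimal sketch: `do.wait_time` is the option name given above, and 10 is only an example value.

```r
# Wait 10 seconds between status checks instead of the default 1.
# Set this before calling droplet_wait().
old <- options(do.wait_time = 10)

# Confirm the value that will be picked up (falls back to 1 if unset).
wait_time <- getOption("do.wait_time", 1)
print(wait_time)
```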

Let me know if you try this

mscbuck commented 4 years ago

EDIT: Even after changing do.wait_time to 10, I'm getting the same Error: The resource you were accessing could not be found. I will try setting this to a higher value again, as I tried to ssh immediately after the droplet was created and it was not connecting.


At least when it comes to the error "The resource you were accessing could not be found.", I think you may be on to something. Looking at my logs, I never got to my print statement, yet the droplet was created successfully. I will try editing the wait time option and see if this helps.

```r
d <- droplet_create(name = "rec-machine",
                    size = "g-16vcpu-64gb",
                    region = "nyc3",
                    image = xxxxxxx) %>%
  droplet_wait()

print('create done')
```

sckott commented 4 years ago

Hmm. Not sure. Have you tried re-creating what you are doing with analogsea functions in the Digital Ocean web interface and ssh'ing in to the instance? Curious if that fails too.

mscbuck commented 4 years ago

I am always able to successfully do everything using the web interface. Even after upping the do.wait_time option to 60 seconds, I still am getting the Error: The resource you were accessing could not be found. error. However, I run the script manually and it runs fine literally every time.

I am wondering if there is some very weird interaction between a scheduled cron job / Rscript and this. I have never been able to replicate this error by manually typing 'Rscript /var/scripts/spinup.R' in the terminal. It has only happened when scheduled as a cron-job

0 8 * * * Rscript --no-save --no-restore --verbose /var/scripts/spinup.R > /var/scripts/outputFile.Rout 2> /var/scripts/errorFile.Rout

mscbuck commented 4 years ago

Did not mean to close this!

sckott commented 4 years ago

the first thing that comes to mind with cron is that it doesn't know about env vars if you have any that the script depends on. Does the cron job work fine if not being run from analogsea?
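
A quick way to rule this out is to have the script fail loudly at the top when a variable it depends on is missing. This is a minimal sketch with a hypothetical helper `require_env()` and made-up variable names. `DO_PAT` is mentioned only in a comment, as the token variable analogsea documents; the script in this thread uses `do_oauth()` instead.

```r
# Hypothetical helper: stop early if an environment variable is not
# visible to this R session (cron jobs start with a minimal environment).
require_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop(sprintf("Environment variable %s is not set in this session", name))
  }
  value
}

# Demo with made-up variable names.
Sys.setenv(EXAMPLE_TOKEN = "abc123")
Sys.unsetenv("EXAMPLE_UNSET_VAR")
token <- require_env("EXAMPLE_TOKEN")
missing_msg <- tryCatch(
  require_env("EXAMPLE_UNSET_VAR"),
  error = function(e) conditionMessage(e)
)
print(missing_msg)

# In a cron-driven script one might call, for example:
# require_env("DO_PAT")
```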

mscbuck commented 4 years ago

EDIT: Nope, disregard everything. After 2 straight days where it runs fine, I got the dreaded Error: The resource you were accessing could not be found.


> the first thing that comes to mind with cron is that it doesn't know about env vars if you have any that the script depends on. Does the cron job work fine if not being run from analogsea?

No environment variables in this case, however, I may have stumbled upon a potential answer. It actually was something I needed to do in another script that used the mailR (I think) package to send out e-mails. Apparently, Rscript does not load up the methods package by default. This leads to a lot of unpredictability in certain scripts. I actually added library(methods) specifically to the script, and have not had an error for two days. I have no idea why not including this would cause random errors, but it's something I've seen with a few other packages.
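
Whether `methods` is attached in a given session can be checked directly instead of guessed. A small diagnostic sketch follows. Rscript is documented to attach `methods` by default since R 3.5.0, and the session info above shows R 3.6.0, so this is a check to run rather than a confirmed cause:

```r
# Report whether the methods package is on the search path in this session.
methods_attached <- "package:methods" %in% search()
message("R ", as.character(getRversion()),
        ", methods attached: ", methods_attached)

# Attaching it explicitly is harmless if it is already there.
if (!methods_attached) library(methods)
```

Logging this line from the cron-driven run and from a manual `Rscript` run would show whether the two sessions actually differ.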

sckott commented 4 years ago

that's good news. hopefully it keeps working. let me know if you think you're all sorted out

mscbuck commented 4 years ago

Just wanted to chime in that even after that I started to get some random errors. But ever since DigitalOcean put out a notice that they were having some issues with droplet creation and the API, and since fixing those, I have yet to have any issues. So the issue did seem to be on DO's end, though probably something really specific to API creation of droplets.
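
For transient API failures like these, a scheduled script could retry a failing step a few times before giving up. Here is a minimal sketch, assuming a hypothetical `retry()` helper (not an analogsea function). The flaky function is a stand-in that fails twice and then succeeds:

```r
# Hypothetical helper: call f() up to `attempts` times, sleeping
# `wait * i` seconds after the i-th failure, and re-signal the last error.
retry <- function(f, attempts = 3, wait = 0) {
  last_error <- NULL
  for (i in seq_len(attempts)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    if (i < attempts) Sys.sleep(wait * i)
  }
  stop(last_error)
}

# Stand-in for a flaky API call: fails on the first two calls.
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("The resource you were accessing could not be found")
  "ok"
}

value <- retry(flaky, attempts = 5)
print(value)
```

Whether retrying is safe depends on the step: repeating a status check is harmless, while repeating a create call could leave duplicate droplets behind.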

sckott commented 4 years ago

Okay, thanks for the update @mscbuck