Closed robkooper closed 5 years ago
A potential problem discovered by the NDS team is that an update was made to the docker image by @dlebauer (https://github.com/terraref/toolserver-dockerfiles/commit/aa31ca7be689a2c3aaeaaa43d53dda51e4437473) which increased the version number to 3.5.1. It looks like this now requires a different password than rstudio.
Either way we should make sure we set username and password for Rstudio so we don't run into this problem in the future.
Also, Mike/Craig believe this might be a permissions issue, which is fixed in the newer version.
From the original report, it was not clear what exactly was the problematic behavior - "not working" has many definitions that vary from person to person. In the future, providing clear reproducible steps to recreate the issue is definitely a more helpful way to ask for assistance.
This feels like invalid/leftover state to me... In the linked R forum post (which I had not seen before), one user states that in their instance, the issue stemmed from an invalid or missing $HOME/.rstudio directory.
The error that I am seeing in the console seems to indicate a version change of R took place, which may have put things into an invalid state:
> library(traits)
> knitr::opts_chunk$set(echo = FALSE, cache = TRUE)
> library(ggplot2)
> library(ggthemes)
> library(GGally)
> theme_set(theme_bw())
> library(dplyr)
> library(httr)
> .libPaths('~/R/library')
>
> options(betydb_key = readLines('~/.betykey_public', warn = FALSE),
+ betydb_url = "https://terraref.ncsa.illinois.edu/bety-test/",
+ betydb_api_version = 'beta')
Error in file(con, "r") : cannot open the connection
23 May 2018 21:38:27 [rsession-rstudio] ERROR session hadabend; LOGGED FROM: rstudio::core::Error {anonymous}::rInit(const rstudio::r::session::RInitInfo&) /home/ubuntu/rstudio/src/cpp/session/SessionMain.cpp:563
R version change [3.4.0 -> 3.4.2] detected when restoring session; search path not restored
09 Aug 2018 15:06:14 [rsession-rstudio] ERROR session hadabend; LOGGED FROM: rstudio::core::Error {anonymous}::rInit(const rstudio::r::session::RInitInfo&) /home/ubuntu/rstudio/src/cpp/session/SessionMain.cpp:563
09 Aug 2018 15:16:07 [rsession-rstudio] ERROR r error 4 (R code execution error) [errormsg=]; OCCURRED AT: rstudio::core::Error rstudio::r::exec::executeSafely(rstudio_boost::function<void()>) /home/ubuntu/rstudio/src/cpp/r/RExec.cpp:212; LOGGED FROM: void rstudio::session::{anonymous}::processEvents() /home/ubuntu/rstudio/src/cpp/session/SessionHttpMethods.cpp:91
>
> clear
Error: object 'clear' not found
[2018-10-19 20:18:41] [error] handle_read_frame error: websocketpp.transport:7 (End of File)
[2018-10-19 20:18:41] [fatal] handle_write_frame error: websocketpp.transport:10 (A transport action was requested after shutdown)
[2018-10-19 21:38:26] [error] handle_read_frame error: websocketpp.transport:7 (End of File)
17 Dec 2018 17:34:53 [rsession-rstudio] ERROR session hadabend; LOGGED FROM: rstudio::core::Error {anonymous}::rInit(const rstudio::r::session::RInitInfo&) /home/ubuntu/rstudio/src/cpp/session/SessionMain.cpp:563
R version change [ -> 3.4.2] detected when restoring session; search path not restored
The line reading "R version change [3.4.0 -> 3.4.2] detected when restoring session; search path not restored" stood out to me and, although I don't understand R or what most of this output is supposed to be doing, I tried mapping in an empty directory for /home/rstudio/.rstudio.
Sure enough after starting up, RStudio showed no errors in the console after logging in. But without a clear reproducible test case, I cannot be sure that this has actually done anything to fix the problem.
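For anyone who wants to repeat the "start with empty state" experiment without remapping a volume, here is a minimal sketch that moves an existing state directory aside instead of deleting it. The helper name reset_rstudio_state and the .bak suffix are assumptions made for illustration; they are not part of RStudio or Workbench.

```shell
# Sketch: move an RStudio state directory aside so the next session starts clean.
# reset_rstudio_state is a hypothetical helper name, not an RStudio command.
reset_rstudio_state() {
  state_dir="${1:-$HOME/.rstudio}"
  if [ ! -d "$state_dir" ]; then
    echo "no state directory at $state_dir"
    return 0
  fi
  # Keep the old state under a timestamped name so it can be inspected later.
  backup="${state_dir}.bak.$(date +%Y%m%d%H%M%S)"
  mv "$state_dir" "$backup" && echo "moved $state_dir to $backup"
}

# Usage (inside the container, as the rstudio user):
#   reset_rstudio_state /home/rstudio/.rstudio
```

Keeping the old directory around makes it possible to compare it against a fresh one if the error comes back.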
@bodom0015 I tried to clarify rob's original report in the post above. It is the 'cannot connect to service' issue that is concerning. It is intermittent but is an error that makes the workbench unwelcoming to new users.
I have not seen the R version change errors in the past, and the terra rstudio has been building off of the rocker/geospatial:3.4.2 for almost a year. Currently I can't reproduce the error since I am getting 502 errors.
And failure-to-launch errors seem to happen intermittently on other applications as well. For example, I received this in an email last Friday:
The file manager definitely errored out. I think the Jupyter notebook opened a new tab but the “wheel” just spun. Just now, Rstudio actually open up a tab and let me log in with rstudio/rstudio but it wasn’t able to connect to service.
@bodom0015 also I just tried to run the code you sent and I got the 'cannot open connection' error (which is expected if you don't have a ~/.betykey_public file) but not the error related to the version of R (and I haven't seen that one before). Do you by chance have non-default debugging options set?
I am reusing the demo account on the terraref system for testing - I do not have an account of my own. While I haven't explicitly set any additional configuration, I don't know what configuration stuff is buried in that account's $HOME/.rstudio directory.
We seem to be conflating at least 2 issues here:
These, ideally, should be filed as separate issues, as they are likely unrelated and will probably both have distinct fixes.
I have not yet experienced any issues starting up a fresh instance of RStudio, except for one case: if Workbench needs to pull a new Docker image that is very large. This sounds like your reported issue where the "wheel just spun" - this is a known issue with Nebula where these image pulls time out before finishing. In later versions of Kubernetes, a --image-pull-progress-deadline flag has been added that can help with these errors, but the underlying issue with Nebula storage performance persists.
The only other odd performance thing I've seen comes from general network connectivity errors or timeouts. For example:
I should also note that if I reach this point, refreshing the page will get me a working RStudio instance.
Thanks to @bodom0015 for looking into this. I've also confirmed that I can start Rstudio under my account without error this morning. @dlebauer -- are you seeing this specific error with your instance of Rstudio?
As noted above, there are multiple issues reported here that need to be distinguished:
The "Unable to connect to service" is a generic error from Rstudio with multiple underlying causes. The most recent cause that I reported to @dlebauer was a permissions/ownership problem on the .rstudio directory. I've confirmed this morning that ownership looks OK for all current users. It would be helpful to have specific usernames to confirm whether this is indeed the cause and if this is the specific error they are experiencing. As noted in the previous comment, this can also occur when the Rstudio client times out connecting to the running service -- which has numerous causes. We should also not conflate this specific Rstudio error with other WB errors, if possible.
As noted by @bodom0015, the remaining errors are related to running on Nebula. As with other Terra services, WB requires constant monitoring and intervention. All Workbench services run on node1 and node2. Common problems include node failures, filesystem failures, NFS failures, updates to images requiring pulls, etc. If the nodes are in a NotReady state, then users will likely experience problems accessing WB services.
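A quick way to spot the NotReady case is to filter the output of kubectl get nodes. The following is only a sketch: the helper name not_ready_nodes is made up for illustration, and it assumes the usual column layout where STATUS is the second column.

```shell
# Reads `kubectl get nodes` output on stdin and prints the names of nodes
# whose STATUS does not start with "Ready" (e.g. NotReady).
# not_ready_nodes is a hypothetical helper name, not a kubectl subcommand.
not_ready_nodes() {
  awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage (requires access to the cluster):
#   kubectl get nodes | not_ready_nodes
```

An empty result means every node reports a status beginning with Ready; it says nothing about NFS or filesystem health, which would need separate checks.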
Just to put this in perspective -- this morning I noticed the jupyter images were not present on node1 and node2 and the basic docker pull has taken > 1 hour to complete on both nodes. One solution we have for the extractor images is to add a cron task that pulls the images periodically. I thought I had put this in place on the WB nodes, but apparently it's only on the extractor nodes:
For example, on node3:
This would ensure that the images were pulled outside of Kubernetes to overcome the timeout limitation.
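As a rough sketch of the idea (this is not the script deployed on the extractor nodes), a periodic job could simply loop over a list of images and pull each one. The function name cache_images, the DOCKER override, and the script path in the cron comment are all assumptions for illustration.

```shell
# Sketch: pull each image given as an argument so it is already cached locally
# when Kubernetes needs it. Returns non-zero if any pull failed.
# DOCKER can be overridden for a dry run, e.g. DOCKER="echo docker".
cache_images() {
  failed=0
  for image in "$@"; do
    ${DOCKER:-docker} pull "$image" || failed=$((failed + 1))
  done
  [ "$failed" -eq 0 ]
}

# Usage:
#   cache_images rocker/geospatial:3.4.2
# Hypothetical cron entry wrapping the function in a script, run hourly:
#   0 * * * * /usr/local/bin/cache-images.sh
```

Because the pull runs outside the kubelet, it is not subject to the image pull deadline, although it is still limited by the underlying storage performance.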
The system is overdue for a Kubernetes upgrade, which will require complete re-provisioning and reconfiguration. This might improve things. I don't know who would have cycles to take this on. Upgrading Kubernetes would also allow for parallel deployment of JH, if you want to consider it as an alternative.
I've enabled the cache-images timer on node1 and node2. Ideally this will reduce the number of problems related to docker pull.
I still receive the attached when navigating to Rstudio from Analysis Workbench Applications
Thank you @phloz. And that is indeed the permissions problem.
18 Dec 2018 18:52:50 [rsession-rstudio] ERROR system error 13 (Permission denied) [path=/home/rstudio/.rstudio, target-dir=]; OCCURRED AT: rstudio::core::Error rstudio::core::FilePath::createDirectory(const string&) const /home/ubuntu/rstudio/src/cpp/core/FilePath.cpp:846; LOGGED FROM: int main(int, char* const*) /home/ubuntu/rstudio/src/cpp/session/SessionMain.cpp:1689
I believe I've fixed your account and I now understand how to repeat this. Let me know if otherwise.
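To check other accounts for the same "Permission denied" condition, something like the following could list .rstudio directories that are not owned by the expected user. This is a sketch under assumptions: the helper name find_misowned_state is hypothetical, the directory layout (one home directory per user directly under a base path) is assumed, and the uid 1000 in the usage line is taken from the chown command mentioned elsewhere in this thread.

```shell
# Sketch: list .rstudio directories under a base path that are NOT owned by
# the expected uid. find_misowned_state is a hypothetical helper name.
find_misowned_state() {
  base="$1"
  uid="$2"
  # Depth 2 covers <base>/<user>/.rstudio; adjust if the layout differs.
  find "$base" -maxdepth 2 -type d -name .rstudio ! -user "$uid"
}

# Usage (the base path here is a placeholder, not the real Workbench path):
#   find_misowned_state /path/to/user/homes 1000
```

No output means every .rstudio directory found is owned by the expected uid.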
I just tried to launch an Rstudio instance that may pre-date the fix above, but encountered the 'unable to connect to service' error again
https://s89gru-rstudio-8787.workbench.terraref.org
I shut it down, restarted, and then it logged in and works. However, this is the type of error that should be hidden from a regular user; i.e. if it won't be possible to connect to service the application should be shut down (in the red state rather than the green ready to launch state) in workbench.
It's possible that the 'unable to connect' instance was killed when node1 and node2 were rebooted in mid-December after some issues with the kubelet service.
Just ran into the same 'unable to connect' error (https://github.com/terraref/computing-pipeline/issues/545#issuecomment-451250288) again trying to launch a brand new Rstudio application.
This is not the same problem as described above (permissions). Rstudio reports the "unable to connect" for a variety of reasons.
Looking at the logs, it appears to be exiting because of the password issue discussed above:
$ kubectl get pods -n dlebauer
NAME READY STATUS RESTARTS AGE
s89gru-rstudio-zpt2d 1/1 Running 0 7d
sjch1p-rstudio-p5f2l 1/1 Running 0 1h
sn9prs-pgstudio-r0b2l 1/1 Running 0 7d
spe9x9-cloudcmd-g56qs 1/1 Running 0 16m
$ kubectl logs -n dlebauer sjch1p-rstudio-p5f2l
[fix-attrs.d] applying owners & permissions fixes...
[fix-attrs.d] 00-runscripts: applying...
[fix-attrs.d] 00-runscripts: exited 0.
[fix-attrs.d] done.
[cont-init.d] executing container initialization scripts...
[cont-init.d] add: executing...
Nothing additional to add
[cont-init.d] add: exited 0.
[cont-init.d] userconf: executing...
ERROR: You must set a unique PASSWORD (not 'rstudio') first! e.g. run with:
docker run -e PASSWORD=<YOUR_PASS> -p 8787:8787 rocker/rstudio
[cont-init.d] userconf: exited 1.
[cont-init.d] done.
[services.d] starting services
[services.d] done.
You may be able to work around this by adding a PASSWORD environment variable to your instance. Otherwise, this will need to be fixed in https://github.com/terraref/workbench-catalog/blob/master/terraref/rstudio.json.
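For testing the image outside of Workbench, the container log above suggests the shape of the command. The sketch below only builds and prints that command after checking the password against the rule the log states (it must not be 'rstudio'); the helper name rstudio_run_cmd is hypothetical, and how the variable is set in the Workbench catalog spec is not shown here.

```shell
# Sketch: print the docker run command for rocker/rstudio with a PASSWORD set,
# refusing the default value that the container's userconf script rejects.
# rstudio_run_cmd is a hypothetical helper name.
rstudio_run_cmd() {
  password="$1"
  if [ -z "$password" ] || [ "$password" = "rstudio" ]; then
    echo "PASSWORD must be set and must not be 'rstudio'" >&2
    return 1
  fi
  echo "docker run -e PASSWORD=$password -p 8787:8787 rocker/rstudio"
}

# Usage:
#   rstudio_run_cmd 'choose-a-real-password'
```

Printing rather than executing keeps the example safe to run anywhere; the printed command can be pasted into a shell on a host with Docker available.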
This should probably be tracked as a separate issue.
We've opened two new issues to resolve the remaining issues: directories not appearing in /data (#548) and add PASSWORD envt var (#551)
This issue should probably remain open until an updated WB apiserver image has been tested and deployed. We have a fix in the works:
https://github.com/nds-org/ndslabs/compare/1.0.12...1.0.12-hotfix
This will need to be deployed to terra-ref in two steps:
ndslabs/apiserver:1.0.12-hotfix
chown -R 1000:100 /var/glfs/global/ on the WB nodes
The permissions problem currently only affects newly registered users. So after this is deployed, a test case would be to:
@craig-willis is this still in the works? We are waiting to close this until you push the new WB image, but obviously you aren't on TERRA anymore and we aren't tracking NDSLabs updates...
after talking with Rob, we are going to close unless you feel it should be reopened.
There are reports of rstudio no longer working in the workbench.
additional details from dlebauer: