rstudio / cloudml

R interface to Google Cloud Machine Learning Engine
https://tensorflow.rstudio.com/tools/cloudml/

Terminal crashes on Windows but job completes. #145

Open andrie opened 6 years ago

andrie commented 6 years ago

This may not be an R issue, but something on the CloudML end.

I received a crash report in the terminal, despite the job still running on CloudML.

This happens after submitting:

cloudml::cloudml_train(...)

Terminal output:

INFO    2018-04-11 17:03:18 +0100       master-replica-0                Copying gs://adv-cloudml-test-195616/r-cloudml/cache/ubuntu_16044_lts/r_3_4_4/r/hms.tar...
INFO    2018-04-11 17:03:18 +0100       master-replica-0                Copying gs://adv-cloudml-test-195616/r-cloudml/cache/ubuntu_16044_lts/r_3_4_4/r/cloudml.tar...
INFO    2018-04-11 17:03:18 +0100       master-replica-0                Copying gs://adv-cloudml-test-195616/r-cloudml/cache/ubuntu_16044_lts/r_3_4_4/r/digest.tar...
INFO    2018-04-11 17:03:18 +0100       master-replica-0                / [0/48 files][    0.0 B/ 61.0 MiB]   0% Done
IERROR: gcloud crashed (IOError): [Errno 0] Error

If you would like to report this issue, please run the following command:
  gcloud feedback

To check gcloud for common problems, please run the following command:
  gcloud info --run-diagnostics
>>> Job 'cloudml_2018_04_11_155929102' is currently running -- please wait...
>>> [state: RUNNING; last updated 2018-04-11 17:03:48]
Execution halted
Error in shell.exec(url) :
  'C:/Users/apdev/OneDrive/github/experiments/cloudml-deployment/runs/cloudml_2018_04_11_155929102/tfruns.d/view.html' not found
Calls: <Anonymous> -> shell.exec
Execution halted
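
Since the log above shows the job still in state RUNNING when the local gcloud client crashed, the job's state can be checked separately from the streamed logs. The following is only a sketch of that idea, not something cloudml does itself: `job_state` and `wait_for_job` are hypothetical helper names, the 30-second default interval is arbitrary, and it assumes a Cloud SDK release that provides the `gcloud ai-platform` command group (older releases used `gcloud ml-engine`).

```shell
# Print the current state of a training job (e.g. RUNNING, SUCCEEDED, FAILED).
# Assumption: the SDK has the `ai-platform` group; older releases used `ml-engine`.
job_state() {
  gcloud ai-platform jobs describe "$1" --format='value(state)'
}

# Poll until the job reaches a terminal state (hypothetical helper).
# $1 = job ID, $2 = seconds between polls (default 30, an arbitrary choice).
# Returns 0 on SUCCEEDED, 1 on FAILED/CANCELLED, 2 if gcloud itself fails.
wait_for_job() {
  while true; do
    state="$(job_state "$1")" || return 2
    case "$state" in
      SUCCEEDED) echo "$state"; return 0 ;;
      FAILED|CANCELLED) echo "$state"; return 1 ;;
    esac
    sleep "${2:-30}"
  done
}
```

For the job in the log above, this could be called as `wait_for_job cloudml_2018_04_11_155929102 60`. Polling with short, separate gcloud invocations avoids depending on one long-lived log stream, which is the part that appears to crash here.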

andrie commented 6 years ago

This still happens. Another terminal dump, in case it helps:

INFO    2018-04-24 14:54:35 +0100       master-replica-0                / [5/48 files][900.0 KiB/ 61.0 MiB]   1% Done
INFO    2018-04-24 14:54:35 +0100       master-replica-0                Copying gs://adv-cloudml-test-195616/r-cloudml/cache/ubuntu_16044_lts/r_3_4_4/r/packrat.tar...
INFO    2018-04-24 14:54:35 +0100       master-replica-0                / [6/48 files][  3.0 MiB/ 61.0 MiB]   4% Done
IERROR: gcloud crashed (IOError): [Errno 0] Error

javierluraschi commented 6 years ago

Most likely this is external, and we would need a consistent repro to open an issue with Google CloudML. I've seen this a couple of times, but I can't reproduce it consistently.
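
Even without a consistent repro, it may help to record the gcloud environment each time the crash occurs, so that reports can be compared. A minimal sketch, assuming only the two commands that exist in the Cloud SDK (`gcloud version` and the `gcloud info --run-diagnostics` command suggested in the crash message); the `collect_gcloud_report` function name and the `REPORT` output path are hypothetical.

```shell
# Collect gcloud environment details into one file for a bug report.
# REPORT is a hypothetical output path; override it as needed.
REPORT="${REPORT:-gcloud-crash-report.txt}"

collect_gcloud_report() {
  {
    echo "== gcloud version =="
    gcloud version
    echo "== gcloud info --run-diagnostics =="
    gcloud info --run-diagnostics
  } > "$REPORT" 2>&1
  # Print the path so the caller knows where the report was written.
  echo "$REPORT"
}
```

Running `collect_gcloud_report` right after a crash writes both outputs, including anything sent to stderr, to the report file and prints its path.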

philipus commented 4 years ago

I got the same problem when running mnist_mlp.R (https://github.com/rstudio/keras/blob/master/vignettes/examples/mnist_mlp.R) with cloudml_train on Google Cloud Platform.

I think the download functionality does not work properly. I also don't get a local runs directory, which I would expect from running the mnist_mlp.R script. I think job_collect is the problem:

cloudml::job_collect('Project Name', destination = '../runs', view = 'save')

does not copy anything into the destination folder.

Any idea what we can do?

R commands:

library(cloudml)
cloudml_train("mnist_mlp.R", config = "config.yml")

config.yml:

trainingInput:
  scaleTier: BASIC
  runtimeVersion: "2.1"
  pythonVersion: "3.7"

philipus commented 4 years ago

> Most likely, this is external and we would need a consistent repro to open an issue with Google CloudML. I've seen this a couple times, but I can't hit this consistently.

Has there been any progress here? I just saw that this issue has been open for a long time.