rstudio / cloudml

R interface to Google Cloud Machine Learning Engine
https://tensorflow.rstudio.com/tools/cloudml/

Error setting up cloud instance #220

Open jnmaloof opened 3 years ago

jnmaloof commented 3 years ago

I am submitting my first job through cloudml, and the install step on the cloud instance fails.

Log from the Google Cloud console:

The replica master 0 exited with a non-zero status of 1. 
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-req-build-kgurwctg/setup.py", line 163, in <module>
    cmdclass         = { "install": CustomCommands }
  File "/opt/conda/lib/python3.7/site-packages/setuptools/__init__.py", line 161, in setup
    return distutils.core.setup(**attrs)
  File "/opt/conda/lib/python3.7/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 966, in run_commands
    self.run_command(cmd)
  File "/opt/conda/lib/python3.7/distutils/dist.py", line 985, in run_command
    cmd_obj.run()
  File "/tmp/pip-req-build-kgurwctg/setup.py", line 138, in run
    self.RunCustomCommandList(PIP_INSTALL_KERAS)
  File "/tmp/pip-req-build-kgurwctg/setup.py", line 119, in RunCustomCommandList
    self.RunCustomCommand(command, True)
  File "/tmp/pip-req-build-kgurwctg/setup.py", line 102, in RunCustomCommand
    raise RuntimeError(message)
RuntimeError: Command ['pip', 'install', 'h5py', 'pyyaml', 'requests', 'Pillow', 'scipy', '--upgrade'] failed: exit code 1

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=894990183050&resource=ml_job%2Fjob_id%2Fcloudml_2021_02_18_030427919&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22cloudml_2021_02_18_030427919%22

jnmaloof commented 3 years ago

To explore this a bit more, I:

1) Installed cloudml from this repository rather than from CRAN

2) Ran the example MNIST script:

library(cloudml)
dir.create("mnist-train")
file.copy(system.file("examples/mnist/train.R", package = "cloudml"), "mnist-train")
setwd("mnist-train")
cloudml_train()

The first error that pops up is probably not consequential:

ERROR: You have configured your Cloud SDK installation to be fixed to version [220.0.0]. Make sure this is a valid archived Cloud SDK version.

But things seem to go wrong when installing Matrix 1.3-2, where I get:

curl: (22) The requested URL returned error: 404 Not Found
FAILED
Error in getSourceForPkgRecord(pkgRecord, srcDir(project), availablePackagesSource(repos = repos),  :
Failed to retrieve package sources for Matrix 1.3-2 from CRAN (internet connectivity issue?)
Calls: retrieve_packrat_packages ... restoreImpl -> playActions -> installPkg -> getSourceForPkgRecord
Execution halted
Command ['Rscript', '/root/.local/lib/python3.7/site-packages/cloudml-model/cloudml/deploy.R'] failed: exit code 1
Command '['python3', '-m', 'cloudml-model.cloudml.deploy', 'Rscript', '--job-dir', 'gs://jm-dl-r-2/r-cloudml/staging']' returned non-zero exit status 1.
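
One possibility, which the logs above do not confirm, is that Matrix 1.3-2 is no longer the current release on CRAN. In that case its source tarball would sit under the Archive path rather than directly under src/contrib, and a request for the src/contrib URL would return a 404. Below is a minimal R sketch that builds both candidate URLs so they can be checked by hand. The helper name cran_source_urls and the choice of mirror are assumptions for illustration; neither is part of cloudml or packrat.

```r
# Hypothetical helper (not part of cloudml or packrat): build the two
# locations where a CRAN source tarball for a package version can live.
cran_source_urls <- function(pkg, version,
                             repo = "https://cloud.r-project.org") {
  tarball <- sprintf("%s_%s.tar.gz", pkg, version)
  c(
    # Location used while this version is the current release
    current = sprintf("%s/src/contrib/%s", repo, tarball),
    # Location used once a newer release has replaced it
    archive = sprintf("%s/src/contrib/Archive/%s/%s", repo, pkg, tarball)
  )
}

urls <- cran_source_urls("Matrix", "1.3-2")
print(urls)
```

Opening each URL in a browser shows which one exists. A 404 on the first and a successful download on the second would be consistent with the error above, though it would not by itself explain why the restore step did not fall back to the archive.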

Full logs are attached in CSV and JSON format.

Archive.zip