rstudio / cloudml

R interface to Google Cloud Machine Learning Engine
https://tensorflow.rstudio.com/tools/cloudml/

Cloudml_train and job_collect #210

Open philipus opened 4 years ago

philipus commented 4 years ago

I have a problem running mnist_mlp.R (https://github.com/rstudio/keras/blob/master/vignettes/examples/mnist_mlp.R) with cloudml_train on Google Cloud Platform.

Even though the job on Google AI Platform runs properly, it does not finish automatically. Also, or perhaps because of that, job_collect does not copy any files into the local directory (runs). When I cancel the job manually on Google AI Platform, I do see the new job folder for the corresponding job.

So why the heck does the job run forever on Google AI Platform?

I think the download functionality does not work properly. The local runs directory that the mnist_mlp.R script normally produces is not created either. I think job_collect is the problem:

cloudml::job_collect('Project Name', destination = '../runs', view = 'save')

does not copy anything into the destination folder.

Any idea what we can do?

R commands:

library(cloudml)
cloudml_train("mnist_mlp.R", config = "config.yml")

config.yml:

trainingInput:
  scaleTier: BASIC
  runtimeVersion: "2.1"
  pythonVersion: "3.7"

herambgadgil commented 4 years ago

I had the same problem. The cause is the chunk below in path-to-library/cloudml/cloudml/cloudml/deploy.py:

# Stream output from subprocess to console.
for line in iter(process.stdout.readline, ""):
    sys.stdout.write(line.decode('utf-8'))

Once the execution is completed, this loop does not halt, so it keeps running indefinitely.

Resolution: comment out the above chunk in deploy.py and the job will finish successfully. Downside: you won't be able to see the step-by-step installation progress, so the logs won't give you a hint if there is an error in the script. The chunk below still checks whether the command executed successfully. Note that if there is an error in the script, the job will keep on running endlessly.

# Finalize the process.
stdout, stderr = process.communicate()

# Detect a non-zero exit code.
if process.returncode != 0:
  fmt = "Command %s failed: exit code %s"
  print(fmt % (commands, process.returncode))
else:
  print("Command %s ran successfully." % (commands, ))

Note: I am a novice in Python and cloud environments, so take my comments with a pinch of salt. :-)
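
One plausible explanation for the endless loop in the streaming chunk quoted above, assuming deploy.py runs under Python 3 and opens the subprocess with binary pipes (the thread does not show the Popen call, so this is an assumption): process.stdout.readline returns bytes, and at end of stream it returns b"". The sentinel passed to iter() is the str "", which never compares equal to b"" in Python 3, so the loop never stops. The sketch below keeps the streaming output and only changes the sentinel. The function name run_and_stream is hypothetical and not part of cloudml.

```python
import subprocess
import sys


def run_and_stream(commands):
    """Run a command, echo its output line by line, and return its exit code.

    Hypothetical helper for illustration; not part of the cloudml package.
    """
    # Assumption: pipes are opened in binary mode (the Popen default),
    # so every readline() call returns bytes.
    process = subprocess.Popen(
        commands,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
    )

    # The end-of-stream sentinel must be b"" to match the bytes returned
    # by readline(). With a str sentinel ("") the comparison is never true
    # in Python 3 and the loop would spin forever after the child exits.
    for line in iter(process.stdout.readline, b""):
        sys.stdout.write(line.decode("utf-8", errors="replace"))
        sys.stdout.flush()

    process.stdout.close()

    # Wait for the child to exit and report its exit code.
    return process.wait()
```

Calling run_and_stream([sys.executable, "-c", "print('hello')"]) echoes the child's output and returns its exit code, which can then feed the same non-zero exit code check shown in the comment above. This keeps the installation progress visible instead of commenting the loop out.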