Closed: jhovell closed this issue 6 years ago
Can you include the exact commands you are using?
I have the same issue as @jhovell. pycocotools is installed on my machine, so local train and eval work. But the Google Cloud eval job does not work, due to the pycocotools error. I had uploaded pycocotools to Google Cloud via the object_detection and slim packages, but that did not resolve the issue either.
Thanks @angersson. I basically just used the steps in this demo, which links to the "Running on Google Cloud Platform" docs in this repo.
Here is a bash script with the exact command I am running. The constants refer to private Google Cloud buckets used to store my config, eval and training data.
#!/bin/bash
TRAIN_DIR=raccoon-training-d475e1a4
PIPELINE_CONFIG_PATH=raccoon-config-d475e1a4/ssd_mobilenet_v1_pets.config
EVAL_DIR=raccoon-eval-d475e1a4
set -e
gcloud ml-engine jobs submit training object_detection_eval_`date +%s` \
--runtime-version 1.4 \
--job-dir=gs://${TRAIN_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.eval \
--region us-central1 \
--scale-tier BASIC_GPU \
-- \
--checkpoint_dir=gs://${TRAIN_DIR} \
--eval_dir=gs://${EVAL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
Same experience as @agnellodcosta... So you're saying there's no need to perform extra steps to package pycocotools and submit it with my job on Google Cloud; it's just supposed to work on Google Cloud ML Engine after compiling and installing locally on my Mac?
I have tried placing pycocotools in models/research, models/research/object_detection, and models/research/object_detection/metrics, re-running the python setup.py sdist && cd slim && python setup.py sdist commands each time, and seen the same error every time. I am also getting a runtime error saying that I'm using numpy ABI version 0xa when I should be using 0xb, even though I have added numpy==1.11 as a required dependency in models/research/setup.py.
I tried to install pycocotools by editing setup.py:
REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28', 'pycocotools>=2.0.0']
but it failed:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-0CFyea/pycocotools/setup.py", line 2, in <module>
    from Cython.Build import cythonize
ImportError: No module named Cython.Build
I think Cython is not being installed before pycocotools builds. Does anyone have an idea?
Run into the same issue on Google Cloud, any update on this?
Check this issue, #3431
@agnellodcosta your solution is almost the same as mine. Were you able to eval on the cloud successfully?
@bduman train works with the solution @agnellodcosta posted, but I cannot run eval. I still have issues with the pycocotools import. The error is as follows:
import pycocotools._mask as _mask
ImportError: No module named _mask
Any ideas?
I am getting similar errors. The documentation is definitely missing a step on how to install pycocotools correctly on gcloud. In my opinion it should be a separate package, just as object_detection and slim are. But building it as a package gives me this error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-SqFrWm-build/setup.py", line 23, in <module>
    cythonize(ext_modules)
  File "/root/.local/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 897, in cythonize
    aliases=aliases)
  File "/root/.local/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 777, in create_extension_list
    for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
  File "/root/.local/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 102, in nonempty
    raise ValueError(error_msg)
ValueError: 'pycocotools/_mask.pyx' doesn't match any files
This is because there seems to be a bug in the pycocotools setup script.
I managed to get eval to run by including pycocotools as a package:
1. Copy the common folder from cocoapi into the cocoapi/PythonAPI directory.
2. Edit the setup.py file in PythonAPI so that no reference to the common folder goes up a directory, e.g. change ../common to just common.
3. In PythonAPI/pycocotools, modify line 2 of the _mask.pyx file in the same way.
4. Tar the entire PythonAPI folder, rename the archive "pycocotools-2.0", and include this compressed file via the --packages flag when you submit the job to the gcloud ml-engine.
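Those steps can be sketched in shell. The layout below is only a mock of a cocoapi checkout (a real run would start from git clone https://github.com/cocodataset/cocoapi.git), and the sed pattern is just one way to rewrite the references:

```shell
set -e
# Mock of a cocoapi checkout; in practice start from:
#   git clone https://github.com/cocodataset/cocoapi.git
mkdir -p cocoapi/common cocoapi/PythonAPI/pycocotools
echo 'sources = ["../common/maskApi.c"]' > cocoapi/PythonAPI/setup.py
echo '# distutils: sources = ../common/maskApi.c' > cocoapi/PythonAPI/pycocotools/_mask.pyx

# 1. Copy common/ inside PythonAPI
cp -r cocoapi/common cocoapi/PythonAPI/common
# 2. Rewrite every ../common reference to the local common/ copy
sed -i 's|\.\./common|common|g' cocoapi/PythonAPI/setup.py cocoapi/PythonAPI/pycocotools/_mask.pyx
# 3. Archive PythonAPI for the --packages flag
tar -C cocoapi -czf pycocotools-2.0.tar.gz PythonAPI
```

The resulting pycocotools-2.0.tar.gz is what gets appended to the --packages list of the gcloud job submission.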
Eval is now running, but it seems to be stuck on the first checkpoint it evaluates from the train directory. New checkpoints are saved, but the log message Found already evaluated checkpoint. Will try again in 300 seconds is repeated on every evaluation attempt, even when new checkpoints are there.
Has anyone hit or dealt with this issue?
I had the same issue. As suggested here, setting the environment variable GCS_READ_CACHE_MAX_SIZE_MB to 0 helped me.
I simply added these two lines to the object_detection/__init__.py file:
import os
os.environ['GCS_READ_CACHE_MAX_SIZE_MB'] = '0'
However, I would appreciate any suggestion for a cleaner solution.
Hi @Joshbarrington, thanks for your solution. I tried your method, but it returned another error:
Command '['pip', 'install', '--user', '--upgrade', '--force-reinstall', '--no-deps', u'pycocotools-2.0']' returned non-zero exit status 1
I used the command tar -czvf pycocotools-2.0 PythonAPI/ to compress it, and --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,pycocotools-2.0 \ to attach the package.
I'm a noob in GCP ML; is there anything wrong with my way of doing it? Thanks
Hi @Joshbarrington, this is the error message with --packages pycocotools-2.0:
Could not find a version that satisfies the requirement pycocotools-2.0 (from versions: )
Thanks
@bduman I got the same issue as you; Google Cloud couldn't install Cython.Build.
@nehcgnem make sure the reference in the flag contains the .tar.gz extension and the path to the compressed file, such as:
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,path/to/pycocotools-2.0.tar.gz \
Also try adding 'Cython>=0.28.1' to the REQUIRED_PACKAGES in your setup.py.
@dvoram When adding those two lines to the file, I get the following corruption error:
W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at iterator_ops.cc:870 : Data loss: corrupted record at 147651135
Is this something you dealt with?
No, I actually encountered a different error:
I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 1.08566 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request (HTTP response code 502, error code 0, error message '')
And I "resolved" it by updating to runtime version 1.5.
Nevertheless, your problem seems to really be a file corruption. Is the file readable locally? Perhaps, recreating and reuploading may help...
Is your eval working fine now?
Changing the environment variable causes the corruption error to happen during eval of ckpt-0, whereas before this the evaluation would complete ckpt-0 but hang when moving to the latest one. So the tfrecord file should be fine?
Yes, my eval works fine, now.
You may try to evaluate or inspect the ckpt file locally to see if it is really corrupted.
Are you sure you use the same config file for both train and eval?
The test.record worked locally. I think it must be something to do with adding the environment variable, but unsure as to why.
@Joshbarrington's comment here is the right solution. I also made a pycocotools-2.0.tar.gz file, which can be downloaded from here.
Note that the file I made is not guaranteed to be synced with the latest cocoapi, so you may still want to build it yourself:
Edit: copying from @fchouteau's comment below.
Since Google Cloud ML Engine has no python-tk, you also have to modify the imports in coco.py:
import json
import time
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.collections import PatchCollection
from matplotlib.patches import Polygon
Hi,
In order to use the .tar.gz archive from @pkulzc you have to make the following modifications: add 'Cython>=0.28.1' to REQUIRED_PACKAGES in models/research/setup.py (you should also add matplotlib):
"""Setup script for object_detection."""
from setuptools import find_packages
from setuptools import setup
REQUIRED_PACKAGES = ['Pillow>=1.0', 'matplotlib','Cython>=0.28.1']
setup(
name='object_detection',
version='0.1',
install_requires=REQUIRED_PACKAGES,
include_package_data=True,
packages=[p for p in find_packages() if p.startswith('object_detection')],
description='Tensorflow Object Detection Library',
)
Since Google Cloud ML Engine has no python-tk, you also have to modify the imports in coco.py:
import json
import time
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.collections import PatchCollection
from matplotlib.patches import Polygon
And it should be working
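For what it's worth, the fix works because matplotlib selects its backend at the first pyplot import; a minimal check of this (assuming matplotlib is installed):

```python
# Sanity check for the Agg fix above: matplotlib locks in its backend the
# first time pyplot is imported, which is why use('Agg') must come first.
import matplotlib
matplotlib.use('Agg')            # select a backend that needs no python-tk or display
import matplotlib.pyplot as plt  # with the Tk backend this import fails on a headless box

print(matplotlib.get_backend())  # reports the Agg backend
```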
@fchouteau Yes that's right. Our next release will include these changes.
Awesome, thanks for the solution! Is the recommendation still to use the 1.2 runtime as described in the docs or has this been tested on newer runtimes (1.4 or 1.5)?
I am currently running runtime 1.6 + python 2.7. You need runtime >= 1.4 for certain tf.contrib.data functions
Interesting. It's worth noting that 1.2 still seems to be the supported version for the ODAPI. 1.4 was, at least for me, causing severe/blocking errors in training. This isn't related, but it's going to prevent me from running a more modern version than 1.2, though I'd like to for many reasons.
@pkulzc will your next release also be supported for ODAPI on GCE?
Running into this issue using TF 1.6 or 1.7 and Python 2.7. I tried what @fchouteau suggested and am still getting an error on GCP:
Command '['pip', 'install', '--user', '--upgrade', '--force-reinstall', '--no-deps', u'pycocotoolsv2-2.0.tar.gz']' returned non-zero exit status 1
I'm using the latest repo code.
@jhovell we will have a major release soon and that will support ODAPI on CMLE.
@aysark currently our API doesn't work with 1.2+ runtimes on CMLE due to a known grpc issue. You can either wait for our next release (likely in a month) or use my 1.2-compatible branch.
@pkulzc but I was able to successfully train on the 1.5 runtime? I just need to run the eval job.
@aysark The issue in training happens randomly, so you may have gotten lucky. Do you have this pycocotoolsv2-2.0.tar.gz uploaded? Make sure the name is correct.
@pkulzc yes, I have it in my dist folder, and I send it as part of my job submission command:
--packages dist/object_detection-0.1.tar.gz,dist/pycocotoolsv2-2.0.tar.gz,slim/dist/slim-0.1.tar.gz
Do I also need to do the new COCO API installation step in https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md#coco-api-installation ? Thanks.
@aysark Hmm, how did you get this pycocotools package? Did you follow my comment here? And no, you don't need to do that installation, as it is for local runs.
@pkulzc yes, I followed what you said. I am actually using your pycocotools-2.0.tar.gz file; I just renamed it, and I had to change the matplotlib import in coco.py to:
import matplotlib
matplotlib.use('Agg')
The full stack log is:
INFO 2018-05-15 16:03:23 -0700 ps-replica-0 Installing the package: gs://infinitone/train/packages/b669fb8b24a29d19763394307b74c407c09e029ed3a4fe72a738d2187506f507/pycocotoolsv2-2.0.tar.gz
INFO 2018-05-15 16:03:23 -0700 ps-replica-0 Running command: pip install --user --upgrade --force-reinstall --no-deps pycocotoolsv2-2.0.tar.gz
ERROR 2018-05-15 16:03:23 -0700 ps-replica-0 Traceback (most recent call last):
ERROR 2018-05-15 16:03:23 -0700 ps-replica-0 File "<string>", line 1, in <module>
ERROR 2018-05-15 16:03:23 -0700 ps-replica-0 IOError: [Errno 2] No such file or directory: '/tmp/pip-req-build-0ubV1Q/setup.py'
INFO 2018-05-15 16:03:23 -0700 ps-replica-0 Complete output from command python setup.py egg_info:
INFO 2018-05-15 16:03:23 -0700 ps-replica-0 ----------------------------------------
ERROR 2018-05-15 16:03:24 -0700 ps-replica-0 Traceback (most recent call last):
ERROR 2018-05-15 16:03:24 -0700 ps-replica-0 File "<string>", line 1, in <module>
ERROR 2018-05-15 16:03:24 -0700 ps-replica-0 IOError: [Errno 2] No such file or directory: '/tmp/pip-req-build-qWLBY1/setup.py'
INFO 2018-05-15 16:03:24 -0700 ps-replica-0 ----------------------------------------
ERROR 2018-05-15 16:03:24 -0700 ps-replica-0 Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-qWLBY1/
INFO 2018-05-15 16:03:24 -0700 ps-replica-0 Clean up finished.
I think I was able to run eval without making changes to coco.py; could you please try using the original package?
@pkulzc when I use just your original package, I get a different error:
ERROR 2018-05-15 19:46:37 -0700 service The replica ps 2 exited with a non-zero status of 1. Termination reason: Error.
ERROR 2018-05-15 19:46:37 -0700 service Traceback (most recent call last):
ERROR 2018-05-15 19:46:37 -0700 service [...]
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/object_detection/evaluator.py", line 24, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from object_detection import eval_util
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/object_detection/eval_util.py", line 28, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from object_detection.metrics import coco_evaluation
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/object_detection/metrics/coco_evaluation.py", line 20, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from object_detection.metrics import coco_tools
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/object_detection/metrics/coco_tools.py", line 47, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from pycocotools import coco
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/pycocotools/coco.py", line 49, in <module>
ERROR 2018-05-15 19:46:37 -0700 service import matplotlib.pyplot as plt
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/matplotlib/pyplot.py", line 115, in <module>
ERROR 2018-05-15 19:46:37 -0700 service _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/matplotlib/backends/__init__.py", line 62, in pylab_setup
ERROR 2018-05-15 19:46:37 -0700 service [backend_name], 0)
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 4, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from . import tkagg # Paint image to Tk photo blitter extension.
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/matplotlib/backends/tkagg.py", line 5, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from six.moves import tkinter as Tk
ERROR 2018-05-15 19:46:37 -0700 service File "/usr/local/lib/python2.7/dist-packages/six.py", line 203, in load_module
ERROR 2018-05-15 19:46:37 -0700 service mod = mod._resolve()
ERROR 2018-05-15 19:46:37 -0700 service File "/usr/local/lib/python2.7/dist-packages/six.py", line 115, in _resolve
ERROR 2018-05-15 19:46:37 -0700 service return _import_module(self.mod)
ERROR 2018-05-15 19:46:37 -0700 service File "/usr/local/lib/python2.7/dist-packages/six.py", line 82, in _import_module
ERROR 2018-05-15 19:46:37 -0700 service __import__(name)
ERROR 2018-05-15 19:46:37 -0700 service File "/usr/lib/python2.7/lib-tk/Tkinter.py", line 42, in <module>
ERROR 2018-05-15 19:46:37 -0700 service raise ImportError, str(msg) + ', please install the python-tk package'
ERROR 2018-05-15 19:46:37 -0700 service ImportError: No module named _tkinter, please install the python-tk package
My bad, you do need to add matplotlib.use('Agg') to coco.py; sorry for the confusion.
From the error message, this looks like a pip version issue, and you probably want to talk to someone from GCP.
GCP has nothing to do with it...
I reverted to an older branch and eval works; not sure why the latest code regresses on basic functionality.
@aysark Which branch?
I've tried every suggestion here and have not been able to run the evaluation job successfully.
First I get ImportError: No module named pycocotools on the coco import.
So I try including the package above, but that gets me ImportError: No module named _tkinter, please install the python-tk package.
So I make the modifications to coco.py, and I get IOError: [Errno 2] No such file or directory: '/tmp/pip-req-build-qWLBY1/setup.py'
@pourhadi it's not a specific branch, I just went back in time. My HEAD is a couple of months back:
commit 3cb798fe02c9c627541e1d7f1816240a17dd02f3 (HEAD -> master)
Merge: f729a8c 95b0b03
Author: Chris Shallue <cshallue@users.noreply.github.com>
Date: Fri Jan 19 14:33:52 2018 -0800
Also note, I had to modify some files to fix some other things outlined in another issue (sorry, I can't recall which one), but that applies only if you are doing object detection.
I'm able to train successfully with the latest code, but I run my eval job with the old code.
@fchouteau did you add MANIFEST.in?
Tried your solution, but maskApi.h was not included in pycocotools-2.0.tar.gz when using python setup.py sdist, so the gcloud eval job won't work due to the missing maskApi.h. Hence:
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI && mv ../common ./
# Update all files that have a ../common reference: replace "../common" with "common"
# Add "REQUIRED_PACKAGES = ['Cython>=0.28.1']" to setup.py
# Update pycocotools/coco.py as laid out by @fchouteau
echo "graft ./common/" > MANIFEST.in
python setup.py sdist
Using setup.py sdist is probably closer to the way object detection is packaged.
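A minimal, runnable sketch of that sdist route; the mock files below stand in for a real cocoapi/PythonAPI checkout (with common/ already moved inside and the ../common references already rewritten as described above), and the graft line is the key part:

```shell
set -e
# Mock PythonAPI layout standing in for a real cocoapi/PythonAPI checkout.
mkdir -p PythonAPI/common
touch PythonAPI/common/maskApi.h PythonAPI/common/maskApi.c
cat > PythonAPI/setup.py <<'EOF'
from setuptools import setup
setup(name='pycocotools', version='2.0')
EOF
# The graft line is what pulls the common/ C sources into the sdist
echo "graft common" > PythonAPI/MANIFEST.in
(cd PythonAPI && python3 setup.py sdist)   # writes PythonAPI/dist/pycocotools-2.0.tar.gz
```

The produced dist/pycocotools-2.0.tar.gz now contains maskApi.h and can be passed to --packages directly.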
@pkulzc Do you know of any updates on this for object detection with ml-engine? I was able to train successfully, but eval is failing similarly to others. I'm using the modified pycoco library as well as a modified setup.py that includes Cython. This got me to: ImportError: No module named _tkinter, please install the python-tk package
Is this something that should be added via setup.py, or should it come preinstalled on the Cloud ML Engine instance image?
This is the command i'm running:
gcloud ml-engine jobs submit training `whoami`_object_detection_eval_`date +%s` \
--job-dir=${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,dist/pycocotools-2.0.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.eval \
--runtime-version 1.5 \
--region us-central1 \
--scale-tier BASIC_GPU \
-- \
--checkpoint_dir=${YOUR_GCS_BUCKET}/train \
--eval_dir=${YOUR_GCS_BUCKET}/eval \
--pipeline_config_path=${YOUR_GCS_BUCKET}/data/faster_rcnn_resnet101_custom.config
from setuptools import find_packages
from setuptools import setup
#import os
REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']
#os.environ["PATH"] += os.pathsep + '/root/.local/bin'
setup(
name='object_detection',
version='0.1',
install_requires=REQUIRED_PACKAGES,
include_package_data=True,
packages=[p for p in find_packages() if p.startswith('object_detection')],
description='Tensorflow Object Detection Library',
)
@bryantharpe I fixed that issue by doing the same thing @chris1869 did; I'd say that's worth a try.
@pourhadi I looked through that list, and it looks like everything in https://storage.googleapis.com/object-detection-dogfood/data/pycocotools-2.0.tar.gz already has the changes @chris1869 listed. I tried extracting it, rerunning setup.py on it, and recompressing it, but I'm still getting the same error. I've tried both the 1.5 and 1.6 runtimes.
I'm guessing I'm missing something tiny?
Thanks @pkulzc and @pourhadi, I forgot to add one import. It's running eval fine now.
The steps mentioned by @pkulzc helped solve the "No module named 'pycocotools'" issue, but after that I got out-of-memory errors: 'replica ps 0 ran out-of-memory and exited with a non-zero status of 247.'
When I got out-of-memory errors in training, I tweaked the cloud.yml provided via the --config parameter and it worked. In evaluation this parameter seems not to be taken into account?
My cloud.yml configuration:
trainingInput:
  runtimeVersion: "1.5"
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  workerCount: 3
  workerType: complex_model_m_gpu
  parameterServerCount: 3
  parameterServerType: complex_model_m
In eval you set scaleTier to BASIC_GPU instead of CUSTOM, so no config file is needed. See https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#scaletier
When following the step described here in the documentation an error is thrown about pycocotools being missing.
This issue is described here as well as in a Stack Overflow thread, but the workaround/fix described in either place is to install a platform-specific version of pycocotools locally. I'm skeptical that installing pycocotools on my Mac is going to fix this when running in Google Cloud ML Engine. At the very least, I'd expect to have to bundle some Linux variant as a package along with my job. Is there any documentation on how to achieve this, or is this step currently broken with Google Cloud ML Engine?