Closed: jhovell closed this issue 6 years ago
Can you include the exact commands you are using?
I have the same issue as @jhovell. pycocotools is installed on my machine, so local train and eval work. But the Google Cloud eval job does not work, due to the pycocotools error. I had uploaded pycocotools to Google Cloud via the object_detection and slim packages, but that did not resolve the issue either.
Thanks @angersson. I basically just used the steps in this demo, which links to the "Running on Google Cloud Platform" docs in this repo.
Here is a bash script with the exact command I am running. The constants refer to private Google Cloud buckets used to store my config, eval and training data.
#!/bin/bash
TRAIN_DIR=raccoon-training-d475e1a4
PIPELINE_CONFIG_PATH=raccoon-config-d475e1a4/ssd_mobilenet_v1_pets.config
EVAL_DIR=raccoon-eval-d475e1a4
set -e
gcloud ml-engine jobs submit training object_detection_eval_`date +%s` \
--runtime-version 1.4 \
--job-dir=gs://${TRAIN_DIR} \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.eval \
--region us-central1 \
--scale-tier BASIC_GPU \
-- \
--checkpoint_dir=gs://${TRAIN_DIR} \
--eval_dir=gs://${EVAL_DIR} \
--pipeline_config_path=gs://${PIPELINE_CONFIG_PATH}
Same experience as @agnellodcosta... So you're saying there's no need to perform extra steps to package pycocotools and submit it with my job on Google Cloud; it's just supposed to work on Google Cloud ML Engine after compiling and installing locally on my Mac?
I have tried placing pycocotools in models/research, models/research/object_detection, and models/research/object_detection/metrics, re-running the python setup.py sdist && cd slim && python setup.py sdist commands each time, and seen the same error every time. I am also getting a runtime error saying that I'm using numpy ABI version 0xa when I should be using 0xb, even though I have added numpy==1.11 as a required dependency in models/research/setup.py.
I tried to install pycocotools by editing setup.py:
REQUIRED_PACKAGES = ['Pillow>=1.0', 'Matplotlib>=2.1', 'Cython>=0.28', 'pycocotools>=2.0.0']
but it failed:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-build-0CFyea/pycocotools/setup.py", line 2, in <module>
    from Cython.Build import cythonize
ImportError: No module named Cython.Build
I think Cython is not being installed before pycocotools builds. Does anyone have an idea?
Run into the same issue on Google Cloud, any update on this?
Check this issue, #3431
@agnellodcosta your solution is almost the same as mine. Were you able to eval on the cloud successfully?
@bduman train works with the solution @agnellodcosta posted, but I cannot run eval. I still have issues with the pycocotools import. The error is as follows:
import pycocotools._mask as _mask
ImportError: No module named _mask
Any ideas?
I am getting similar errors. The documentation is definitely missing a step on how to install pycocotools correctly on gcloud. In my opinion it should be a separate package, just as object_detection and slim are. But building it as a package gives me this error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-SqFrWm-build/setup.py", line 23, in <module>
    cythonize(ext_modules)
  File "/root/.local/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 897, in cythonize
    aliases=aliases)
  File "/root/.local/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 777, in create_extension_list
    for file in nonempty(sorted(extended_iglob(filepattern)), "'%s' doesn't match any files" % filepattern):
  File "/root/.local/lib/python2.7/site-packages/Cython/Build/Dependencies.py", line 102, in nonempty
    raise ValueError(error_msg)
ValueError: 'pycocotools/_mask.pyx' doesn't match any files
This is because there seems to be a bug in the pycocotools setup script.
I managed to get eval to run by including pycocotools as a package:
1. Copy the common folder from cocoapi into the cocoapi/PythonAPI directory.
2. Edit the setup.py file in PythonAPI so that no reference to the common folder goes up a directory, e.g. change ../common to just common.
3. In PythonAPI/pycocotools, modify line 2 of the _mask.pyx file in the same way.
4. Tar the entire PythonAPI folder, rename the archive "pycocotools-2.0", and include this compressed file via the --packages flag when you submit the job to the gcloud ml-engine.
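Those steps can be sketched in shell. The layout below is only a mock of a cocoapi checkout (a real run would start from git clone https://github.com/cocodataset/cocoapi.git), and the sed pattern is just one way to rewrite the references:

```shell
set -e
# Mock of a cocoapi checkout; in practice start from:
#   git clone https://github.com/cocodataset/cocoapi.git
mkdir -p cocoapi/common cocoapi/PythonAPI/pycocotools
echo 'sources = ["../common/maskApi.c"]' > cocoapi/PythonAPI/setup.py
echo '# distutils: sources = ../common/maskApi.c' > cocoapi/PythonAPI/pycocotools/_mask.pyx

# 1. Copy common/ inside PythonAPI
cp -r cocoapi/common cocoapi/PythonAPI/common
# 2. Rewrite every ../common reference to the local common/ copy
sed -i 's|\.\./common|common|g' cocoapi/PythonAPI/setup.py cocoapi/PythonAPI/pycocotools/_mask.pyx
# 3. Archive PythonAPI for the --packages flag
tar -C cocoapi -czf pycocotools-2.0.tar.gz PythonAPI
```

The resulting pycocotools-2.0.tar.gz is what gets appended to the --packages list of the gcloud job submission.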
Eval is now running, but it seems to be stuck on the first checkpoint it evaluates from the train directory. New checkpoints are saved, but the log message Found already evaluated checkpoint. Will try again in 300 seconds is repeated on every evaluation attempt, even when new checkpoints are there.
Has anyone hit or dealt with this issue?
I had the same issue. As suggested here, setting the environment variable GCS_READ_CACHE_MAX_SIZE_MB to 0 helped me.
I simply added these two lines to the object_detection/__init__.py file:
import os
os.environ['GCS_READ_CACHE_MAX_SIZE_MB'] = '0'
However, I would appreciate any suggestion for a cleaner solution.
Hi @Joshbarrington, thanks for your solution. I tried your method, but it returned another error:
Command '['pip', 'install', '--user', '--upgrade', '--force-reinstall', '--no-deps', u'pycocotools-2.0']' returned non-zero exit status 1
I used the command tar -czvf pycocotools-2.0 PythonAPI/ to compress it, and --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,pycocotools-2.0 \ to attach the package.
I'm a noob in GCP ML; is there anything wrong with my way of doing it? Thanks
Hi @Joshbarrington, this is the error message with --packages pycocotools-2.0:
Could not find a version that satisfies the requirement pycocotools-2.0 (from versions: )
Thanks
@bduman I got the same issue as you; Google Cloud couldn't install Cython.Build.
@nehcgnem make sure the reference in the flag contains the .tar.gz extension and the path to the compressed file, such as:
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz,path/to/pycocotools-2.0.tar.gz \
Also try adding 'Cython>=0.28.1' to the REQUIRED_PACKAGES in your setup.py.
@dvoram When adding those two lines to the file, I get the following corruption error:
W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at iterator_ops.cc:870 : Data loss: corrupted record at 147651135
Is this something you dealt with?
No, I actually encountered a different error:
I tensorflow/core/platform/cloud/retrying_utils.cc:77] The operation failed and will be automatically retried in 1.08566 seconds (attempt 1 out of 10), caused by: Unavailable: Error executing an HTTP request (HTTP response code 502, error code 0, error message '')
And I "resolved" it by updating to runtime version 1.5.
Nevertheless, your problem seems to really be a file corruption. Is the file readable locally? Perhaps, recreating and reuploading may help...
Is your eval working fine now?
Changing the environment variable causes the corruption error to happen during eval of ckpt-0, whereas before this the evaluation would complete ckpt-0 but hang when moving to the latest one. So the tfrecord file should be fine?
Yes, my eval works fine, now.
You may try to evaluate or inspect the ckpt file locally to see if it is really corrupted.
Are you sure you use the same config file for both train and eval?
The test.record worked locally. I think it must be something to do with adding the environment variable, but unsure as to why.
@Joshbarrington's comment here is the right solution. I also made a pycocotools-2.0.tar.gz file, which can be downloaded from here.
Note that the file I made is not guaranteed to be synced with the latest cocoapi, so you may still want to build it yourself:
Edit: copying from @fchouteau's comment below.
Since Google Cloud ML Engine has no python-tk, you also have to modify the imports in coco.py:
import json
import time
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.collections import PatchCollection
from matplotlib.patches import Polygon
Hi,
In order to use the .tar.gz archive from @pkulzc you have to make the following modifications: add 'Cython>=0.28.1' to REQUIRED_PACKAGES in models/research/setup.py (you should also add matplotlib):
"""Setup script for object_detection."""
from setuptools import find_packages
from setuptools import setup
REQUIRED_PACKAGES = ['Pillow>=1.0', 'matplotlib','Cython>=0.28.1']
setup(
name='object_detection',
version='0.1',
install_requires=REQUIRED_PACKAGES,
include_package_data=True,
packages=[p for p in find_packages() if p.startswith('object_detection')],
description='Tensorflow Object Detection Library',
)
Since Google Cloud ML Engine has no python-tk, you also have to modify the imports in coco.py:
import json
import time
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.collections import PatchCollection
from matplotlib.patches import Polygon
And it should be working
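For what it's worth, the fix works because matplotlib selects its backend at the first pyplot import; a minimal check of this (assuming matplotlib is installed):

```python
# Sanity check for the Agg fix above: matplotlib locks in its backend the
# first time pyplot is imported, which is why use('Agg') must come first.
import matplotlib
matplotlib.use('Agg')            # select a backend that needs no python-tk or display
import matplotlib.pyplot as plt  # with the Tk backend this import fails on a headless box

print(matplotlib.get_backend())  # reports the Agg backend
```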
@fchouteau Yes that's right. Our next release will include these changes.
Awesome, thanks for the solution! Is the recommendation still to use the 1.2 runtime as described in the docs or has this been tested on newer runtimes (1.4 or 1.5)?
I am currently running runtime 1.6 + python 2.7. You need runtime >= 1.4 for certain tf.contrib.data functions
Interesting. It's worth noting that 1.2 still seems to be the supported version for the ODAPI. 1.4 was, at least for me, causing severe/blocking errors in training. This isn't related, but it's going to prevent me from running a more modern version than 1.2, though I'd like to for many reasons.
@pkulzc will your next release also be supported for ODAPI on GCE?
Running into this issue using TF 1.6 or 1.7 and Python 2.7. I tried what @fchouteau suggested and am still getting an error on GCP:
Command '['pip', 'install', '--user', '--upgrade', '--force-reinstall', '--no-deps', u'pycocotoolsv2-2.0.tar.gz']' returned non-zero exit status 1
I'm using the latest repo code.
@jhovell we will have a major release soon and that will support ODAPI on CMLE.
@aysark currently our API doesn't work with 1.2+ runtimes on CMLE due to a known grpc issue. You can either wait for our next release (likely in a month) or use my 1.2-compatible branch.
@pkulzc but I was able to successfully train on the 1.5 runtime? I just need to run the eval job.
@aysark The issue in training happens randomly, so you may have gotten lucky. Do you have this pycocotoolsv2-2.0.tar.gz uploaded? Make sure the name is correct.
@pkulzc yes, I have it in my dist folder, and I send it as part of my job submission command:
--packages dist/object_detection-0.1.tar.gz,dist/pycocotoolsv2-2.0.tar.gz,slim/dist/slim-0.1.tar.gz
Do I also need to do the new COCO API installation step in https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md#coco-api-installation ? Thanks.
@aysark Hmm, how did you get this pycocotools package? Did you follow my comment here? And no, you don't need to do that installation, as it is for local runs.
@pkulzc yes, I followed what you said. I am actually using your pycocotools-2.0.tar.gz file; I just renamed it, and I had to change the matplotlib import in coco.py to:
import matplotlib
matplotlib.use('Agg')
The full stack log is:
INFO 2018-05-15 16:03:23 -0700 ps-replica-0 Installing the package: gs://infinitone/train/packages/b669fb8b24a29d19763394307b74c407c09e029ed3a4fe72a738d2187506f507/pycocotoolsv2-2.0.tar.gz
INFO 2018-05-15 16:03:23 -0700 ps-replica-0 Running command: pip install --user --upgrade --force-reinstall --no-deps pycocotoolsv2-2.0.tar.gz
ERROR 2018-05-15 16:03:23 -0700 ps-replica-0 Traceback (most recent call last):
ERROR 2018-05-15 16:03:23 -0700 ps-replica-0 File "<string>", line 1, in <module>
ERROR 2018-05-15 16:03:23 -0700 ps-replica-0 IOError: [Errno 2] No such file or directory: '/tmp/pip-req-build-0ubV1Q/setup.py'
INFO 2018-05-15 16:03:23 -0700 ps-replica-0 Complete output from command python setup.py egg_info:
INFO 2018-05-15 16:03:23 -0700 ps-replica-0 ----------------------------------------
ERROR 2018-05-15 16:03:24 -0700 ps-replica-0 Traceback (most recent call last):
ERROR 2018-05-15 16:03:24 -0700 ps-replica-0 File "<string>", line 1, in <module>
ERROR 2018-05-15 16:03:24 -0700 ps-replica-0 IOError: [Errno 2] No such file or directory: '/tmp/pip-req-build-qWLBY1/setup.py'
INFO 2018-05-15 16:03:24 -0700 ps-replica-0 ----------------------------------------
ERROR 2018-05-15 16:03:24 -0700 ps-replica-0 Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-req-build-qWLBY1/
INFO 2018-05-15 16:03:24 -0700 ps-replica-0 Clean up finished.
I think I was able to run eval without making changes to coco.py; could you please try using the original package?
@pkulzc when I use just your original package, I get a different error:
ERROR 2018-05-15 19:46:37 -0700 service The replica ps 2 exited with a non-zero status of 1. Termination reason: Error.
ERROR 2018-05-15 19:46:37 -0700 service Traceback (most recent call last):
ERROR 2018-05-15 19:46:37 -0700 service [...]
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/object_detection/evaluator.py", line 24, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from object_detection import eval_util
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/object_detection/eval_util.py", line 28, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from object_detection.metrics import coco_evaluation
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/object_detection/metrics/coco_evaluation.py", line 20, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from object_detection.metrics import coco_tools
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/object_detection/metrics/coco_tools.py", line 47, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from pycocotools import coco
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/pycocotools/coco.py", line 49, in <module>
ERROR 2018-05-15 19:46:37 -0700 service import matplotlib.pyplot as plt
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/matplotlib/pyplot.py", line 115, in <module>
ERROR 2018-05-15 19:46:37 -0700 service _backend_mod, new_figure_manager, draw_if_interactive, _show = pylab_setup()
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/matplotlib/backends/__init__.py", line 62, in pylab_setup
ERROR 2018-05-15 19:46:37 -0700 service [backend_name], 0)
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/matplotlib/backends/backend_tkagg.py", line 4, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from . import tkagg # Paint image to Tk photo blitter extension.
ERROR 2018-05-15 19:46:37 -0700 service File "/root/.local/lib/python2.7/site-packages/matplotlib/backends/tkagg.py", line 5, in <module>
ERROR 2018-05-15 19:46:37 -0700 service from six.moves import tkinter as Tk
ERROR 2018-05-15 19:46:37 -0700 service File "/usr/local/lib/python2.7/dist-packages/six.py", line 203, in load_module
ERROR 2018-05-15 19:46:37 -0700 service mod = mod._resolve()
ERROR 2018-05-15 19:46:37 -0700 service File "/usr/local/lib/python2.7/dist-packages/six.py", line 115, in _resolve
ERROR 2018-05-15 19:46:37 -0700 service return _import_module(self.mod)
ERROR 2018-05-15 19:46:37 -0700 service File "/usr/local/lib/python2.7/dist-packages/six.py", line 82, in _import_module
ERROR 2018-05-15 19:46:37 -0700 service __import__(name)
ERROR 2018-05-15 19:46:37 -0700 service File "/usr/lib/python2.7/lib-tk/Tkinter.py", line 42, in <module>
ERROR 2018-05-15 19:46:37 -0700 service raise ImportError, str(msg) + ', please install the python-tk package'
ERROR 2018-05-15 19:46:37 -0700 service ImportError: No module named _tkinter, please install the python-tk package
My bad, you do need to add matplotlib.use('Agg') to coco.py; sorry for the confusion.
From the error message, this looks like a pip version issue, and you probably want to talk to someone from GCP.
GCP has nothing to do with it...
I reverted to an older branch and eval works; not sure why the latest code regresses on basic functionality.
@aysark Which branch?
I've tried every suggestion here and have not been able to run the evaluation job successfully.
First I get ImportError: No module named pycocotools on the coco import.
So I try including the package above, but that gets me ImportError: No module named _tkinter, please install the python-tk package.
So I make the modifications to coco.py, and I get IOError: [Errno 2] No such file or directory: '/tmp/pip-req-build-qWLBY1/setup.py'
@pourhadi it's not a specific branch, I just went back in time. My HEAD is a couple of months back:
commit 3cb798fe02c9c627541e1d7f1816240a17dd02f3 (HEAD -> master)
Merge: f729a8c 95b0b03
Author: Chris Shallue <cshallue@users.noreply.github.com>
Date: Fri Jan 19 14:33:52 2018 -0800
Also note, I had to modify some files to fix some other things outlined in another issue (sorry, I can't recall which one), but that applies only if you are doing object detection.
I'm able to train successfully with the latest code, but I run my eval job with the old code.
@fchouteau did you add MANIFEST.in?
Tried your solution, but maskApi.h was not included in pycocotools-2.0.tar.gz when using python setup.py sdist, so the gcloud eval job won't work due to the missing maskApi.h. Hence:
git clone https://github.com/cocodataset/cocoapi.git
cd cocoapi/PythonAPI && mv ../common ./
# Update all files that have a ../common reference: replace "../common" with "common"
# Add "REQUIRED_PACKAGES = ['Cython>=0.28.1']" to setup.py
# Update pycocotools/coco.py as laid out by @fchouteau
echo "graft ./common/" > MANIFEST.in
python setup.py sdist
Using setup.py sdist is probably closer to the way object detection is packaged.
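A minimal, runnable sketch of that sdist route; the mock files below stand in for a real cocoapi/PythonAPI checkout (with common/ already moved inside and the ../common references already rewritten as described above), and the graft line is the key part:

```shell
set -e
# Mock PythonAPI layout standing in for a real cocoapi/PythonAPI checkout.
mkdir -p PythonAPI/common
touch PythonAPI/common/maskApi.h PythonAPI/common/maskApi.c
cat > PythonAPI/setup.py <<'EOF'
from setuptools import setup
setup(name='pycocotools', version='2.0')
EOF
# The graft line is what pulls the common/ C sources into the sdist
echo "graft common" > PythonAPI/MANIFEST.in
(cd PythonAPI && python3 setup.py sdist)   # writes PythonAPI/dist/pycocotools-2.0.tar.gz
```

The produced dist/pycocotools-2.0.tar.gz now contains maskApi.h and can be passed to --packages directly.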
@pkulzc Do you know of any updates on this for object detection with ml-engine? I was able to train successfully, but eval is failing similarly to others. I'm using the modified pycoco library as well as a modified setup.py that includes Cython. This got me to: ImportError: No module named _tkinter, please install the python-tk package
Is this something that should be added via setup.py, or should it come preinstalled on the Cloud ML Engine instance image?
This is the command i'm running:
gcloud ml-engine jobs submit training `whoami`_object_detection_eval_`date +%s` \
--job-dir=${YOUR_GCS_BUCKET}/train \
--packages dist/object_detection-0.1.tar.gz,dist/pycocotools-2.0.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.eval \
--runtime-version 1.5 \
--region us-central1 \
--scale-tier BASIC_GPU \
-- \
--checkpoint_dir=${YOUR_GCS_BUCKET}/train \
--eval_dir=${YOUR_GCS_BUCKET}/eval \
--pipeline_config_path=${YOUR_GCS_BUCKET}/data/faster_rcnn_resnet101_custom.config
from setuptools import find_packages
from setuptools import setup
#import os
REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1', 'Cython>=0.28.1']
#os.environ["PATH"] += os.pathsep + '/root/.local/bin'
setup(
name='object_detection',
version='0.1',
install_requires=REQUIRED_PACKAGES,
include_package_data=True,
packages=[p for p in find_packages() if p.startswith('object_detection')],
description='Tensorflow Object Detection Library',
)
@bryantharpe I fixed that issue by doing the same thing @chris1869 did; I'd say that's worth a try.
@pourhadi I looked through that list, and it looks like everything in https://storage.googleapis.com/object-detection-dogfood/data/pycocotools-2.0.tar.gz already has the changes @chris1869 listed. I tried extracting it, rerunning setup.py on it, and recompressing it, but I'm still getting the same error. I've tried both the 1.5 and 1.6 runtimes.
I'm guessing I'm missing something tiny?
Thanks @pkulzc and @pourhadi, I forgot to add one import. It's running eval fine now.
The steps mentioned by @pkulzc helped solve the "No module named 'pycocotools'" issue, but after that I got out-of-memory errors: 'replica ps 0 ran out-of-memory and exited with a non-zero status of 247.'
When I got out-of-memory errors in training, I tweaked the cloud.yml provided via the --config parameter and it worked. In evaluation this parameter seems not to be taken into account?
My cloud.yml configuration:
trainingInput:
  runtimeVersion: "1.5"
  scaleTier: CUSTOM
  masterType: complex_model_m_gpu
  workerCount: 3
  workerType: complex_model_m_gpu
  parameterServerCount: 3
  parameterServerType: complex_model_m
In eval you set scaleTier to BASIC_GPU instead of CUSTOM, so no config file is needed. See https://cloud.google.com/ml-engine/reference/rest/v1/projects.jobs#scaletier
When following the step described here in the documentation an error is thrown about pycocotools being missing.
This issue is described here as well as in a Stack Overflow thread, but the workaround/fix described in either place is to install a platform-specific version of pycocotools locally. I'm skeptical that installing pycocotools on my Mac is going to fix this when running in Google Cloud ML Engine. At the very least, I'd expect to have to bundle some Linux variant as a package along with my job. Is there any documentation on how to achieve this, or is this step currently broken with Google Cloud ML Engine?