tensorflow / models

Models and examples built with TensorFlow

Can't train detector model using Google Cloud #2890

Closed lucasharada closed 5 years ago

lucasharada commented 6 years ago

I'm trying to train my own detector model based on the TensorFlow sample and this post. I succeeded in training locally on a MacBook Pro. The problem is that I don't have a GPU, and training on the CPU is too slow (about 25-40 s per iteration).

So I'm trying to run it on Google Cloud ML Engine following the tutorial, but I can't get it to run properly.

My folder structure is described below:

+ data
 - train.record
 - test.record
+ models
 + train
 + eval
+ training
 - ssd_mobilenet_v1_coco

My steps to change from local training to Google Cloud training were:

  1. Create a bucket in Google Cloud storage and copy my local folder structure with files;
  2. Edit my pipeline.config file and change all paths from /Users/dev/detector/ to gs://bucketname/;
  3. Create a YAML file with the default configuration provided in the tutorial;
  4. Run
    gcloud ml-engine jobs submit training object_detection_`date +%s` \
    --job-dir=gs://bucketname/models/train \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-east1 \
    --config /Users/dev/detector/training/cloud.yml \
    -- \
    --train_dir=gs://bucketname/models/train \
    --pipeline_config_path=gs://bucketname/data/pipeline.config

    Doing so gives me the following error message from the MLUnits:

The replica ps 0 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'
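Step 2 above (pointing every path in pipeline.config at the bucket) is easy to get only partly right, so it may be worth scripting. The following is a minimal sketch, not something from the tutorial: `rewrite_paths` and `find_local_paths` are hypothetical helpers, and the config excerpt is illustrative only.

```python
import re


def rewrite_paths(config_text, local_prefix, bucket_prefix):
    """Return config_text with every local_prefix replaced by bucket_prefix."""
    return config_text.replace(local_prefix, bucket_prefix)


def find_local_paths(config_text):
    """Return quoted values that still look like absolute local paths."""
    return re.findall(r'"(/[^"]*)"', config_text)


# Hypothetical pipeline.config excerpt, used only to exercise the helpers.
sample = (
    'train_input_reader {\n'
    '  tf_record_input_reader {\n'
    '    input_path: "/Users/dev/detector/data/train.record"\n'
    '  }\n'
    '}\n'
)

rewritten = rewrite_paths(sample, "/Users/dev/detector/", "gs://bucketname/")
leftover = find_local_paths(rewritten)
print(rewritten)
print("local paths still present:", leftover)
```

If `leftover` is not empty, some path in the config still points at the local machine and will not resolve on the cloud workers.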

Thanks in advance.

law826 commented 6 years ago

Make sure you have run the following from the models/research/ directory before running setup.py:

export PYTHONPATH=$PYTHONPATH:`pwd`:`pwd`/slim

lucasharada commented 6 years ago

@law826 I already did that. Unfortunately, I'm getting the same error.

macro-dadt commented 6 years ago

I heard someone fixed it by modifying setup.py. Maybe this will help.

aselle commented 6 years ago

Did @macro-dadt's suggestion help?

jeffrwatts commented 6 years ago

I'm also trying to do what @lucasharada is doing (training with my own dataset). It runs fine locally on a MacBook Pro, although very slowly (approx. 25 s per step). I get the exact same error when trying to run on Google Cloud ML Engine.

I did try the link @macro-dadt suggested and did not have any luck (resulted in the same error).

Happy to provide more information if it helps diagnose what is happening.

cclough commented 6 years ago

I am getting this problem too!

andersskog commented 6 years ago

Now that TF 1.4 is available on CloudML Engine, the following changes fix this problem:

Make sure the runtimeVersion in your YAML file is 1.4, e.g.:

trainingInput:
  runtimeVersion: "1.4"
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerCount: 5
  workerType: standard_gpu
  parameterServerCount: 3
  parameterServerType: standard

Change setup.py to the following:

"""Setup script for object_detection."""

import logging
import subprocess
from setuptools import find_packages
from setuptools import setup
from setuptools.command.install import install


class CustomCommands(install):
    """Install command that installs python-tk via apt-get first."""

    def RunCustomCommand(self, command_list):
        # Run the command, capturing stdout and stderr together for the log.
        p = subprocess.Popen(
            command_list,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT)
        stdout_data, _ = p.communicate()
        logging.info('Log command output: %s', stdout_data)
        if p.returncode != 0:
            raise RuntimeError('Command %s failed: exit code: %s' %
                               (command_list, p.returncode))

    def run(self):
        # Install Tkinter (python-tk) before the regular install step.
        self.RunCustomCommand(['apt-get', 'update'])
        self.RunCustomCommand(
            ['apt-get', 'install', '-y', 'python-tk'])
        install.run(self)


REQUIRED_PACKAGES = ['Pillow>=1.0', 'protobuf>=3.3.0', 'Matplotlib>=2.1']

setup(
    name='object_detection',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    include_package_data=True,
    packages=[p for p in find_packages() if p.startswith('object_detection')],
    description='Tensorflow Object Detection Library',
    cmdclass={
        'install': CustomCommands,
    }
)
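The setup.py above pins protobuf>=3.3.0. A commonly reported cause of the `unexpected keyword argument 'file'` error is a protobuf runtime older than the protoc that generated the `*_pb2.py` files, though nothing in this thread confirms that this is what happens here. Under that assumption, the sketch below checks an installed protobuf version against the pin. `version_tuple` and `satisfies_minimum` are hypothetical helpers written for this example.

```python
import re


def version_tuple(version):
    """Turn a version string like '3.3.0' into (3, 3, 0), padding with zeros."""
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def satisfies_minimum(installed, minimum="3.3.0"):
    """Return True if the installed version is at least the minimum."""
    return version_tuple(installed) >= version_tuple(minimum)


try:
    import google.protobuf
    installed = google.protobuf.__version__
    status = "ok" if satisfies_minimum(installed) else "older than 3.3.0"
    print("protobuf", installed, status)
except ImportError:
    print("protobuf is not installed in this environment")
```

Running this inside the training package (for example at the top of the entry module) would show which protobuf version the cloud worker actually imports, which may differ from the local one.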

In object_detection/utils/visualization_utils.py, line 24 (before import matplotlib.pyplot as plt) add:

import matplotlib
matplotlib.use('agg')
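The order matters here: the backend has to be selected before matplotlib.pyplot is imported. A minimal sketch of that ordering, assuming matplotlib may or may not be installed where it runs (the `backend` variable is only for illustration):

```python
try:
    import matplotlib
    # Select the non-interactive Agg backend before pyplot is imported.
    matplotlib.use('agg')
    import matplotlib.pyplot as plt
    backend = matplotlib.get_backend()
except ImportError:
    # matplotlib is not available in this environment.
    backend = None

print("active backend:", backend)
```

Agg renders to image buffers and does not need a display, which is presumably why it is suggested for the cloud workers.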

In line 184 of object_detection/evaluator.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()

Finally, in line 103 of object_detection/builders/optimizer_builder.py, change

tf.train.get_or_create_global_step()

to

tf.contrib.framework.get_or_create_global_step()
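Instead of editing both call sites by hand, the lookup could be done once with a fallback. This is a sketch under the assumption that exactly one of the two locations exists in the installed runtime. `resolve_global_step_fn` is a hypothetical helper, and the stand-in modules below only imitate the attribute layout so the example runs without TensorFlow.

```python
from types import SimpleNamespace


def resolve_global_step_fn(tf_module):
    """Return get_or_create_global_step from tf.train if present,
    otherwise from tf.contrib.framework."""
    train = getattr(tf_module, "train", None)
    fn = getattr(train, "get_or_create_global_step", None)
    if fn is not None:
        return fn
    contrib = getattr(tf_module, "contrib", None)
    framework = getattr(contrib, "framework", None)
    fn = getattr(framework, "get_or_create_global_step", None)
    if fn is None:
        raise AttributeError("get_or_create_global_step not found")
    return fn


# Stand-ins that mimic the two layouts; they are not real TensorFlow modules.
fake_contrib_only = SimpleNamespace(
    train=SimpleNamespace(),
    contrib=SimpleNamespace(
        framework=SimpleNamespace(
            get_or_create_global_step=lambda: "from contrib")))
fake_train = SimpleNamespace(
    train=SimpleNamespace(get_or_create_global_step=lambda: "from train"))

print(resolve_global_step_fn(fake_contrib_only)())
print(resolve_global_step_fn(fake_train)())
```

In the real code, the call would be `resolve_global_step_fn(tf)()` in place of the two lines changed above.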

Hope this helps!

Janekxyz commented 6 years ago

@andersskog Your answer doesn't work in my case.

puma007 commented 6 years ago

@andersskog I have tried your answer, but it only runs a few steps before throwing an out-of-memory error. The image resolution is not large (less than 600).

puma007 commented 6 years ago

@lucasharada I have the same error, have you solved it?

puma007 commented 6 years ago

@aselle I have the same error. Is there a solution now? Thanks! I have set runtime-version=1.4 on the gcloud command line, and the yml file also sets runtimeVersion: "1.4", but I get the same error:

worker-replica-1
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
    from object_detection import trainer
  File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
    from object_detection.builders import preprocessor_builder
  File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
    from object_detection.protos import preprocessor_pb2
  File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
    options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'

ttungl commented 6 years ago

@lucasharada I got the same error. Any ideas?

mrainezty commented 6 years ago

I have trained my model successfully on my MacBook Pro, but I cannot do the same on Google Cloud. I tried all the methods mentioned above, and none of them has worked so far.

zyxcambridge commented 6 years ago

https://github.com/tensorflow/models/pull/3490

This may help you.

zyxcambridge commented 6 years ago

maxwang7 added some commits on 28 Feb:

annotates / fixes tutorial instructions (199e254)
fixes tf_example_decoder.py (82857bd)
adds dependencies (5ffed73)
FOR DEMONSTRATION ONLY; NOT FOR PUSHING … (2d76dce)

maxwang7 requested review from derekjchow and jch1 as code owners on 28 Feb.

The four classes this person modified look reliable.

ymodak commented 5 years ago

Closing since this is resolved. Feel free to reopen if the issue still persists. Thanks!