tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License

Error while running on cloud, version 0.2.3.dev0 #237

Closed AshwinAce closed 5 years ago

AshwinAce commented 5 years ago

Error logs when run on the cloud:

File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/luminoth/train.py", line 330, in <module> train()
File "/root/.local/lib/python2.7/site-packages/click/core.py", line 722, in __call__ return self.main(*args, **kwargs)
File "/root/.local/lib/python2.7/site-packages/click/core.py", line 697, in main rv = self.invoke(ctx)
File "/root/.local/lib/python2.7/site-packages/click/core.py", line 895, in invoke return ctx.invoke(self.callback, **ctx.params)
File "/root/.local/lib/python2.7/site-packages/click/core.py", line 535, in invoke return callback(*args, **kwargs)
File "/root/.local/lib/python2.7/site-packages/luminoth/train.py", line 296, in train config = get_config(config_files, override_params=override_params)
File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 17, in get_config model_base_config = get_base_config(model_class)
File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 63, in get_base_config return load_config_files([config_path])
File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 38, in load_config_files new_config = EasyDict(yaml.load(f))
File "/usr/local/lib/python2.7/dist-packages/yaml/__init__.py", line 69, in load loader = Loader(stream)
File "/usr/local/lib/python2.7/dist-packages/yaml/loader.py", line 34, in __init__ Reader.__init__(self, stream)
File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 85, in __init__ self.determine_encoding()
File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 124, in determine_encoding self.update_raw()
File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 178, in update_raw data = self.stream.read(size)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 125, in read self._preread_check()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 85, in _preread_check compat.as_bytes(self.__name), 1024 * 512, status)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__ c_api.TF_GetCode(self.status.status))
NotFoundError: /root/.local/lib/python2.7/site-packages/luminoth/models/fasterrcnn/base_config.yml; No such file or directory

After this the logs indicated that a cleanup was finished followed by this error:

The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
[...]
File "/root/.local/lib/python2.7/site-packages/click/core.py", line 535, in invoke return callback(*args, **kwargs)
File "/root/.local/lib/python2.7/site-packages/luminoth/train.py", line 296, in train config = get_config(config_files, override_params=override_params)
File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 17, in get_config model_base_config = get_base_config(model_class)
File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 63, in get_base_config return load_config_files([config_path])
File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 38, in load_config_files new_config = EasyDict(yaml.load(f))
File "/usr/local/lib/python2.7/dist-packages/yaml/__init__.py", line 69, in load loader = Loader(stream)
File "/usr/local/lib/python2.7/dist-packages/yaml/loader.py", line 34, in __init__ Reader.__init__(self, stream)
File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 85, in __init__ self.determine_encoding()
File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 124, in determine_encoding self.update_raw()
File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 178, in update_raw data = self.stream.read(size)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 125, in read self._preread_check()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 85, in _preread_check compat.as_bytes(self.__name), 1024 * 512, status)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__ c_api.TF_GetCode(self.status.status))
NotFoundError: /root/.local/lib/python2.7/site-packages/luminoth/models/fasterrcnn/base_config.yml; No such file or directory
To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=484504151094&resource=ml_job%2Fjob_id%2Ftrain_20181107_173730&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22train_20181107_173730%22

As far as I can see, this is the same error as before, but I am posting it in case I have missed something. Before running, I had to move setup.py and README.md to a particular folder to avoid errors, as I did in a previous error report. You had asked me to install version 0.2.2, but 0.2.3.dev0 is the version that was eventually installed. I'm not sure to what extent that influenced these results.
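The NotFoundError above says that base_config.yml is missing from the installed luminoth package. A minimal sketch for checking whether an installed package actually ships a given data file follows. The helper name `package_file_exists` is hypothetical and not part of Luminoth, and the sketch assumes Python 3 and a regular (non-zipped) install.

```python
import importlib.util
import os


def package_file_exists(package_name, relative_path):
    """Return True if relative_path exists inside the installed package.

    Hypothetical helper for illustration; not part of Luminoth.
    """
    # find_spec locates a top-level package without importing it.
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        # Package is not installed, or it is a plain module with no directory.
        return False
    return any(
        os.path.isfile(os.path.join(location, relative_path))
        for location in spec.submodule_search_locations
    )


if __name__ == "__main__":
    # The path reported in the traceback, relative to the package root.
    print(package_file_exists("luminoth", "models/fasterrcnn/base_config.yml"))
```

If this prints False on the machine where the package was built, the YAML file was probably not included when the package was packaged or installed, which would match the error seen on the cloud worker.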

dekked commented 5 years ago

What is the output of pip freeze | grep luminoth? How did you install Luminoth? Can you show the logs of the lumi gc cloud?

AshwinAce commented 5 years ago

luminoth==0.2.3.dev0

I installed by cloning the git repository and running setup.py install.

Here is a link to the logs:

https://console.cloud.google.com/logs/viewer?project=frcnn-test0&resource=ml_job%2Fjob_id%2Ftrain_20181107_173730&minLogLevel=0&expandAll=false&timestamp=2018-11-08T05:53:30.164000000Z&customFacets=&limitCustomFacetWidth=true&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22train_20181107_173730%22&dateRangeEnd=2018-11-08T05:53:26.672Z&interval=PT1H&dateRangeUnbound=backwardInTime&scrollTimestamp=2018-11-07T22:45:57.881652130Z

dekked commented 5 years ago
  1. Delete the setup.py file that you had manually copied before, in your attempt to make the command work. According to issue #231, I guess this file is /home/ace/anaconda3/lib/python3.6/site-packages/setup.py. The presence of this file makes lumi gc cloud think that you are using editable mode (used only for development).
  2. Uninstall all versions of Luminoth: pip uninstall luminoth. Verify that running lumi now fails (the command should not be found).
  3. Install 0.2.2 with pip: pip install luminoth

You shouldn't need to copy any file whatsoever to make cloud work. Let me know.
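The uninstall and reinstall steps above can also be verified from Python instead of pip freeze. This is only a sketch: the helper name `installed_version` is made up for illustration, and it assumes Python 3.8 or newer, where `importlib.metadata` is in the standard library.

```python
from importlib import metadata  # standard library since Python 3.8


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed.

    Hypothetical helper for illustration.
    """
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # After step 2 this should print None; after step 3, the released version.
    print(installed_version("luminoth"))
```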

AshwinAce commented 5 years ago
/home/ace/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Bucket name not specified. Using "luminoth-107908267694143463013".
Generating "/tmp/tmp1_nm4rou/setup.py" for installing luminoth==0.2.2.
Traceback (most recent call last):
  File "/home/ace/anaconda3/bin/lumi", line 11, in <module>
    sys.exit(cli())
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/luminoth/tools/cloud/gcloud.py", line 87, in decorated_function
    return f(*args, **kwargs)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/luminoth/tools/cloud/gcloud.py", line 271, in train
    package_path = build_package(bucket, base_path)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/luminoth/tools/cloud/gcloud.py", line 130, in build_package
    tarball_filename = os.listdir(output_dir)[0]
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmp1_nm4rou/output'

The above is the error I get when I try to run. In the past, I fixed it by moving setup.py and README.md to a folder. This time I did get luminoth 0.2.2 installed, as opposed to the dev version I had before.

dekked commented 5 years ago

You should NOT move setup.py to that folder. Please delete those files (this is what I was saying in my previous comment).

As for your output, this is the line I care most about:

Generating "/tmp/tmp1_nm4rou/setup.py" for installing luminoth==0.2.2.

It means we are generating a stub for installing into Google Cloud. For some reason, the packaging looks to be failing.

Open the /home/ace/anaconda3/lib/python3.6/site-packages/luminoth/tools/cloud/gcloud.py file with a text editor and scroll to this block:

    devnull = open(os.devnull, 'w')
    subprocess.call(
        ['python', 'setup.py', 'sdist', '--dist-dir', output_dir],
        cwd=package_path, stdout=devnull, stderr=devnull,
    )

And replace it with:

    devnull = open(os.devnull, 'w')
    subprocess.call(
        ['python', 'setup.py', 'sdist', '--dist-dir', output_dir],
        cwd=package_path
    )

Then try again, and show me the output of the command. Hopefully, with this we can debug why it fails for you :)
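A more general version of this debugging step is to capture the child process output and check its exit status, rather than discarding both. The sketch below is an illustration only, not Luminoth's actual code: `build_sdist` is a hypothetical name, and it assumes Python 3.5+ for `subprocess.run`.

```python
import os
import subprocess
import sys


def build_sdist(package_path, output_dir):
    """Run `setup.py sdist` in package_path and return the tarball path.

    Hypothetical helper for illustration; not Luminoth's actual code.
    """
    # Resolve first: the child runs with a different working directory.
    output_dir = os.path.abspath(output_dir)

    setup_py = os.path.join(package_path, "setup.py")
    if not os.path.isfile(setup_py):
        # Fail early and name the directory that would be used as cwd.
        raise FileNotFoundError("no setup.py in {!r}".format(package_path))

    result = subprocess.run(
        [sys.executable, "setup.py", "sdist", "--dist-dir", output_dir],
        cwd=package_path,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    if result.returncode != 0:
        # Surface the build output instead of sending it to devnull.
        raise RuntimeError("sdist failed:\n" + result.stdout)

    tarballs = os.listdir(output_dir)
    if not tarballs:
        raise RuntimeError("sdist produced no files in {!r}".format(output_dir))
    return os.path.join(output_dir, tarballs[0])
```

With checks like these, a missing setup.py is reported directly, instead of showing up later as a FileNotFoundError on the output directory.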

AshwinAce commented 5 years ago

When I said previously, I meant the previous times when I had this problem with other versions of luminoth. I did not move any files when I ran here. Sorry for the usage of ambiguous wording. I did the change you mentioned, here are the results.

Bucket name not specified. Using "luminoth-107908267694143463013".
Generating "/tmp/tmpp1nf2qcy/setup.py" for installing luminoth==0.2.2.
python: can't open file 'setup.py': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/home/ace/anaconda3/bin/lumi", line 11, in <module>
    sys.exit(cli())
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/luminoth/tools/cloud/gcloud.py", line 87, in decorated_function
    return f(*args, **kwargs)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/luminoth/tools/cloud/gcloud.py", line 271, in train
    package_path = build_package(bucket, base_path)
  File "/home/ace/anaconda3/lib/python3.6/site-packages/luminoth/tools/cloud/gcloud.py", line 130, in build_package
    tarball_filename = os.listdir(output_dir)[0]
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpp1nf2qcy/output'

dekked commented 5 years ago

Thanks for your help debugging this! It was all because of a misnamed variable in the cwd= argument. Should be fixed now.
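The thread does not show the actual fix, so the following is only a sketch of how a wrong cwd= value can produce the symptom seen here. An empty temporary directory stands in for the wrong directory (an assumption for illustration). `subprocess.call` does not raise when the child fails, so with output sent to devnull the failure stays invisible until a later step looks for the tarball.

```python
import subprocess
import sys
import tempfile

# An empty directory stands in for the wrong cwd (assumption for illustration).
wrong_dir = tempfile.mkdtemp()

exit_code = subprocess.call(
    [sys.executable, "setup.py", "sdist"],
    cwd=wrong_dir,
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)

# subprocess.call only returns the exit status; it never raises on failure.
build_failed = exit_code != 0
print("build failed:", build_failed)
```

Checking the returned status, or using `subprocess.check_call`, would turn this silent failure into an immediate error.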

You can install from the cloned repo with pip install .[gcloud] and let me know!

PS: Please use code blocks, since they make multiline output much more readable (and save me from editing your issues :D).

AshwinAce commented 5 years ago

I cloned the repo and ran that command. This error vanished and the job is preparing currently. I will use code blocks in the future. I will report back in a few minutes.