tryolabs / luminoth

Deep Learning toolkit for Computer Vision.
https://tryolabs.com
BSD 3-Clause "New" or "Revised" License

Error while running on cloud #232

Closed: AshwinAce closed this issue 5 years ago

AshwinAce commented 5 years ago

This is the error shown in the logs when I try to run on the cloud:

  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/luminoth/train.py", line 330, in <module>
    train()
  File "/root/.local/lib/python2.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/root/.local/lib/python2.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/.local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/luminoth/train.py", line 296, in train
    config = get_config(config_files, override_params=override_params)
  File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 17, in get_config
    model_base_config = get_base_config(model_class)
  File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 63, in get_base_config
    return load_config_files([config_path])
  File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 38, in load_config_files
    new_config = EasyDict(yaml.load(f))
  File "/usr/local/lib/python2.7/dist-packages/yaml/__init__.py", line 69, in load
    loader = Loader(stream)
  File "/usr/local/lib/python2.7/dist-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 85, in __init__
    self.determine_encoding()
  File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 124, in determine_encoding
    self.update_raw()
  File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 178, in update_raw
    data = self.stream.read(size)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 125, in read
    self._preread_check()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 85, in _preread_check
    compat.as_bytes(self.__name), 1024 * 512, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
NotFoundError: /root/.local/lib/python2.7/site-packages/luminoth/models/fasterrcnn/base_config.yml; No such file or directory

After this, the logs indicated that cleanup had finished, followed by this error:

The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
  [...]
  File "/root/.local/lib/python2.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/root/.local/lib/python2.7/site-packages/luminoth/train.py", line 296, in train
    config = get_config(config_files, override_params=override_params)
  File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 17, in get_config
    model_base_config = get_base_config(model_class)
  File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 63, in get_base_config
    return load_config_files([config_path])
  File "/root/.local/lib/python2.7/site-packages/luminoth/utils/config.py", line 38, in load_config_files
    new_config = EasyDict(yaml.load(f))
  File "/usr/local/lib/python2.7/dist-packages/yaml/__init__.py", line 69, in load
    loader = Loader(stream)
  File "/usr/local/lib/python2.7/dist-packages/yaml/loader.py", line 34, in __init__
    Reader.__init__(self, stream)
  File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 85, in __init__
    self.determine_encoding()
  File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 124, in determine_encoding
    self.update_raw()
  File "/usr/local/lib/python2.7/dist-packages/yaml/reader.py", line 178, in update_raw
    data = self.stream.read(size)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 125, in read
    self._preread_check()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/lib/io/file_io.py", line 85, in _preread_check
    compat.as_bytes(self.__name), 1024 * 512, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 519, in __exit__
    c_api.TF_GetCode(self.status.status))
NotFoundError: /root/.local/lib/python2.7/site-packages/luminoth/models/fasterrcnn/base_config.yml; No such file or directory

To find out more about why your job exited please check the logs: https://console.cloud.google.com/logs/viewer?project=484504151094&resource=ml_job%2Fjob_id%2Ftrain_20181104_131838&advancedFilter=resource.type%3D%22ml_job%22%0Aresource.labels.job_id%3D%22train_20181104_131838%22

I ran the job again after changing the number of epochs in the fasterrcnn.py file from the default 1000 to 100, but that resulted in the same errors.
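
Both tracebacks end in the same `NotFoundError`: `base_config.yml` is not present under the installed `luminoth` package directory. One way to confirm that on a given machine is to look for the file inside the installed package. The following is a minimal sketch, not part of Luminoth: the helper name `find_missing_package_files` is hypothetical, and it assumes only that the package is importable in the environment where it runs.

```python
import importlib
import os


def find_missing_package_files(package_name, relative_paths):
    """Return the relative paths that do not exist inside an installed package.

    Hypothetical diagnostic helper: it imports the package, locates its
    directory on disk, and checks each path relative to that directory.
    """
    package = importlib.import_module(package_name)
    package_dir = os.path.dirname(os.path.abspath(package.__file__))
    return [
        rel for rel in relative_paths
        if not os.path.isfile(os.path.join(package_dir, rel))
    ]
```

Calling `find_missing_package_files("luminoth", ["models/fasterrcnn/base_config.yml"])` in the failing environment would return a non-empty list if the file named in the error is indeed absent from the installed package. An empty list would suggest the problem lies elsewhere, for example in how the path is resolved at run time.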

dekked commented 5 years ago

Related to #231.