Closed YunyiShen closed 4 years ago
Update:
Changing line 105 of training/utils.py to
res_size = tf.reshape(tf.shape(acc)[0], [])
and line 322 of training/prepare_model.py to
metrics=[accuracy])
makes the test work on Colab. However, reshaping acc in the top-k metric does not help.
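As a rough illustration of what "rank 0" versus "rank 1" means in the error, here is a NumPy analogue. It is not the repository's code and uses NumPy rather than TensorFlow, so only the shape semantics carry over; the variable names are made up for the example.

```python
import numpy as np

# Hypothetical stand-in for a value that arrives with shape (1,), i.e. rank 1.
value = np.array([4])
print(value.ndim)

# Reshaping to an empty shape yields a rank-0 (scalar) array,
# analogous to what tf.reshape(x, []) does for a one-element tensor.
scalar = np.reshape(value, ())
print(scalar.ndim)
```

A conditional op that expects a scalar predicate would accept the second form but not the first.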
Hi YunyiShen,
Thanks for your interest in this repository. I tried to reproduce your errors using notebooks on Colab and found the following:
!pip install git+git://github.com/marco-willi/camera-trap-classifier.git#egg=camera_trap_classifier[tf]
%cd /usr/local/lib/python3.6/dist-packages/camera_trap_classifier/
!python -m unittest discover test/training
This produced the following (expected) output:
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
_np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
np_resource = np.dtype([("resource", np.ubyte, 1)])
Testing Model: ResNet18
2020-01-04 00:52:23.889011: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Epoch 1/2
1/1 [==============================] - 7s 7s/step - loss: 4.3768 - label/label/species_loss: 2.1320 - label/label/counts_loss: 1.2840 - label/label/species_accuracy: 0.5000 - label/label/species_top_k_accuracy: 0.5000 - label/label/counts_accuracy: 0.0000e+00 - label/label/counts_top_k_accuracy: 1.0000 - val_loss: 18.6228 - val_label/label/species_loss: 12.0886 - val_label/label/counts_loss: 5.5735 - val_label/label/species_accuracy: 0.2500 - val_label/label/species_top_k_accuracy: 0.5000 - val_label/label/counts_accuracy: 0.5000 - val_label/label/counts_top_k_accuracy: 1.0000
Epoch 2/2
...
!pip install git+git://github.com/marco-willi/camera-trap-classifier.git#egg=camera_trap_classifier[tf-gpu]
%cd /usr/local/lib/python3.6/dist-packages/camera_trap_classifier/
!python -m unittest discover test/training
I get the following errors:
E
======================================================================
ERROR: test_create_model (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_create_model
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/usr/lib/python3.6/imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/unittest/loader.py", line 428, in _find_test_path
module = self._get_module_from_name(name)
File "/usr/lib/python3.6/unittest/loader.py", line 369, in _get_module_from_name
__import__(name)
File "/usr/local/lib/python3.6/dist-packages/camera_trap_classifier/test/training/test_create_model.py", line 1, in <module>
import tensorflow as tf
File "/usr/local/lib/python3.6/dist-packages/tensorflow/__init__.py", line 24, in <module>
from tensorflow.python import pywrap_tensorflow # pylint: disable=unused-import
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
from tensorflow.python import pywrap_tensorflow
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
raise ImportError(msg)
ImportError: Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
from tensorflow.python.pywrap_tensorflow_internal import *
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
_pywrap_tensorflow_internal = swig_import_helper()
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
_mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
File "/usr/lib/python3.6/imp.py", line 243, in load_module
return load_dynamic(name, filename, file)
File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory
Failed to load the native TensorFlow runtime.
See https://www.tensorflow.org/install/errors
for some common reasons and solutions. Include the entire stack trace
above this error message when asking for help.
----------------------------------------------------------------------
Ran 1 test in 0.000s
FAILED (errors=1)
My error indicates that the CUDA drivers on Colab are too new and that Tensorflow 1.12 requires CUDA 9.0.
I verified the CUDA version on Colab with:
!cat /usr/local/cuda/version.txt
and found that it is: CUDA Version 10.0.130
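A small sketch of how that check could be automated: it parses the version line and compares it with the CUDA version required by TensorFlow 1.12 as stated in this thread. The function name and the constant are hypothetical, and the input string is the one reported above.

```python
import re


def parse_cuda_version(text):
    """Return (major, minor) from a line such as 'CUDA Version 10.0.130'."""
    match = re.search(r"CUDA Version (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no CUDA version found in: %r" % text)
    return int(match.group(1)), int(match.group(2))


# Assumption taken from this thread: TensorFlow 1.12 needs CUDA 9.0.
REQUIRED_CUDA = (9, 0)

installed = parse_cuda_version("CUDA Version 10.0.130")
print("installed:", installed, "matches required:", installed == REQUIRED_CUDA)
```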
My question would be: Can you verify that you have Tensorflow 1.12 installed?
If you have a newer version of Tensorflow the code might not be compatible. If that is the case, one solution would be to downgrade CUDA on Colab (which may not be possible) or use a compute instance outside of Colab where you can install the CUDA driver.
If you can use Docker you can expose a specific version of the CUDA driver, see here: https://github.com/marco-willi/camera-trap-classifier/blob/master/docs/Docker_GPU.md
Let me know if that is indeed the issue.
Thanks for the quick reply!
I checked the TensorFlow version with
import tensorflow as tf
print(tf.__version__)
and it turns out to be TF 1.15. Sadly, I do not currently have access to other instances.
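A minimal sketch of a guard that could be run before the tests, assuming the code targets TensorFlow 1.12 as stated in this thread. The helper names are hypothetical; in a notebook one would pass tf.__version__ as the installed version.

```python
def major_minor(version):
    """Return the (major, minor) tuple of a version string such as '1.15.0'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def matches_required(installed, required="1.12"):
    """True if the installed version has the same major.minor as required."""
    return major_minor(installed) == major_minor(required)


# The version reported above versus the one the package expects.
print(matches_required("1.15.0"))
print(matches_required("1.12.0"))
```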
For 2, I figured out that it was my own mistake: I did not force Colab to install TensorFlow 1.12. I called pip from Python rather than from bash, so the part of the command after # had no effect.
If you call pip from Python instead of bash, i.e. use pip ... instead of !pip ... or !python -m pip ..., you can reproduce the error. Here is the full message:
`
/usr/local/lib/python3.6/dist-packages/camera_trap_classifier
Testing Model: ResNet18
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/resource_variable_ops.py:1630: calling BaseResourceVariable.init (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/array_ops.py:1475: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/camera_trap_classifier/training/utils.py:118: The name tf.is_nan is deprecated. Please use tf.math.is_nan instead.
self.session()
or self.cached_session()
instead.
.Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1607, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 0 but is rank 1 for 'metrics/label/label/species_top_k_accuracy/cond/Switch' (op: 'Switch') with input shapes: [?], [?].
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/camera_trap_classifier/test/training/test_create_model.py", line 39, in testModelRuns
output_loss_weights=None)
File "/usr/local/lib/python3.6/dist-packages/camera_trap_classifier/training/prepare_model.py", line 322, in create_model
metrics=[accuracy, top_k_accuracy])
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/base.py", line 457, in _method_wrapper
result = method(self, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 366, in compile
masks=self._prepare_output_masks())
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2061, in _handle_metrics
target, output, output_mask))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2012, in _handle_per_output_metrics
metric_fn, y_true, y_pred, weights=weights, mask=mask)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 1065, in call_metric_function
return metric_fn(y_true, y_pred, sample_weight=weights)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 194, in __call__
replica_local_fn, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/distribute/distributed_training_utils.py", line 1135, in call_replica_local_fn
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 177, in replica_local_fn
update_op = self.update_state(*args, **kwargs) # pylint: disable=not-callable
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/metrics_utils.py", line 75, in decorated
update_op = update_state_fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 582, in update_state
matches = self._fn(y_true, y_pred, **self._fn_kwargs)
File "/usr/local/lib/python3.6/dist-packages/camera_trap_classifier/training/utils.py", line 120, in top_k_accuracy
lambda: acc)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1212, in cond
p_2, p_1 = switch(pred, pred)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 310, in switch
return gen_control_flow_ops.switch(data, pred, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_control_flow_ops.py", line 937, in switch
"Switch", data=data, pred=pred, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1770, in __init__
control_input_ops)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
raise ValueError(str(e))
ValueError: Shape must be rank 0 but is rank 1 for 'metrics/label/label/species_top_k_accuracy/cond/Switch' (op: 'Switch') with input shapes: [?], [?].
Ran 2 tests in 1.361s
FAILED (errors=1)
`
I am very curious why it works when only accuracy is used as a metric when compiling the model (see my first comment). Finding out may lead to a quick fix that makes TensorFlow 1.15 (and thus the free Colab) usable.
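On the pip invocation described earlier in this comment: one way to sidestep any notebook or shell handling of the # character might be to hand pip an explicit argument list from Python. This is only a sketch; the URL is copied from this thread, the function names are hypothetical, and nothing is installed unless install() is actually called.

```python
import subprocess
import sys

# URL copied from this thread; the '#egg=...[tf-gpu]' fragment selects the extras.
PACKAGE_URL = (
    "git+git://github.com/marco-willi/camera-trap-classifier.git"
    "#egg=camera_trap_classifier[tf-gpu]"
)


def build_pip_command(url):
    """Build an argument list for pip; no shell or notebook magic parses it."""
    return [sys.executable, "-m", "pip", "install", url]


def install(url):
    """Run pip in a subprocess (not called here, so nothing gets installed)."""
    subprocess.check_call(build_pip_command(url))


command = build_pip_command(PACKAGE_URL)
print(command)
```

Because the URL is passed as a single list element, the fragment after # reaches pip unchanged.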
Thanks for your feedback. I'll close the issue since it is related to a version incompatibility. It may indeed be a quick fix to upgrade to TF 1.15. However, I currently don't have the time to do it myself, and it is unclear whether further problems would arise.
Thanks for the great software! I am trying to test it out on Google Colab, but there is an error and here is how to reproduce it:
Runtime: Python3 GPU
Install via pip:
pip install git+git://github.com/marco-willi/camera-trap-classifier.git # egg=camera_trap_classifier[tf-gpu]
Try the training test from bash:
%cd /usr/local/lib/python3.6/dist-packages/camera_trap_classifier/
!python -m unittest discover test/training
This will produce an InvalidArgumentError with message:
ERROR: testModelRuns (test_create_model.CreateModelTests)
testModelRuns (test_create_model.CreateModelTests)
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1607, in _create_c_op
c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 0 but is rank 1 for 'metrics/label/label/species_top_k_accuracy/cond/Switch' (op: 'Switch') with input shapes: [?], [?].
There is also another exception with a much longer traceback. The problem occurs in
/training/test_create_model.py, line 39, in testModelRuns output_loss_weights=None)
/training/prepare_model.py, line 322, in create_model metrics=[accuracy, top_k_accuracy])
/training/tracking/base.py, line 457, in _method_wrapper result = method(self, *args, **kwargs)
before the trace goes into tensorflow and produces a ValueError with message:
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
raise ValueError(str(e))
ValueError: Shape must be rank 0 but is rank 1 for 'metrics/label/label/species_top_k_accuracy/cond/Switch' (op: 'Switch') with input shapes: [?], [?].
This error also occurs when using ctc.train. The whole pipeline before training works as expected, as do the two other tests. Thanks!