marco-willi / camera-trap-classifier

Automatically identify animals in camera trap images by training a deep neural network.
MIT License

Test code fails with ValueError when running on Google Colab #23

Closed YunyiShen closed 4 years ago

YunyiShen commented 4 years ago

Thanks for the great software! I am trying to test it out on Google Colab, but there is an error. Here is how to reproduce it. Runtime: Python 3, GPU

This error also occurs when using ctc.train. All pipeline steps before training work as expected, as do the two other tests.

Thanks!

YunyiShen commented 4 years ago

Update:

Changing line 105 of training/utils.py to res_size = tf.reshape(tf.shape(acc)[0], []) and line 322 of training/prepare_model.py to metrics=[accuracy]) makes the test pass on Colab. However, reshaping acc in top-k does not help.
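Dropping top_k_accuracy from the metrics list sidesteps the error, but it also removes that metric from the training log. For reference, here is a plain-Python sketch of what a top-k accuracy metric computes. It is not the repository's TensorFlow implementation; the function signature, the example scores, and the choice to return 0.0 for an empty batch are assumptions made for illustration.

```python
def top_k_accuracy(true_labels, scores, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if not true_labels:
        # Assumption: an empty batch is reported as 0.0 instead of dividing by zero.
        return 0.0
    hits = 0
    for label, row in zip(true_labels, scores):
        # Class indices ordered by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return hits / len(true_labels)


# Made-up example: two samples, three classes.
true_labels = [0, 2]
scores = [
    [0.1, 0.7, 0.2],  # true class 0 has the lowest score
    [0.2, 0.5, 0.3],  # true class 2 has the second-highest score
]
print(top_k_accuracy(true_labels, scores, k=1))  # 0.0
print(top_k_accuracy(true_labels, scores, k=2))  # 0.5
```

With k=1 this reduces to ordinary accuracy, which is the metric that remains after the change above.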

marco-willi commented 4 years ago

Hi YunyiShen,

Thanks for your interest in this repository. I tried to reproduce your errors using notebooks on Colab and found the following:

  1. The installation and the test seem to work on Colab with the CPU version of Tensorflow. These are the steps I took:
!pip install git+git://github.com/marco-willi/camera-trap-classifier.git#egg=camera_trap_classifier[tf]

%cd /usr/local/lib/python3.6/dist-packages/camera_trap_classifier/

!python -m unittest discover test/training

This produced the following (expected) output:

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:523: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:524: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/dtypes.py:532: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
Testing Model: ResNet18
2020-01-04 00:52:23.889011: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Epoch 1/2
1/1 [==============================] - 7s 7s/step - loss: 4.3768 - label/label/species_loss: 2.1320 - label/label/counts_loss: 1.2840 - label/label/species_accuracy: 0.5000 - label/label/species_top_k_accuracy: 0.5000 - label/label/counts_accuracy: 0.0000e+00 - label/label/counts_top_k_accuracy: 1.0000 - val_loss: 18.6228 - val_label/label/species_loss: 12.0886 - val_label/label/counts_loss: 5.5735 - val_label/label/species_accuracy: 0.2500 - val_label/label/species_top_k_accuracy: 0.5000 - val_label/label/counts_accuracy: 0.5000 - val_label/label/counts_top_k_accuracy: 1.0000
Epoch 2/2
...
  2. I can't reproduce the error you have reported with the GPU version of Tensorflow and a GPU runtime on Colab. Here is what I did:
!pip install git+git://github.com/marco-willi/camera-trap-classifier.git#egg=camera_trap_classifier[tf-gpu]

%cd /usr/local/lib/python3.6/dist-packages/camera_trap_classifier/

!python -m unittest discover test/training

I get the following errors:

E
======================================================================
ERROR: test_create_model (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: test_create_model
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/unittest/loader.py", line 428, in _find_test_path
    module = self._get_module_from_name(name)
  File "/usr/lib/python3.6/unittest/loader.py", line 369, in _get_module_from_name
    __import__(name)
  File "/usr/local/lib/python3.6/dist-packages/camera_trap_classifier/test/training/test_create_model.py", line 1, in <module>
    import tensorflow as tf
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 74, in <module>
    raise ImportError(msg)
ImportError: Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow.py", line 58, in <module>
    from tensorflow.python.pywrap_tensorflow_internal import *
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 28, in <module>
    _pywrap_tensorflow_internal = swig_import_helper()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/pywrap_tensorflow_internal.py", line 24, in swig_import_helper
    _mod = imp.load_module('_pywrap_tensorflow_internal', fp, pathname, description)
  File "/usr/lib/python3.6/imp.py", line 243, in load_module
    return load_dynamic(name, filename, file)
  File "/usr/lib/python3.6/imp.py", line 343, in load_dynamic
    return _load(spec)
ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory

Failed to load the native TensorFlow runtime.

See https://www.tensorflow.org/install/errors

for some common reasons and solutions.  Include the entire stack trace
above this error message when asking for help.

----------------------------------------------------------------------
Ran 1 test in 0.000s

FAILED (errors=1)

My error indicates that the CUDA version on Colab is too new: Tensorflow 1.12 requires CUDA 9.0, which is why libcublas.so.9.0 cannot be found.

I verified the CUDA version on Colab with:

!cat /usr/local/cuda/version.txt

and found that it is: CUDA Version 10.0.130
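The comparison can also be done mechanically. This is a minimal sketch, assuming the version file has the format shown above and that the name of the missing library encodes the CUDA version TensorFlow was built against; both helper names are hypothetical.

```python
import re


def cuda_version_from_text(text):
    """Extract (major, minor) from a line such as 'CUDA Version 10.0.130'."""
    match = re.search(r"CUDA Version (\d+)\.(\d+)", text)
    if match is None:
        raise ValueError("no CUDA version found in: %r" % text)
    return int(match.group(1)), int(match.group(2))


def required_cuda_from_error(message):
    """Extract (major, minor) from an error that names 'libcublas.so.X.Y'."""
    match = re.search(r"libcublas\.so\.(\d+)\.(\d+)", message)
    if match is None:
        raise ValueError("no libcublas version found in: %r" % message)
    return int(match.group(1)), int(match.group(2))


# Strings copied from the outputs in this thread.
installed = cuda_version_from_text("CUDA Version 10.0.130")
required = required_cuda_from_error(
    "ImportError: libcublas.so.9.0: cannot open shared object file: "
    "No such file or directory"
)
print(installed, required, installed == required)  # (10, 0) (9, 0) False
```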

My question would be: Can you verify that you have Tensorflow 1.12 installed?
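One way to check this is to compare the installed version string against the 1.12 release. The sketch below works on plain strings so that it runs anywhere; in a notebook the string would come from tf.__version__ after import tensorflow as tf. The helper names are hypothetical.

```python
TESTED_RELEASE = (1, 12)  # the release named in this thread


def release_of(version):
    """Return (major, minor) for a version string such as '1.12.0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def is_tested_release(version):
    """True if the version string belongs to the tested release line."""
    return release_of(version) == TESTED_RELEASE


print(is_tested_release("1.12.0"))  # True
print(is_tested_release("1.15.0"))  # False
```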

If you have a newer version of Tensorflow the code might not be compatible. If that is the case, one solution would be to downgrade CUDA on Colab (which may not be possible) or use a compute instance outside of Colab where you can install the CUDA driver.

If you can use Docker you can expose a specific version of the CUDA driver, see here: https://github.com/marco-willi/camera-trap-classifier/blob/master/docs/Docker_GPU.md

Let me know if that is indeed the issue.

YunyiShen commented 4 years ago

Thanks for the quick reply!

EWARNING:tensorflow:From /usr/lib/python3.6/contextlib.py:60: TensorFlowTestCase.test_session (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use self.session() or self.cached_session() instead.
.
======================================================================
ERROR: testModelRuns (test_create_model.CreateModelTests)
testModelRuns (test_create_model.CreateModelTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1607, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Shape must be rank 0 but is rank 1 for 'metrics/label/label/species_top_k_accuracy/cond/Switch' (op: 'Switch') with input shapes: [?], [?].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/camera_trap_classifier/test/training/test_create_model.py", line 39, in testModelRuns
    output_loss_weights=None)
  File "/usr/local/lib/python3.6/dist-packages/camera_trap_classifier/training/prepare_model.py", line 322, in create_model
    metrics=[accuracy, top_k_accuracy])
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/tracking/base.py", line 457, in _method_wrapper
    result = method(self, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 366, in compile
    masks=self._prepare_output_masks())
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2061, in _handle_metrics
    target, output, output_mask))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training.py", line 2012, in _handle_per_output_metrics
    metric_fn, y_true, y_pred, weights=weights, mask=mask)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py", line 1065, in call_metric_function
    return metric_fn(y_true, y_pred, sample_weight=weights)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 194, in __call__
    replica_local_fn, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/distribute/distributed_training_utils.py", line 1135, in call_replica_local_fn
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 177, in replica_local_fn
    update_op = self.update_state(*args, **kwargs)  # pylint: disable=not-callable
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/utils/metrics_utils.py", line 75, in decorated
    update_op = update_state_fn(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/metrics.py", line 582, in update_state
    matches = self._fn(y_true, y_pred, **self._fn_kwargs)
  File "/usr/local/lib/python3.6/dist-packages/camera_trap_classifier/training/utils.py", line 120, in top_k_accuracy
    lambda: acc)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 1212, in cond
    p_2, p_1 = switch(pred, pred)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/control_flow_ops.py", line 310, in switch
    return gen_control_flow_ops.switch(data, pred, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_control_flow_ops.py", line 937, in switch
    "Switch", data=data, pred=pred, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1770, in __init__
    control_input_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
    raise ValueError(str(e))
ValueError: Shape must be rank 0 but is rank 1 for 'metrics/label/label/species_top_k_accuracy/cond/Switch' (op: 'Switch') with input shapes: [?], [?].

----------------------------------------------------------------------
Ran 2 tests in 1.361s

FAILED (errors=1)

I am very curious why it works when only accuracy is used as a metric when compiling the model (see my first comment). It may be a quick fix that makes TensorFlow 1.15 (and thus the free Colab) usable.

marco-willi commented 4 years ago

Thanks for your feedback. I'll close the issue since it is related to a version incompatibility. It may indeed be a quick fix to upgrade to TF 1.15. However, I currently don't have the time to do it myself, and it is unclear whether further problems would arise.