leondgarse / Keras_insightface

Insightface Keras implementation
MIT License

DNN library is not found issue #130

Open nezarelkadyy opened 10 months ago

nezarelkadyy commented 10 months ago

I am having trouble running the training code on the CASIA-WebFace dataset. It always fails with the following error:

2024-01-17 15:41:15.073468: E tensorflow/stream_executor/cuda/cuda_dnn.cc:398] Possibly insufficient driver version: 460.106.0
2024-01-17 15:41:15.073521: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_ops.cc:1120 : UNIMPLEMENTED: DNN library is not found.
Traceback (most recent call last):
  File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/run_train.py", line 23, in <module>
    tt.train(sch, 0)
  File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 545, in train
    self.train_single_scheduler(**sch, initial_epoch=initial_epoch)
  File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 525, in train_single_scheduler
    self.__basic_train__(initial_epoch + epoch, initial_epoch=initial_epoch)
  File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 416, in __basic_train__
    self.model.fit(
  File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnimplementedError: Graph execution error:

Detected at node 'model/0_conv/Conv2D' defined at (most recent call last):
    File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/run_train.py", line 23, in <module>
      tt.train(sch, 0)
    File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 545, in train
      self.train_single_scheduler(**sch, initial_epoch=initial_epoch)
    File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 525, in train_single_scheduler
      self.__basic_train__(initial_epoch + epoch, initial_epoch=initial_epoch)
    File "/home/nezar/Synapse/Docxter/Training_Codes/Keras_insightface/train.py", line 416, in __basic_train__
      self.model.fit(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 915, in __call__
      result = self._call(*args, **kwds)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 980, in _call
      return self._stateless_fn(*args, **kwds)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2452, in __call__
      filtered_flat_args) = self._maybe_define_function(args, kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2711, in _maybe_define_function
      graph_function = self._create_graph_function(args, kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2627, in _create_graph_function
      func_graph_module.func_graph_from_py_func(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 1141, in func_graph_from_py_func
      func_outputs = python_func(*func_args, **func_kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 677, in wrapped_fn
      out = weak_wrapped_fn().__wrapped__(*args, **kwds)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 1116, in autograph_handler
      return autograph.converted_call(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 1040, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 1312, in run
      return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2888, in call_for_each_replica
      return self._call_for_each_replica(fn, args, kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 3689, in _call_for_each_replica
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 1030, in run_step
      outputs = model.train_step(data)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 889, in train_step
      y_pred = self(x, training=True)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/training.py", line 490, in __call__
      return super().__call__(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1014, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/functional.py", line 458, in call
      return self._run_internal_graph(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/functional.py", line 596, in _run_internal_graph
      outputs = node.layer(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/engine/base_layer.py", line 1014, in __call__
      outputs = call_fn(inputs, *args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 92, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/layers/convolutional/base_conv.py", line 250, in call
      outputs = self.convolution_op(inputs, self.kernel)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/keras/layers/convolutional/base_conv.py", line 225, in convolution_op
      return tf.nn.convolution(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
      return fn(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/util/dispatch.py", line 1082, in op_dispatch_handler
      return dispatch_target(*args, **kwargs)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 1150, in convolution_v2
      return convolution_internal(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 1282, in convolution_internal
      return op(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/ops/nn_ops.py", line 2756, in _conv2d_expanded_batch
      return gen_nn_ops.conv2d(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 969, in conv2d
      _, _, _op, _outputs = _op_def_library._apply_op_helper(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/op_def_library.py", line 797, in _apply_op_helper
      op = g._create_op_internal(op_type_name, inputs, dtypes=None,
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/func_graph.py", line 694, in _create_op_internal
      return super(FuncGraph, self)._create_op_internal(  # pylint: disable=protected-access
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 3754, in _create_op_internal
      ret = Operation(
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 2133, in __init__
      self._traceback = tf_stack.extract_stack_for_node(self._c_op)
    File "/home/nezar/.virtualenvs/venv_keras_arcface/lib/python3.8/site-packages/tensorflow/python/util/tf_stack.py", line 183, in extract_stack_for_node
      return _tf_stack.extract_stack_for_node(
Node: 'model/0_conv/Conv2D'
DNN library is not found.
     [[{{node model/0_conv/Conv2D}}]] [Op:__inference_train_function_27816]

Note that I have installed CUDA 11.2, cuDNN 8.1, TensorFlow 2.9.1, and tensorflow_addons 0.17.0. The code used for training is as follows:

import tensorflow_addons as tfa
import train, losses, models
import os

data_basic_path = '/home/nezar/Data/v2/faces_webface_112x112'
data_path = data_basic_path + '_112x112_folders'
eval_paths = [os.path.join(data_basic_path, ii) for ii in ['lfw.bin', 'cfp_fp.bin', 'agedb_30.bin']]

""" First, Train with `lossTopK = 3` """
basic_model = models.buildin_models("r34", dropout=0, emb_shape=256, output_layer='E')
tt = train.Train(data_path, save_path='TT_resnet34_topk_bs256.h5', eval_paths=eval_paths,
                 basic_model=basic_model, model=None, lr_base=0.1, lr_decay=0.1, lr_decay_steps=[20, 30],
                 batch_size=16, random_status=0,
                 # output_wd_multiply=1
                 )

optimizer = tfa.optimizers.SGDW(learning_rate=0.1, weight_decay=5e-4, momentum=0.9)
sch = [
    {"loss": losses.ArcfaceLoss(scale=16), "epoch": 5, "optimizer": optimizer, "lossTopK": 3},
    {"loss": losses.ArcfaceLoss(scale=32), "epoch": 5, "lossTopK": 3},
    {"loss": losses.ArcfaceLoss(scale=64), "epoch": 40, "lossTopK": 3},
]
tt.train(sch, 0)

What could be the potential problem here?
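
One way to compare the versions listed above with what the installed TensorFlow wheel was actually built against is sketched below. This is only a diagnostic aid under stated assumptions, not a fix: `describe_gpu_setup` is a hypothetical helper name, and the CUDA/cuDNN keys are read with `.get` because they may be absent on CPU-only builds.

```python
import importlib


def describe_gpu_setup():
    """Collect the TensorFlow version, its build-time CUDA/cuDNN versions, and visible GPUs."""
    info = {}
    try:
        tf = importlib.import_module("tensorflow")
    except ImportError:
        # TensorFlow is not installed in this interpreter; nothing more to report.
        info["tensorflow"] = None
        return info

    info["tensorflow"] = tf.__version__
    # Build-time versions: the CUDA/cuDNN releases the wheel was compiled against.
    build = tf.sysconfig.get_build_info()
    info["built_cuda"] = build.get("cuda_version")
    info["built_cudnn"] = build.get("cudnn_version")
    # Runtime view: GPUs that TensorFlow can actually see.
    info["gpus"] = [d.name for d in tf.config.list_physical_devices("GPU")]
    return info


if __name__ == "__main__":
    for key, value in describe_gpu_setup().items():
        print(f"{key}: {value}")
```

If the build-time values differ from the CUDA and cuDNN versions installed on the machine, or the GPU list is empty, that points to the environment rather than the training script.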

nezarelkadyy commented 10 months ago

The dataset and the model both seem to load successfully, as shown in the output below, which is printed by your code. The error I mentioned in my question appears right after these lines:

>>>> L2 regularizer value from basic_model: 0
>>>> Init type by loss function name...
>>>> Train arcface...
>>>> Init softmax dataset...
>>>> reloaded from dataset backup: faces_webface_112x112_112x112_folders_shuffle.npz
>>>> Loaded data image_names: 490623 image_classes: 490623 embeddings: 0 classes: 10572
>>>> Image length: 490623, Image class length: 490623, classes: 10572
>>>> Use specified optimizer: <tensorflow_addons.optimizers.weight_decay_optimizers.SGDW object at 0x7fb1f39c4ca0>
>>>> Append weight decay callback...
>>>> Add arcface layer, arc_kwargs={'loss_top_k': 3, 'append_norm': False, 'partial_fc_split': 0, 'name': 'arcface'}, vpl_kwargs={'vpl_lambda': 0.15, 'start_iters': -30663, 'allowed_delta': 200}...
>>>> loss_weights: {'arcface': 1}

Learning rate for iter 1 is 0.1
Weight decay is 0.0005000000237487257
Epoch 1/5
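
The log stops at the first training step, and the failing node is the model's first convolution. Running a single convolution outside the training code may therefore help separate an environment problem from a problem in the script. The sketch below is only a check and does not identify the cause: `conv_smoke_test` is a hypothetical helper, and the tensor shapes are arbitrary, chosen only to resemble a 112x112 RGB input.

```python
def conv_smoke_test():
    """Run one small Conv2D and report whether it succeeds."""
    try:
        import tensorflow as tf
    except ImportError:
        return "tensorflow not installed"

    # One fake 112x112 RGB image and a 3x3 kernel with 8 output channels.
    x = tf.random.normal([1, 112, 112, 3])
    k = tf.random.normal([3, 3, 3, 8])
    try:
        # Eager execution places this on the GPU when one is visible,
        # which is where the cuDNN-backed kernel would be used.
        y = tf.nn.conv2d(x, k, strides=1, padding="SAME")
    except tf.errors.OpError as err:
        # UnimplementedError ("DNN library is not found") is a subclass of OpError.
        return f"conv2d failed: {err.message}"
    return f"conv2d ok, output shape {tuple(y.shape)}"


if __name__ == "__main__":
    print(conv_smoke_test())
```

If this small test fails with the same message, the training script and dataset are unlikely to be involved.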
leondgarse commented 10 months ago