hailo-ai / hailo_model_zoo

The Hailo Model Zoo includes pre-trained models and a full building and evaluation environment
MIT License

Problem running post quantization fine-tuning #60

Open trivedisarthak opened 11 months ago

trivedisarthak commented 11 months ago

Hi,

I'm trying to convert a YOLOv7 model trained on the CrowdHuman dataset to HEF. I followed the optimization tutorial (https://hailo.ai/developer-zone/documentation/dataflow-compiler-v3-24-0/?sp_referrer=DFC_2_Model_Optimization_Tutorial.html) and I'm optimizing the network with the optimization level set to 2, with all other options set according to the alls files provided in the Hailo Model Zoo for YOLOv7. I'm using the latest Hailo Software Suite Docker image. I get the following error:
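For context, my convert.py follows roughly the flow below (a minimal sketch of what its optimizer_har() step does, with assumed file names, an assumed hailo8 target, and a paraphrased one-line model script; the real script loads the full yolov7 alls from the Model Zoo):

import numpy as np
from hailo_sdk_client import ClientRunner

# Assumed calibration file: ~1500 preprocessed 640x640x3 CrowdHuman images
calib_dataset = np.load("crowdhuman_calib.npy")

runner = ClientRunner(hw_arch="hailo8")               # target is an assumption
runner.translate_onnx_model("yolov7.onnx", "yolov7")  # parsing step from the log above
runner.load_model_script("model_optimization_flavor(optimization_level=2)\n")
runner.optimize(calib_dataset)                        # fails inside the Fine Tune stage
runner.save_har("yolov7_optimized.har")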

[info] Translation completed on ONNX model yolov7
[2023-07-17 10:01:27,146][hailo_sdk.client][INFO] - Translation completed on ONNX model yolov7
[info] Initialized runner for yolov7
[2023-07-17 10:01:27,703][hailo_sdk.client][INFO] - Initialized runner for yolov7
[info] Loading model script to yolov7 from string
[2023-07-17 10:01:31,186][hailo_sdk.client][INFO] - Loading model script to yolov7 from string
[info] Starting Model Optimization
[2023-07-17 10:03:11,265][hailo_sdk.client][IMPORTANT] - Starting Model Optimization
2023-07-17 10:03:11.617384: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.624954: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.625081: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.625693: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-07-17 10:03:11.626656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.626763: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.626856: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.938253: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.938391: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.938492: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:975] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-07-17 10:03:11.938559: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:42] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-07-17 10:03:11.938578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 19157 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:08:00.0, compute capability: 8.6
[info] Using calibration set of 1500 entries
[2023-07-17 10:03:13,403][hailo_sdk.client][INFO] - Using calibration set of 1500 entries
[info] Assigning 16bit activation to output layer yolov7/output_layer3
[2023-07-17 10:03:13,405][hailo_sdk.client][INFO] - Assigning 16bit activation to output layer yolov7/output_layer3
[info] Assigning 16bit activation to output layer yolov7/output_layer2
[2023-07-17 10:03:13,407][hailo_sdk.client][INFO] - Assigning 16bit activation to output layer yolov7/output_layer2
[info] Starting auto 4bit weights
[2023-07-17 10:03:13,408][hailo_sdk.client][INFO] - Starting auto 4bit weights
[info] Assigning 4bit weights to layer yolov7/conv91 with 4719.62k parameters
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Assigning 4bit weights to layer yolov7/conv91 with 4719.62k parameters
[info] Assigning 4bit weights to layer yolov7/conv35 with 2359.81k parameters
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Assigning 4bit weights to layer yolov7/conv35 with 2359.81k parameters
[info] Assigning 4bit weights to layer yolov7/conv46 with 2359.81k parameters
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Assigning 4bit weights to layer yolov7/conv46 with 2359.81k parameters
[info] Ratio of weights in 4bit is 0.26
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Ratio of weights in 4bit is 0.26
[info] auto4bit completion time 00:00:00.00
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - auto4bit completion time 00:00:00.00
[info] Auto 4bit weights is done
[2023-07-17 10:03:13,412][hailo_sdk.client][INFO] - Auto 4bit weights is done
[info] Starting Stats Collector
[2023-07-17 10:03:17,383][acceleras][INFO] - Starting Stats Collector
Calibration:   0%|                                                                                                                                                                                                                                     | 0/1500 [00:00<?, ?entries/s]2023-07-17 10:03:19.256664: I tensorflow/stream_executor/cuda/cuda_blas.cc:1786] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-07-17 10:03:19.970407: I tensorflow/stream_executor/cuda/cuda_dnn.cc:384] Loaded cuDNN version 8101
2023-07-17 10:03:20.385517: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-17 10:03:20.386139: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-17 10:03:20.386149: W tensorflow/stream_executor/gpu/asm_compiler.cc:80] Couldn't get ptxas version string: INTERNAL: Couldn't invoke ptxas --version
2023-07-17 10:03:20.386542: I tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-07-17 10:03:20.386575: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] INTERNAL: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2023-07-17 10:03:55.010791: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 314572800 exceeds 10% of free system memory.
2023-07-17 10:03:55.010824: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 314572800 exceeds 10% of free system memory.
Calibration: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1500/1500 [02:37<00:00,  9.55entries/s]
[info] Stats Collector is done (completion time is 00:02:38.38)
[2023-07-17 10:05:55,764][acceleras][INFO] - Stats Collector is done (completion time is 00:02:38.38)
[info] Bias Correction skipped
[2023-07-17 10:06:08,818][acceleras][INFO] - Bias Correction skipped
[info] Adaround skipped
[2023-07-17 10:06:08,821][acceleras][INFO] - Adaround skipped
[info] Starting Fine Tune
[2023-07-17 10:06:08,822][acceleras][INFO] - Starting Fine Tune
Epoch 1/6
2023-07-17 10:07:10.005476: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:903] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape inSelectV2_2-2-TransposeNHWCToNCHW-LayoutOptimizer
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
error: Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice
2023-07-17 10:07:16.352546: W tensorflow/core/framework/op_kernel.cc:1733] UNKNOWN: JIT compilation failed.
Error executing job with overrides: []
Traceback (most recent call last):
  File "convert.py", line 18, in main
    convert_obj.optimizer_har()
  File "convert.py", line 79, in optimizer_har
    self.runner.optimize(self.calib_dataset)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
    return func(self, *args, **kwargs)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/runner/client_runner.py", line 1783, in optimize
    self._optimize(calib_data, data_type=data_type, work_dir=work_dir)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
    return func(self, *args, **kwargs)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/runner/client_runner.py", line 1671, in _optimize
    self._sdk_backend.full_quantization(calib_data, data_type=data_type, work_dir=work_dir,
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 869, in full_quantization
    self._full_acceleras_run()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 1019, in _full_acceleras_run
    optimization_flow.run()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 101, in run
    self.post_quantization_optimization()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 129, in post_quantization_optimization
    self._finetune()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 265, in _finetune
    _, results = finetune.run()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/algorithm_base.py", line 119, in run
    self._run_int()
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 301, in _run_int
    self.run_qft(self._model_native, self._model, metrics=self.metrics)
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 358, in run_qft
    qft_distiller.fit(self.train_dataset, verbose=1,
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.UnknownError: Graph execution error:

Detected at node 'Adam/mod' defined at (most recent call last):
    File "convert.py", line 94, in <module>
      main()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/main.py", line 94, in decorated_main
      _run_hydra(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
      _run_app(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/utils.py", line 457, in _run_app
      run_and_report(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
      return func()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
      lambda: hydra.run(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/_internal/hydra.py", line 119, in run
      ret = run_job(
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hydra/core/utils.py", line 186, in run_job
      ret.return_value = task_function(task_cfg)
    File "convert.py", line 18, in main
      convert_obj.optimizer_har()
    File "convert.py", line 79, in optimizer_har
      self.runner.optimize(self.calib_dataset)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
      return func(self, *args, **kwargs)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/runner/client_runner.py", line 1783, in optimize
      self._optimize(calib_data, data_type=data_type, work_dir=work_dir)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_common/states/states.py", line 16, in wrapped_func
      return func(self, *args, **kwargs)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/runner/client_runner.py", line 1671, in _optimize
      self._sdk_backend.full_quantization(calib_data, data_type=data_type, work_dir=work_dir,
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 869, in full_quantization
      self._full_acceleras_run()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_sdk_client/sdk_backend/sdk_backend.py", line 1019, in _full_acceleras_run
      optimization_flow.run()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 101, in run
      self.post_quantization_optimization()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 129, in post_quantization_optimization
      self._finetune()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/flows/optimization_flow.py", line 265, in _finetune
      _, results = finetune.run()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/algorithm_base.py", line 119, in run
      self._run_int()
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 301, in _run_int
      self.run_qft(self._model_native, self._model, metrics=self.metrics)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 358, in run_qft
      qft_distiller.fit(self.train_dataset, verbose=1,
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 64, in error_handler
      return fn(*args, **kwargs)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/engine/training.py", line 1409, in fit
      tmp_logs = self.train_function(iterator)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/engine/training.py", line 1051, in train_function
      return step_function(self, iterator)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/engine/training.py", line 1040, in step_function
      outputs = model.distribute_strategy.run(run_step, args=(data,))
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/engine/training.py", line 1030, in run_step
      outputs = model.train_step(data)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/acceleras/model/distiller.py", line 109, in train_step
      self.optimizer.apply_gradients(zip(gradients_f, trainable_vars_f))
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 672, in apply_gradients
      apply_state = self._prepare(var_list)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 992, in _prepare
      self._prepare_local(var_device, var_dtype, apply_state)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/adam.py", line 130, in _prepare_local
      super(Adam, self)._prepare_local(var_device, var_dtype, apply_state)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 998, in _prepare_local
      lr_t = tf.identity(self._decayed_lr(var_dtype))
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/keras/optimizers/optimizer_v2/optimizer_v2.py", line 1056, in _decayed_lr
      lr_t = tf.cast(lr_t(local_step), var_dtype)
    File "/local/workspace/hailo_virtualenv/lib/python3.8/site-packages/hailo_model_optimization/algorithms/finetune/qft.py", line 58, in __call__
      step = step % self.steps_per_epoch
Node: 'Adam/mod'
JIT compilation failed.
         [[{{node Adam/mod}}]] [Op:__inference_train_function_682919]
nadaved1 commented 11 months ago

Hi, have you checked that you have installed all the prerequisite packages outside the Docker container?


trivedisarthak commented 11 months ago

Yes, I have installed the device drivers. I can load and run the same script with optimization level 0 (i.e. without fine-tuning) and it works perfectly. I can also run the example demo (https://github.com/hailo-ai/Hailo-Application-Code-Examples) for YOLOv7 using the optimized model.

Additionally, TensorFlow installed in the container can see the GPU on the machine.
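For reference, the GPU check inside the container was along these lines (a minimal sketch; the actual output is in the screenshot below):

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
# Expected to print something like: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]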

[Screenshot from 2023-07-18 10-12-46: TensorFlow listing the GPU inside the container]

nadaved1 commented 11 months ago

Hi Sarthak, I specifically meant the NVIDIA packages:

distribution=$(. /etc/os-release; echo $ID$VERSION_ID) \
  && curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
  && curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker
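After restarting Docker, GPU access from inside a container can be verified with something like the following (the CUDA image tag is only an example):

sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi
# Should print the nvidia-smi table and list the RTX 3090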


trivedisarthak commented 11 months ago

Yes, that is already installed. TensorFlow can access the GPU from inside the docker container.

nadaved1 commented 11 months ago

hmm.. What is the GPU model?

trivedisarthak commented 11 months ago

Okay, I tried downgrading the Docker image from 2023.04 to 2022.10, and fine-tuning works there. Is this an internal bug in the Dataflow Compiler library? I'm using an RTX 3090.
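Side note: the "Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice" errors above are a TensorFlow XLA/JIT symptom. A commonly suggested workaround, which I have not verified in this setup, is to point XLA at the CUDA installation inside the container, for example:

export XLA_FLAGS=--xla_gpu_cuda_data_dir=/usr/local/cuda
# /usr/local/cuda is an assumed path; it needs to contain nvvm/libdevice/libdevice.10.bc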

nadaved1 commented 11 months ago

The Docker image that you're referring to is the suite? If so, does it work with the latest 2023.07?

trivedisarthak commented 11 months ago

Yes, I'm referring to the Software Suite Docker image. I'll try it with 2023.07 and let you know.