mlcommons / training

Reference implementations of MLPerf™ training benchmarks
https://mlcommons.org/en/groups/training
Apache License 2.0

Stable diffusion training test failed at module 'cv2.dnn' has no attribute 'DictValue' #687

Closed billcsm closed 7 months ago

billcsm commented 11 months ago

I followed the Stable Diffusion README.md to build the Docker image and launch the container without any issues. But when I ran the training test:

./run_and_time.sh \
  --num-nodes 1 \
  --gpus-per-node 8 \
  --checkpoint /checkpoints/sd/512-base-ema.ckpt \
  --results-dir /results \
  --config configs/train_01x08x08.yaml

The test failed with the following error:

STARTING TIMING RUN AT 2023-10-30 05:14:22 PM
:::MLLOG {"namespace": "", "time_ms": 1698686065720, "event_type": "POINT_IN_TIME", "key": "cache_clear", "value": true, "metadata": {"file": "", "lineno": 4}}
Traceback (most recent call last):
  File "main.py", line 37, in <module>
    from ldm.data.base import Txt2ImgIterableBaseDataset
  File "/pwd/ldm/data/base.py", line 5, in <module>
    import cv2
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 181, in <module>
    bootstrap()
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 175, in bootstrap
    if __load_extra_py_code_for_module("cv2", submodule, DEBUG):
  File "/usr/local/lib/python3.8/dist-packages/cv2/__init__.py", line 28, in __load_extra_py_code_for_module
    py_module = importlib.import_module(module_name)
  File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/usr/local/lib/python3.8/dist-packages/cv2/typing/__init__.py", line 169, in <module>
    LayerId = cv2.dnn.DictValue
AttributeError: module 'cv2.dnn' has no attribute 'DictValue'
root@86dc179b127c:/pwd#

I then did some further troubleshooting and found that the issue is caused by albumentations. Before installing albumentations in the container, I ran python and import cv2, and everything worked fine. After pip install albumentations, the failure was reproduced. Could someone give me a hint on how to fix it? Thank you.
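
For anyone debugging the same symptom, a small diagnostic sketch follows. The thread does not confirm the root cause. The sketch assumes that installing albumentations pulled in a second OpenCV wheel alongside the one already in the image. The four PyPI wheels listed in the code all install into the same cv2 directory, so two of them at different versions can leave a mix of files behind. The helper names (find_opencv_wheels, installed_distributions) are hypothetical and exist only for this example.

```python
from importlib import metadata

# The PyPI wheels that all install into the same `cv2` package directory.
OPENCV_WHEELS = {
    "opencv-python",
    "opencv-python-headless",
    "opencv-contrib-python",
    "opencv-contrib-python-headless",
}


def find_opencv_wheels(installed):
    """Return sorted (name, version) pairs for OpenCV wheels in `installed`.

    `installed` is any iterable of (distribution name, version) pairs.
    """
    found = []
    for name, version in installed:
        normalized = name.lower().replace("_", "-")
        if normalized in OPENCV_WHEELS:
            found.append((normalized, version))
    return sorted(found)


def installed_distributions():
    """Yield (name, version) for every distribution visible to this interpreter."""
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip entries with broken or missing metadata
            yield name, dist.version


if __name__ == "__main__":
    wheels = find_opencv_wheels(installed_distributions())
    for name, version in wheels:
        print(f"{name}=={version}")
    if len(wheels) > 1:
        print("More than one OpenCV wheel is installed; they share the cv2 folder.")
```

Running this inside the container before and after pip install albumentations would show whether a second OpenCV wheel appeared.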

billcsm commented 9 months ago

I have solved this problem in the following steps:

  1. Run pip uninstall for every installed OpenCV package (found with pip list).
  2. Remove the cv2 folder from the system python3.8/dist-packages folder.
  3. Run pip install for opencv-python and opencv-python-headless.

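
The three steps above can be sketched as a dry run that only builds the commands, so they can be reviewed before anything is deleted. This is a minimal sketch, not a tested fix. The function name cleanup_plan is hypothetical, the package list passed to it is an assumption about what pip list might show, and the dist-packages path is taken from the traceback in this issue.

```python
import shlex
from pathlib import PurePosixPath


def cleanup_plan(opencv_packages, dist_packages_dir, python="python3"):
    """Build (but do not run) the commands for the three steps."""
    cv2_dir = PurePosixPath(dist_packages_dir) / "cv2"
    return [
        # Step 1: uninstall every OpenCV package reported by `pip list`.
        [python, "-m", "pip", "uninstall", "-y", *opencv_packages],
        # Step 2: remove any leftover cv2 folder.
        ["rm", "-rf", str(cv2_dir)],
        # Step 3: reinstall the two packages named in the comment above.
        [python, "-m", "pip", "install", "opencv-python", "opencv-python-headless"],
    ]


if __name__ == "__main__":
    plan = cleanup_plan(
        ["opencv-python", "opencv-python-headless"],  # assumed pip list result
        "/usr/local/lib/python3.8/dist-packages",  # path from the traceback
    )
    for command in plan:
        print(shlex.join(command))
```

Printing the plan rather than executing it is deliberate: step 2 deletes a directory, so it is worth checking the path by hand first.
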
ahmadki commented 7 months ago

Thanks for the bug report @billcsm

https://github.com/mlcommons/training/pull/702 should fix the OpenCV issue.