intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, MiniCPM, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, GraphRAG, DeepSpeed, vLLM, FastChat, Axolotl, etc.
Apache License 2.0

IndexError: list index out of range when training a YOLOv3 model on my own dataset #4803

Open xunaichao opened 2 years ago

xunaichao commented 2 years ago

When I use my own dataset to train a YOLOv3 model, the following error pops up:

(bigdl) root@f20374a828c4:/opt/model_server.git/0b8a4a592dc149d79d8a18f33d3c6ef4/yoloV3# python /opt/yolov3/yolov3_bigdl.py --data_dir /opt/model_server.git/0b8a4a592dc149d79d8a18f33d3c6ef4/yoloV3 --weights /opt/model_server.git/resource/yolov3.weights --names /opt/model_server.git/0b8a4a592dc149d79d8a18f33d3c6ef4/cvat.names --class_num 2
2022-06-09 09:41:12.184323: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2022-06-09 09:41:12.188888: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-06-09 09:41:12.188904: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
Initializing orca context
Current pyspark location is : /usr/local/miniconda3/envs/bigdl/lib/python3.7/site-packages/pyspark/__init__.py
Start to getOrCreate SparkContext
pyspark_submit_args is: --driver-class-path /usr/local/miniconda3/envs/bigdl/lib/python3.7/site-packages/bigdl/share/core/lib/all-2.1.0-20220314.094552-2.jar:/usr/local/miniconda3/envs/bigdl/lib/python3.7/site-packages/bigdl/share/friesian/lib/bigdl-friesian-spark_2.4.6-2.1.0-SNAPSHOT-jar-with-dependencies.jar:/usr/local/miniconda3/envs/bigdl/lib/python3.7/site-packages/bigdl/share/dllib/lib/bigdl-dllib-spark_2.4.6-2.1.0-SNAPSHOT-jar-with-dependencies.jar:/usr/local/miniconda3/envs/bigdl/lib/python3.7/site-packages/bigdl/share/orca/lib/bigdl-orca-spark_2.4.6-2.1.0-SNAPSHOT-jar-with-dependencies.jar pyspark-shell
2022-06-09 09:41:15 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-06-09 09:41:15 WARN Utils:66 - Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2022-06-09 09:41:16,793 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-09 09:41:16,795 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-09 09:41:16,795 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
2022-06-09 09:41:16,796 Thread-4 WARN The bufferSize is set to 4000 but bufferedIo is false: false
22-06-09 09:41:16 [Thread-4] INFO Engine$:121 - Auto detect executor number and executor cores number
22-06-09 09:41:16 [Thread-4] INFO Engine$:123 - Executor number is 1 and executor cores number is 4

User settings:

KMP_AFFINITY=granularity=fine,compact,1,0 KMP_BLOCKTIME=0 KMP_SETTINGS=1 OMP_NUM_THREADS=1

Effective settings:

KMP_ABORT_DELAY=0 KMP_ADAPTIVE_LOCK_PROPS='1,1024' KMP_ALIGN_ALLOC=64 KMP_ALL_THREADPRIVATE=416 KMP_ATOMIC_MODE=2 KMP_BLOCKTIME=0 KMP_CPUINFO_FILE: value is not defined KMP_DETERMINISTIC_REDUCTION=false KMP_DEVICE_THREAD_LIMIT=2147483647 KMP_DISP_HAND_THREAD=false KMP_DISP_NUM_BUFFERS=7 KMP_DUPLICATE_LIB_OK=false KMP_FORCE_REDUCTION: value is not defined KMP_FOREIGN_THREADS_THREADPRIVATE=true KMP_FORKJOIN_BARRIER='2,2' KMP_FORKJOIN_BARRIER_PATTERN='hyper,hyper' KMP_FORKJOIN_FRAMES=true KMP_FORKJOIN_FRAMES_MODE=3 KMP_GTID_MODE=3 KMP_HANDLE_SIGNALS=false KMP_HOT_TEAMS_MAX_LEVEL=1 KMP_HOT_TEAMS_MODE=0 KMP_INIT_AT_FORK=true KMP_ITT_PREPARE_DELAY=0 KMP_LIBRARY=throughput KMP_LOCK_KIND=queuing KMP_MALLOC_POOL_INCR=1M KMP_MWAIT_HINTS=0 KMP_NUM_LOCKS_IN_BLOCK=1 KMP_PLAIN_BARRIER='2,2' KMP_PLAIN_BARRIER_PATTERN='hyper,hyper' KMP_REDUCTION_BARRIER='1,1' KMP_REDUCTION_BARRIER_PATTERN='hyper,hyper' KMP_SCHEDULE='static,balanced;guided,iterative' KMP_SETTINGS=true KMP_SPIN_BACKOFF_PARAMS='4096,100' KMP_STACKOFFSET=64 KMP_STACKPAD=0 KMP_STACKSIZE=8M KMP_STORAGE_MAP=false KMP_TASKING=2 KMP_TASKLOOP_MIN_TASKS=0 KMP_TASK_STEALING_CONSTRAINT=1 KMP_TEAMS_THREAD_LIMIT=104 KMP_TOPOLOGY_METHOD=all KMP_USER_LEVEL_MWAIT=false KMP_USE_YIELD=1 KMP_VERSION=false KMP_WARNINGS=true OMP_AFFINITY_FORMAT='OMP: pid %P tid %i thread %n bound to OS proc set {%A}' OMP_ALLOCATOR=omp_default_mem_alloc OMP_CANCELLATION=false OMP_DEBUG=disabled OMP_DEFAULT_DEVICE=0 OMP_DISPLAY_AFFINITY=false OMP_DISPLAY_ENV=false OMP_DYNAMIC=false OMP_MAX_ACTIVE_LEVELS=2147483647 OMP_MAX_TASK_PRIORITY=0 OMP_NESTED=false OMP_NUM_THREADS='1' OMP_PLACES: value is not defined OMP_PROC_BIND='intel' OMP_SCHEDULE='static' OMP_STACKSIZE=8M OMP_TARGET_OFFLOAD=DEFAULT OMP_THREAD_LIMIT=2147483647 OMP_TOOL=enabled OMP_TOOL_LIBRARIES: value is not defined OMP_WAIT_POLICY=PASSIVE KMP_AFFINITY='noverbose,warnings,respect,granularity=fine,compact,1,0'

22-06-09 09:41:17 [Thread-4] INFO ThreadPool$:95 - Set mkl threads to 1 on thread 30
2022-06-09 09:41:17 WARN SparkContext:66 - Using an existing SparkContext; some configuration may not take effect.
22-06-09 09:41:17 [Thread-4] INFO Engine$:456 - Find existing spark context. Checking the spark conf...
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.Sample
BigDLBasePickler registering: bigdl.dllib.utils.common Sample
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.EvaluatedResult
BigDLBasePickler registering: bigdl.dllib.utils.common EvaluatedResult
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JTensor
BigDLBasePickler registering: bigdl.dllib.utils.common JTensor
cls.getname: com.intel.analytics.bigdl.dllib.utils.python.api.JActivity
BigDLBasePickler registering: bigdl.dllib.utils.common JActivity
Successfully got a SparkContext
2022-06-09 09:41:19,965 INFO services.py:1340 -- View the Ray dashboard at http://172.26.0.8:8266
2022-06-09 09:41:19,970 WARNING services.py:1826 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
{'node_ip_address': '172.26.0.8', 'raylet_ip_address': '172.26.0.8', 'redis_address': '172.26.0.8:49616', 'object_store_address': '/tmp/ray/session_2022-06-09_09-41-17_510479_2729502/sockets/plasma_store', 'raylet_socket_name': '/tmp/ray/session_2022-06-09_09-41-17_510479_2729502/sockets/raylet', 'webui_url': '172.26.0.8:8266', 'session_dir': '/tmp/ray/session_2022-06-09_09-41-17_510479_2729502', 'metrics_export_port': 56885, 'node_id': '0469d674a132d5d5b4423676a047f004d71990870d485a38644df9ad'}
2022-06-09 09:41:21.948240: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2022-06-09 09:41:21.948272: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-06-09 09:41:21.948286: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (f20374a828c4): /proc/driver/nvidia/version does not exist
2022-06-09 09:41:21.948518: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(raylet) /usr/local/miniconda3/envs/bigdl/lib/python3.7/site-packages/ray/dashboard/agent.py:152: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
(raylet)   if LooseVersion(aiohttp.__version__) < LooseVersion("4.0.0"):
{'dag': 0, 'cat': 1}
/tmp/tmpy_e7k8rj/train_dataset
/tmp/tmpy_e7k8rj/val_dataset
Traceback (most recent call last):
  File "/opt/yolov3/yolov3_bigdl.py", line 743, in <module>
    main()
  File "/opt/yolov3/yolov3_bigdl.py", line 652, in main
    splits_names=[(options.data_year, options.split_name_train)], classes=class_map)
  File "/usr/local/miniconda3/envs/bigdl/lib/python3.7/site-packages/bigdl/orca/data/image/parquet_dataset.py", line 337, in write_parquet
    func(output_path=output_path, *args, **kwargs)
  File "/usr/local/miniconda3/envs/bigdl/lib/python3.7/site-packages/bigdl/orca/data/image/parquet_dataset.py", line 303, in write_voc
    image, label = voc_datasets[0]
  File "/usr/local/miniconda3/envs/bigdl/lib/python3.7/site-packages/bigdl/orca/data/image/voc_dataset.py", line 82, in __getitem__
    img_id = self._imgid_items[idx]
IndexError: list index out of range
Stopping orca context

The code is attached below: data.zip

Le-Zheng commented 2 years ago

Hi xunaichao, this may be because the labels in the input data are inconsistent with the class number you passed in. Please check that the labels in your annotations match the entries in voc.names and that class_num matches the number of entries.
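
The check suggested above can be sketched in a few lines of Python. This is only a minimal sketch, not part of BigDL: the function names (read_names, labels_in_annotations, check_labels) are hypothetical, and it assumes VOC-style XML annotations with `<object><name>` elements and a names file with one class per line. Note that the log above prints the class map {'dag': 0, 'cat': 1}, so if the annotations use a label such as `dog`, a spelling difference like this is the kind of mismatch the check would report.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_names(names_file):
    """Return the non-empty lines of a class-names file (assumed: one class per line)."""
    text = Path(names_file).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def labels_in_annotations(annotations_dir):
    """Collect every <object><name> value found in the VOC-style XML files."""
    labels = set()
    for xml_path in sorted(Path(annotations_dir).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        for obj in root.iter("object"):
            name = obj.findtext("name")
            if name:
                labels.add(name.strip())
    return labels


def check_labels(annotations_dir, names_file, class_num):
    """Return a list of human-readable mismatches (empty if none were found)."""
    names = read_names(names_file)
    labels = labels_in_annotations(annotations_dir)
    problems = []
    if len(names) != class_num:
        problems.append(
            f"names file lists {len(names)} classes, class_num is {class_num}"
        )
    unknown = sorted(labels - set(names))
    if unknown:
        problems.append(f"labels missing from names file: {unknown}")
    if not labels:
        problems.append("no <object> labels found in any annotation file")
    return problems
```

Calling check_labels with the Annotations directory of the dataset, the file passed as --names, and the value passed as --class_num returns an empty list when the three agree.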

xunaichao commented 2 years ago

@Le-Zheng we now know that the dataset is the problem. The annotations we use were created in CVAT and exported in the PASCAL VOC 1.1 format, but this kind of XML does not work. What should I do? My own dataset is attached here: dog-cat.zip

Le-Zheng commented 2 years ago

Hi xunaichao, our example is based on the standard VOC dataset. You need to convert your data to the same VOC XML annotation format as the one used in http://host.robots.ox.ac.uk/pascal/VOC/voc2009/VOCtrainval_11-May-2009.tar.
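
Before converting, it may help to compare the exported dataset against the directory layout of the standard VOC archive. The sketch below is an assumption-laden illustration, not BigDL code: check_voc_layout is a hypothetical helper, and it assumes the usual VOCdevkit-style tree (`<root>/VOC<year>/ImageSets/Main/<split>.txt`, `Annotations/<id>.xml`, `JPEGImages/<id>.jpg`). Whether the BigDL loader reads exactly these paths is also an assumption. A missing or empty split file would leave the list of image ids empty, which would be consistent with the IndexError raised at `self._imgid_items[idx]` in the traceback.

```python
from pathlib import Path


def check_voc_layout(root, year, split):
    """Report deviations from an assumed VOCdevkit-style layout.

    Assumed layout (hypothetical helper, not a BigDL API):
      <root>/VOC<year>/ImageSets/Main/<split>.txt
      <root>/VOC<year>/Annotations/<id>.xml
      <root>/VOC<year>/JPEGImages/<id>.jpg
    """
    base = Path(root) / f"VOC{year}"
    split_file = base / "ImageSets" / "Main" / f"{split}.txt"
    if not split_file.is_file():
        return [f"missing split file: {split_file}"]

    # The first token on each line is taken as the image id; any trailing
    # tokens (for example per-class flags) are ignored.
    ids = [
        line.split()[0]
        for line in split_file.read_text().splitlines()
        if line.strip()
    ]
    if not ids:
        return [f"split file is empty: {split_file}"]

    problems = []
    for img_id in ids:
        if not (base / "Annotations" / f"{img_id}.xml").is_file():
            problems.append(f"missing annotation for id {img_id}")
        if not (base / "JPEGImages" / f"{img_id}.jpg").is_file():
            problems.append(f"missing image for id {img_id}")
    return problems
```

An empty return value means every id in the split file has both an annotation and an image under the assumed layout; the year and split arguments would correspond to whatever the training script passes as its data year and split name.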