Closed valentinitnelav closed 1 year ago
I tried to debug this for 3 epochs using 2 GPUs, but I cannot reproduce the error. Everything ran without an error message; however, no confusion matrix was constructed, only the results.png, which the 300-epoch job had not produced.
#!/bin/bash
#SBATCH --job-name=debug_yolov4 # name for the job;
#SBATCH --partition=clara-job # Request for the Clara cluster;
#SBATCH --nodes=1 # Number of nodes;
#SBATCH --cpus-per-task=8 # Number of CPUs;
#SBATCH --gres=gpu:rtx2080ti:2 # Type and number of GPUs;
#SBATCH --mem-per-gpu=11G # RAM per GPU;
#SBATCH --time=01:00:00 # requested time in d-hh:mm:ss
#SBATCH --output=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.log # path for job-id.log file;
#SBATCH --error=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.err # path for job-id.err file;
#SBATCH --mail-type=BEGIN,TIME_LIMIT,END # email options;
# Delete any cache files in the train and val dataset folders that were created from previous jobs.
# This is important when using different YOLO versions.
# See https://github.com/WongKinYiu/yolov7/blob/main/README.md#training
rm --force ~/datasets/P1_Data_sampled/train/*.cache
rm --force ~/datasets/P1_Data_sampled/val/*.cache
# Start with a clean environment
module purge
# Load the needed modules from the software tree (same ones used when we created the environment)
module load Python/3.8.6-GCCcore-10.2.0
# Activate virtual environment
source ~/venv/PyTorch_YOLOv4/bin/activate
# Call the helper script session_info.sh which will print in the *.log file info
# about the used environment and hardware.
source ~/PAI/scripts/cluster/session_info.sh PyTorch_YOLOv4
# The first and only argument here, passed to $1, is the environment name set at ~/venv/
# Use source instead of bash, so that session_info.sh describes the environment activated in this script
# (the parent script from which it is called). See https://askubuntu.com/a/965496/772524
# Train YOLO by calling train.py
cd ~/PAI/detectors/PyTorch_YOLOv4
python -m torch.distributed.launch --nproc_per_node 2 train.py \
--sync-bn \
--cfg ~/PAI/detectors/PyTorch_YOLOv4/cfg/yolov4-csp-s-leaky.cfg \
--weights ~/PAI/detectors/PyTorch_YOLOv4/weights/yolov4-csp-s-leaky.weights \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/scripts/yolo_custom_hyp.yaml \
--epochs 3 \
--batch-size 16 \
--img-size 640 640 \
--workers 3 \
--name "$SLURM_JOB_ID"_debug_yolov4_pacsp_s_img640_b8_e3_hyp_custom
# Deactivate virtual environment
deactivate
The problem seems to be within the classes.
File "/PAI/detectors/PyTorch_YOLOv4/utils/plots.py", line 163, in plot_images
    cls = names[cls] if names else cls
Could it be that there are either too many or too few classes or class names (e.g., Background)?
@stark-t, I do not understand. Do you mean images without label files? That should not happen, no. And I use the same yaml file for YOLOv5 & YOLOv7, and there was no class name problem there.
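A quick way to double-check this on the dataset itself is to scan the YOLO-format label files for the highest class index and compare it with the number of names in the data yaml. This is only a sketch; the `max_class_index` helper and the paths are examples, not part of the repo:

```python
import glob
import os

def max_class_index(label_dir):
    """Return the highest class index found in YOLO-format label files
    in label_dir, or -1 if no labels exist.

    Each label line has the form: <class> <x_center> <y_center> <w> <h>.
    """
    max_idx = -1
    for path in glob.glob(os.path.join(label_dir, "*.txt")):
        with open(path) as f:
            for line in f:
                parts = line.split()
                if parts:
                    max_idx = max(max_idx, int(parts[0]))
    return max_idx

# Example usage (paths are illustrative):
# names = [...]  # the names list loaded from config_yolov5.yaml
# assert max_class_index(os.path.expanduser(
#     "~/datasets/P1_Data_sampled/train/labels")) < len(names)
```

If the highest index equals or exceeds `len(names)`, the yaml and the label files disagree, which would explain an index error during plotting.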
I just looked into def plot_images in utils/plots.py at the line cls = names[cls] if names else cls. It seems that names is initialized with names=None in the function signature. Then indexing it would give the error IndexError: list index out of range automatically if it is empty, no? I don't fully understand the code.
names = None
# cls = int(classes[j])
cls = 1
cls = names[cls] if names else cls  # to my surprise, this doesn't raise an error
# But this does:
names[cls]
# TypeError: 'NoneType' object is not subscriptable
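The conditional expression short-circuits: when names is None (falsy), names[cls] is never evaluated, so the line in plots.py cannot fail that way. For what it's worth, a defensive version of that lookup would avoid both failure modes, the TypeError on None and an IndexError on a list that is too short. This is only a sketch of the idea, not the repo's actual fix:

```python
def class_label(cls, names=None):
    """Return a display label for a class index.

    Falls back to the raw index when names is None, empty,
    or does not cover the index, so neither a TypeError
    nor an IndexError can be raised here.
    """
    if names and 0 <= cls < len(names):
        return names[cls]
    return cls

# class_label(1, None)       -> 1    (no TypeError)
# class_label(5, ["a", "b"]) -> 5    (no IndexError)
# class_label(1, ["a", "b"]) -> "b"
```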
We no longer maintain the YOLOv4 implementation.
For PyTorch_YOLOv4, pacsp-s weights, job ID 3217130, I just noticed that it was interrupted after the last epoch with the error message posted below. It might be that some of the diagnostic plots didn't make it into the results folder at
PAI/detectors/PyTorch_YOLOv4/runs/train/yolov4_pacsp_s_b8_e300_img640_hyp_custom
. The job script is this one: https://github.com/stark-t/PAI/blob/12bdeb3daff116cd1fbc24eac74e99af5d48fc12/scripts/cluster/yolov4_train_pacsp_s_640_rtx.sh