Closed valentinitnelav closed 1 year ago
I tried to debug this for 3 epochs using 2 GPUs, but I cannot reproduce the error. Everything ran without an error message; however, no confusion matrix was constructed, only the results.png, which the 300-epoch job had not produced.
#!/bin/bash
#SBATCH --job-name=debug_yolov4 # name for the job;
#SBATCH --partition=clara-job # Request for the Clara cluster;
#SBATCH --nodes=1 # Number of nodes;
#SBATCH --cpus-per-task=8 # Number of CPUs;
#SBATCH --gres=gpu:rtx2080ti:2 # Type and number of GPUs;
#SBATCH --mem-per-gpu=11G # RAM per GPU;
#SBATCH --time=01:00:00 # requested time in d-hh:mm:ss
#SBATCH --output=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.log # path for job-id.log file;
#SBATCH --error=/home/sc.uni-leipzig.de/sv127qyji/PAI/detectors/logs_train_jobs/%j.err # path for job-id.err file;
#SBATCH --mail-type=BEGIN,TIME_LIMIT,END # email options;
# Delete any cache files in the train and val dataset folders that were created from previous jobs.
# This is important when using different YOLO versions.
# See https://github.com/WongKinYiu/yolov7/blob/main/README.md#training
rm --force ~/datasets/P1_Data_sampled/train/*.cache
rm --force ~/datasets/P1_Data_sampled/val/*.cache
# Start with a clean environment
module purge
# Load the needed modules from the software tree (same ones used when we created the environment)
module load Python/3.8.6-GCCcore-10.2.0
# Activate virtual environment
source ~/venv/PyTorch_YOLOv4/bin/activate
# Call the helper script session_info.sh which will print in the *.log file info
# about the used environment and hardware.
source ~/PAI/scripts/cluster/session_info.sh PyTorch_YOLOv4
# The first and only argument here, passed to $1, is the environment name set at ~/venv/
# Use source instead of bash, so that session_info.sh describes the environment activated in this script
# (the parent script from which it is called). See https://askubuntu.com/a/965496/772524
# Train YOLO by calling train.py
cd ~/PAI/detectors/PyTorch_YOLOv4
python -m torch.distributed.launch --nproc_per_node 2 train.py \
--sync-bn \
--cfg ~/PAI/detectors/PyTorch_YOLOv4/cfg/yolov4-csp-s-leaky.cfg \
--weights ~/PAI/detectors/PyTorch_YOLOv4/weights/yolov4-csp-s-leaky.weights \
--data ~/PAI/scripts/config_yolov5.yaml \
--hyp ~/PAI/scripts/yolo_custom_hyp.yaml \
--epochs 3 \
--batch-size 16 \
--img-size 640 640 \
--workers 3 \
--name "$SLURM_JOB_ID"_debug_yolov4_pacsp_s_img640_b8_e3_hyp_custom
# Deactivate virtual environment
deactivate
The problem seems to be within the classes.
File "/PAI/detectors/PyTorch_YOLOv4/utils/plots.py", line 163, in plot_images
    cls = names[cls] if names else cls
Could it be that there are either too many or too few classes or class names (e.g., Background)?
@stark-t, I do not understand. Do you mean images without label files? That should not happen, no. And I use the same yaml file for YOLOv5 & YOLOv7, and there was no class name problem there.
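A quick way to double-check this on the dataset itself is to scan the YOLO-format label files for the highest class index and compare it with the number of names in the data yaml. This is only a sketch; the `max_class_index` helper and the paths are examples, not part of the repo:

```python
import glob
import os

def max_class_index(label_dir):
    """Return the highest class index found in YOLO-format label files
    in label_dir, or -1 if no labels exist.

    Each label line has the form: <class> <x_center> <y_center> <w> <h>.
    """
    max_idx = -1
    for path in glob.glob(os.path.join(label_dir, "*.txt")):
        with open(path) as f:
            for line in f:
                parts = line.split()
                if parts:
                    max_idx = max(max_idx, int(parts[0]))
    return max_idx

# Example usage (paths are illustrative):
# names = [...]  # the names list loaded from config_yolov5.yaml
# assert max_class_index(os.path.expanduser(
#     "~/datasets/P1_Data_sampled/train/labels")) < len(names)
```

If the highest index equals or exceeds `len(names)`, the yaml and the label files disagree, which would explain an index error during plotting.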
I just looked into def plot_images in utils/plots.py at the line cls = names[cls] if names else cls. It seems that names is initialized with names=None in the function signature. Then indexing it would give the error IndexError: list index out of range automatically if it is empty, no? I don't fully understand the code.
names = None
# cls = int(classes[j])
cls = 1
cls = names[cls] if names else cls  # to my surprise, this doesn't raise an error
# But this does:
names[cls]
# TypeError: 'NoneType' object is not subscriptable
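The conditional expression short-circuits: when names is None (falsy), names[cls] is never evaluated, so the line in plots.py cannot fail that way. For what it's worth, a defensive version of that lookup would avoid both failure modes, the TypeError on None and an IndexError on a list that is too short. This is only a sketch of the idea, not the repo's actual fix:

```python
def class_label(cls, names=None):
    """Return a display label for a class index.

    Falls back to the raw index when names is None, empty,
    or does not cover the index, so neither a TypeError
    nor an IndexError can be raised here.
    """
    if names and 0 <= cls < len(names):
        return names[cls]
    return cls

# class_label(1, None)       -> 1    (no TypeError)
# class_label(5, ["a", "b"]) -> 5    (no IndexError)
# class_label(1, ["a", "b"]) -> "b"
```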
We no longer maintain the YOLOv4 implementation.
For PyTorch_YOLOv4, pacsp-s weights, job ID 3217130, I just noticed that it was interrupted after the last epoch with the error message posted below. It might be that some of the diagnostic plots didn't make it into the results folder at
PAI/detectors/PyTorch_YOLOv4/runs/train/yolov4_pacsp_s_b8_e300_img640_hyp_custom
. The job script is this one: https://github.com/stark-t/PAI/blob/12bdeb3daff116cd1fbc24eac74e99af5d48fc12/scripts/cluster/yolov4_train_pacsp_s_640_rtx.sh