hhk7734 / tensorflow-yolov4

YOLOv4 Implemented in Tensorflow 2.
MIT License

OOM when loading yolov4 for training on 8gb gpu. #75

Open necronomican opened 3 years ago

necronomican commented 3 years ago

Hello @hhk7734. When training the YOLOv4 model with the yolov4.conv.137 weights file, I get an OOM with batch_size = 8 and input_size = 416. I am using version 2.1.0 of the yolov4 package from this repo.

I have changed


physical_devices = tf.config.experimental.list_physical_devices("GPU")
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)

to

try:
    # Cap GPU memory at 6 GB instead of letting TF grab the whole card.
    tf.config.experimental.set_virtual_device_configuration(
        physical_devices[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024 * 6)],
    )
except RuntimeError:
    pass

because set_memory_growth doesn't seem to work on my setup.
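
In case the ordering matters: my understanding is that set_memory_growth only takes effect if it runs before anything initializes the GPU, i.e. before YOLOv4() is built. A minimal sketch of that pattern (assuming TensorFlow 2.2, not tested here):

import tensorflow as tf

# Must run before the first op touches the GPU, i.e. before the model is built.
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Raised if the GPU context was already initialized.
        print(e)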

The relevant dependencies are:

TensorFlow 2.2.0, CUDA 10.2, cuDNN 7.6.1
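
For completeness, a quick sanity check of what TensorFlow actually sees in this environment (just a diagnostic snippet, separate from the training script):

import tensorflow as tf

print(tf.__version__)                                        # expect 2.2.0
print(tf.test.is_built_with_cuda())                          # True if this build links against CUDA
print(tf.config.experimental.list_physical_devices("GPU"))   # the 8 GB GPU should show up here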

Here is the training script:

from tensorflow.keras import callbacks, optimizers
from yolov4.tf import SaveWeightsCallback, YOLOv4
import time

yolo = YOLOv4()
yolo.classes = "/home/user/datasets/YOLO_march3_2lines/data/voc.names"
yolo.input_size = 416
yolo.batch_size = 8

yolo.make_model()
yolo.load_weights(
    "/home/user/Downloads/yolov4.conv.137",
    weights_type="yolo"
    )

train_data_set = yolo.load_dataset(
    "/home/user/datasets/YOLO_march3_2lines/VOCdevkit/VOCPAN/ImageSets/Main/train_yolov4_py.txt",
    dataset_type="yolo",
    image_path_prefix="/home/user/datasets/pyyolo4_train_data",
    label_smoothing=0.05
)

val_data_set = yolo.load_dataset(
    "/home/user/datasets/YOLO_march3_2lines/VOCdevkit/VOCPAN/ImageSets/Main/val_yolov4_py.txt",
    dataset_type="yolo",
    image_path_prefix="/home/user/datasets/pyyolo4_train_data",
    training=False
)

epochs = 10
lr = 1e-4

optimizer = optimizers.Adam(learning_rate=lr)
yolo.compile(optimizer=optimizer, loss_iou_type="ciou")

def lr_scheduler(epoch):
    if epoch < int(epochs * 0.5):
        return lr
    if epoch < int(epochs * 0.8):
        return lr * 0.5
    if epoch < int(epochs * 0.9):
        return lr * 0.1
    return lr * 0.01

_callbacks = [
    callbacks.LearningRateScheduler(lr_scheduler),
    callbacks.TerminateOnNaN(),
    callbacks.TensorBoard(
        log_dir="/home/user/yolov4_text_crops/logs",
    ),
    SaveWeightsCallback(
        yolo=yolo, dir_path="/home/user/yolov4_text_crops/weights",
        weights_type="yolo", epoch_per_save=10
    ),
]

yolo.fit(
    train_data_set,
    epochs=epochs,
    callbacks=_callbacks,
    validation_data=val_data_set,
    validation_steps=50,
    validation_freq=5,
    steps_per_epoch=100,
)
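
For reference, the two settings in this script that most directly drive GPU memory use are yolo.batch_size and yolo.input_size; a sketch of the kind of change I would expect to ease the OOM on an 8 GB card (example values only, everything else unchanged):

yolo.input_size = 320  # multiples of 32 only; a smaller input shrinks activation memory
yolo.batch_size = 2    # smaller batches are the usual first step when hitting OOM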

Am I doing something wrong?

hhk7734 commented 3 years ago

For memory-related issues, I have to test, but I don't have time to spare. Sorry.. :cry: