hhk7734 / tensorflow-yolov4

YOLOv4 Implemented in Tensorflow 2.
MIT License
136 stars, 75 forks

segmentation fault at training if classes too few #81

Open tino926 opened 3 years ago

tino926 commented 3 years ago

Hi, I followed https://wiki.loliot.net/docs/lang/python/libraries/yolov4/python-yolov4-edge-tpu/ to train a model with only one class.

If I use the original yolov4-tiny.cfg, the training works normally.

However, if I set classes to a value below 59 in the .cfg file, I get this error:

2021-04-20 14:54:05.484701: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcupti.so.10.1
2021-04-20 14:54:05.601503: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1513] CUPTI activity buffer flushed
Segmentation fault (core dumped)
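
The log above stops at the native crash and gives no Python-level location. As a hedged debugging aid (not something the original post used), Python's standard `faulthandler` module can print the Python traceback of each thread when the process receives a fatal signal such as SIGSEGV. A minimal sketch, assuming it is placed at the very top of the training script:

```python
import faulthandler

# Ask the interpreter to dump the Python traceback of every thread to
# stderr when a fatal signal (SIGSEGV, SIGABRT, ...) is received.
faulthandler.enable()

# Sanity check that the handler is installed before training starts.
print(faulthandler.is_enabled())
```

The same effect is available without editing the script by running the interpreter with `python -X faulthandler` followed by the script name. The traceback only shows which Python call was active at the time of the crash; the actual fault may still lie in native code.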

Maybe this is a bug? This is the training script I used:

from tensorflow.keras import callbacks

from yolov4.tf import YOLOv4, YOLODataset, SaveWeightsCallback

import os

yolo = YOLOv4()

yolo.config.parse_names("high_top_4.names")
yolo.config.parse_cfg("yolov4-tiny-relu_1.cfg")

yolo.make_model()
# yolo.load_weights(
#     "yolov4-tiny.conv.29",
#     weights_type="yolo",
# )
yolo.summary(summary_type="yolo")

# for i in range(29):
#     yolo.model.get_layer(index=i).trainable = False

yolo.summary()

train_dataset = YOLODataset(
    config=yolo.config,
    dataset_list="train_high_top_4.txt",
    image_path_prefix="./",
    training=True,
)

val_dataset = YOLODataset(
    config=yolo.config,
    dataset_list="val_high_top_4.txt",
    image_path_prefix="./",
    training=False,
)

yolo.compile()

_callbacks = [
    callbacks.TerminateOnNaN(),
    callbacks.TensorBoard(
        log_dir="./logs",
        update_freq=200,
        histogram_freq=1,
    ),
    SaveWeightsCallback(
        yolo=yolo,
        dir_path="./trained",
        weights_type="yolo",
        step_per_save=2000,
    ),
]

yolo.fit(
    train_dataset,
    callbacks=_callbacks,
    validation_data=val_dataset,
    verbose=3,  # 3: print step info
)

The "yolov4-tiny-relu_1.cfg" file contains:

[net]
batch=32
width=416
height=416
channels=3

learning_rate=0.00261
burn_in=1000

max_batches=240000
policy=steps
steps=192000,216000
scales=.1,.1

mosaic=1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=2
pad=1
activation=relu

[convolutional]
batch_normalize=1
filters=64
size=3
stride=2
pad=1
activation=relu

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=relu

[route]
layers=-1
groups=2
group_id=1

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=relu

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=relu

[route]
layers=-1,-2

[convolutional]
batch_normalize=1
filters=64
size=1
stride=1
pad=1
activation=relu

[route]
layers=-6,-1

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=relu

[route]
layers=-1
groups=2
group_id=1

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=relu

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=relu

[route]
layers=-1,-2

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=relu

[route]
layers=-6,-1

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=relu

[route]
layers=-1
groups=2
group_id=1

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=relu

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=relu

[route]
layers=-1,-2

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=relu

[route]
layers=-6,-1

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=relu

##################################

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=relu

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=relu

[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear

[yolo]
mask=3,4,5
anchors=10,14, 23,27, 37,58, 81,82, 135,169, 344,319
num=6
scale_x_y=1.05
classes=1
iou_thresh=0.213
iou_loss=ciou
iou_normalizer=0.7
obj_normalizer=1.0
label_smooth_eps=0.01
cls_normalizer=1.0
nms_kind=greedynms
beta_nms=0.6

[route]
layers=-4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=relu

[upsample]
stride=2

[route]
layers=-1, 23

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=relu

[convolutional]
size=1
stride=1
pad=1
filters=255
activation=linear

[yolo]
mask=1,2,3
anchors=10,14, 23,27, 37,58, 81,82, 135,169, 344,319
num=6
scale_x_y=1.05
classes=1
iou_thresh=0.213
iou_loss=ciou
iou_normalizer=0.7
obj_normalizer=1.0
label_smooth_eps=0.01
cls_normalizer=1.0
nms_kind=greedynms
beta_nms=0.6
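
One thing that may be worth checking in the cfg above: in darknet's convention, the `[convolutional]` layer directly before each `[yolo]` layer is expected to have `filters = (classes + 5) * <number of mask entries>`. With `classes=1` and three masks that gives 18, while the cfg still has `filters=255` (the value for 80 classes). Whether this mismatch is what triggers the segmentation fault in this library is an assumption, not something confirmed in this thread. The sketch below uses hypothetical helper names (`parse_cfg_sections`, `check_yolo_filters`) that are not part of the `yolov4` package; it only reads the cfg text and reports mismatches.

```python
def parse_cfg_sections(text):
    """Split darknet-style cfg text into a list of (section_name, options)."""
    sections = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1], {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
    return sections


def check_yolo_filters(sections):
    """Return (yolo section index, actual filters, expected filters) per mismatch."""
    mismatches = []
    for i, (name, opts) in enumerate(sections):
        if name != "yolo" or i == 0:
            continue
        n_masks = len(opts["mask"].split(","))
        expected = (int(opts["classes"]) + 5) * n_masks
        prev_name, prev_opts = sections[i - 1]
        actual = int(prev_opts.get("filters", -1))
        if prev_name != "convolutional" or actual != expected:
            mismatches.append((i, actual, expected))
    return mismatches


# Shortened excerpt of the first detection head from the cfg above.
sample = """
[convolutional]
size=1
filters=255
activation=linear

[yolo]
mask=3,4,5
classes=1
"""

print(check_yolo_filters(parse_cfg_sections(sample)))  # [(1, 255, 18)]
```

To run it on the full file, read "yolov4-tiny-relu_1.cfg" into a string and pass it to `parse_cfg_sections`; an empty list means every detection head matches the convention.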

hitch22 commented 3 years ago

I think you will have better luck with darknet for the training part. I had issues with training being inaccurate, but using darknet directly for that part resolved them.

hhk7734 commented 3 years ago

v3 training is not yet supported. I agree with @hitch22. Train with darknet.

I've seen a lot of code implemented with tensorflow and pytorch, but I haven't yet seen a library that implements the training part of darknet well.