hunglc007 / tensorflow-yolov4-tflite

YOLOv4, YOLOv4-tiny, YOLOv3, YOLOv3-tiny Implemented in Tensorflow 2.0, Android. Convert YOLO v4 .weights to tensorflow, tensorrt and tflite
https://github.com/hunglc007/tensorflow-yolov4-tflite
MIT License

Is there any way to improve the YOLOv4 training speed? #361

Closed · hsji0 closed 3 years ago

hsji0 commented 3 years ago

I've implemented a custom YOLOv4 model, and it trained well. The problem is that training is much slower than with darknet YOLOv4. I figured out that this comes from feeding data to the model through numpy (new augmented data is generated on every pass), so I modified the data pipeline to use tf.data.Dataset as below, replacing the numpy input. (With this change, training speed is similar to darknet YOLO.)
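For context, my original numpy feeding worked roughly like this (a simplified sketch, not my actual code; the helper names `load_image` and `augment` are illustrative only):

```python
import numpy as np

# Illustrative stand-ins for the real loading/augmentation helpers.
def load_image(path):
    return np.zeros((416, 416, 3), np.float32)

def augment(img):
    return img + np.random.uniform(-0.1, 0.1, img.shape).astype(np.float32)

def numpy_batch_generator(image_paths, batch_size):
    # Everything runs in the Python interpreter: each batch is loaded and
    # re-augmented on the fly, so the GPU sits idle while numpy works.
    while True:
        order = np.random.permutation(len(image_paths))
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            yield np.stack([augment(load_image(image_paths[i])) for i in idx])
```

The tf.data version that replaces it: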

```python
def get_dataset(self):
    img_dataset = tf.data.Dataset.from_tensor_slices(self._train_images)
    img_dataset = img_dataset.map(map_func=self._load_images, num_parallel_calls=AUTOTUNE)
    img_dataset = img_dataset.map(map_func=self._preprocess_images, num_parallel_calls=AUTOTUNE)
    lbl_dataset = tf.data.Dataset.from_tensor_slices(self._train_labels)
    lbl_dataset = lbl_dataset.map(map_func=self._load_labels, num_parallel_calls=AUTOTUNE)

    dataset = tf.data.Dataset.zip((img_dataset, lbl_dataset))

    # dataset = dataset.map(
    #     lambda img_dataset, lbl_dataset: tf.py_function(self.augmentation, [img_dataset, lbl_dataset], [tf.float32, tf.float32])
    # )

    # if self.data_aug:
    #     dataset = dataset.map(
    #         lambda img_dataset, lbl_dataset: self.augmentation(img_dataset, lbl_dataset), num_parallel_calls=AUTOTUNE)

    # Build the YOLO training targets. tf.py_function runs the Python code
    # eagerly, so the returned tensors lose their static shapes.
    dataset = dataset.map(
        lambda img_dataset, lbl_dataset: tf.py_function(
            self.preprocess_true_boxes, [img_dataset, lbl_dataset],
            [tf.float32, tf.float32, tf.float32]),
        num_parallel_calls=AUTOTUNE)
    # dataset = dataset.map(
    #     lambda img_dataset, lbl_dataset: tf.numpy_function(self.preprocess_true_boxes, [img_dataset, lbl_dataset],
    #                                                        [tf.float32, tf.float32, tf.float32]),
    #     num_parallel_calls=AUTOTUNE)

    # Restore the static shapes dropped by tf.py_function.
    dataset = dataset.map(map_func=self._adjust_shape, num_parallel_calls=AUTOTUNE)

    # cache("") caches in memory: everything above runs only on the first pass.
    dataset = dataset.cache("")
    dataset = dataset.shuffle(5000, reshuffle_each_iteration=True)
    dataset = dataset.repeat()
    dataset = dataset.batch(self.batch_size).prefetch(AUTOTUNE)
    return dataset

def __iter__(self):
    return self
```
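`_adjust_shape` isn't shown above; it exists because `tf.py_function` returns tensors with unknown static shapes, which breaks downstream steps that need them. A minimal sketch of what such a helper does, with assumed sizes (416×416 input, two COCO-style 80-class heads), not the actual implementation:

```python
import tensorflow as tf

def adjust_shape(image, target_a, target_b):
    # tf.py_function erases static shape information, so restore it
    # explicitly. All sizes here are assumptions for illustration.
    image.set_shape([416, 416, 3])
    target_a.set_shape([52, 52, 3, 85])   # e.g. small-scale YOLO head
    target_b.set_shape([26, 26, 3, 85])   # e.g. medium-scale YOLO head
    return image, target_a, target_b
```

Note also that because the dataset ends with `repeat()` it is infinite, so when it is consumed through `model.fit`, `steps_per_epoch` has to be passed explicitly.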

But the result is weird. (With the original code, the result was good.)

What I checked:

Could you tell me what the possible problem might be?
Or is there any way to improve the training speed using the original code? (I compared it with darknet, and darknet's training time was much faster.)