Since EfficientDet requires TensorFlow > 2.8, we can't train anymore with CUDA

google / automl

Google Brain AutoML

Apache License 2.0


Since EfficientDet requires TensorFlow > 2.8, we can't train anymore with CUDA #1146

Open fitoule opened 2 years ago

fitoule commented 2 years ago

I have only one NVIDIA GPU. I was training with TensorFlow 2.5.2 because of the bug with GPU and multiprocessing.

TF2.8 and No Child Process => works but Memory Leak :(
TF2.8 and Child Process => CUDA error on the first epoch because GPU has been taken by the main process https://github.com/google/automl/issues/855
TF2.5.2 and Child Process => does not work anymore since the determinism fix

It was working with TensorFlow up to 2.5.2, but now EfficientDet requires TF > 2.8, so I am stuck. I think I have to find the code from before the "determinism" change.
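For readers unfamiliar with the "Child Process" setup mentioned above, here is a minimal sketch of the general idea, not the EfficientDet implementation: each epoch runs in a fresh interpreter, so whatever the child allocated on the GPU is released when it exits, and the parent never touches CUDA. `WORKER_CODE` and `run_epochs` are hypothetical names; a real worker would import TensorFlow, restore the latest checkpoint, train one epoch, and save again.

```python
import subprocess
import sys

# Hypothetical per-epoch worker. It only prints a message here; a real worker
# would import TensorFlow inside the child, train one epoch, and checkpoint.
WORKER_CODE = """
import sys
epoch = int(sys.argv[1])
print(f"trained epoch {epoch}")
"""


def run_epochs(num_epochs):
    """Run every epoch in its own interpreter and collect each child's output."""
    outputs = []
    for epoch in range(num_epochs):
        # The parent never imports TensorFlow, so it never claims the GPU.
        result = subprocess.run(
            [sys.executable, "-c", WORKER_CODE, str(epoch)],
            capture_output=True,
            text=True,
            check=True,  # raise if the child exits with an error
        )
        outputs.append(result.stdout.strip())
    return outputs
```

The CUDA error described above is what I would expect if the main process initializes the GPU before the child starts, which is why the parent in this sketch stays free of any TensorFlow import.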

fsx950223 commented 2 years ago

Migrate to tf2
Set num_epochs=1 and num_examples_per_epoch=num_epochs * num_examples
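As I read this suggestion, the idea is to fold the whole run into one long epoch, so that anything happening once per epoch (the child process, the evaluation) happens only once. A small sketch of the arithmetic, assuming the original run used `num_epochs` epochs of `num_examples` examples each; `fold_into_single_epoch` is a hypothetical helper, not part of the repository:

```python
def fold_into_single_epoch(num_epochs, num_examples):
    """Return settings for one long epoch that sees the same number of examples."""
    return {
        "num_epochs": 1,
        # All examples that would have been seen over the original epochs.
        "num_examples_per_epoch": num_epochs * num_examples,
    }


# Example: 300 epochs over a 10,000-image dataset become one epoch of 3,000,000.
settings = fold_into_single_epoch(num_epochs=300, num_examples=10_000)
```

The total number of training steps stays the same; only the epoch boundaries disappear.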

fitoule commented 2 years ago

Do you mean I need to use the code under efficientdet/tf2/train.py, or migrate efficientdet/main.py myself?

Thank you

exx8 commented 2 years ago

@fitoule you mentioned a memory leak. I am facing a memory leak too. Can you give more info?

mateusz-wozny commented 8 months ago

I faced the same problem. I used traineval mode with TensorFlow 2.10 (then 2.13), and in both cases there was a memory leak after the first epoch. Training was fine, but during evaluation CocoCallback probably causes the memory leak. I commented out this line (https://github.com/google/automl/blob/master/efficientdet/tf2/train_lib.py#L220) and everything is fine.
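If you prefer not to edit the library file, the same effect can be sketched by filtering the callback list before training. This is only an illustration with stand-in classes: `Callback`, `CocoCallback`, `CheckpointCallback`, and `drop_callbacks` are hypothetical here, and how you obtain the real callback list from train_lib.py is left as an assumption.

```python
class Callback:
    """Stand-in for a Keras-style callback base class."""


class CocoCallback(Callback):
    """Stand-in for the evaluation callback suspected of leaking memory."""


class CheckpointCallback(Callback):
    """Stand-in for any callback that should be kept."""


def drop_callbacks(callbacks, unwanted_names):
    """Return the callbacks whose class name is not in unwanted_names."""
    return [cb for cb in callbacks if type(cb).__name__ not in unwanted_names]


callbacks = [CheckpointCallback(), CocoCallback()]
kept = drop_callbacks(callbacks, {"CocoCallback"})
kept_names = [type(cb).__name__ for cb in kept]
```

Note that dropping the evaluation callback also drops the COCO metrics during training, so evaluation would have to be run separately.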