trzy / FasterRCNN

Clean and readable implementations of Faster R-CNN in PyTorch and TensorFlow 2 with Keras.

Training times too long #1

Closed · mnaranjorion closed this issue 2 years ago

mnaranjorion commented 2 years ago

Hello,

First of all, thanks for the implementation. We are running some tests on a machine with two GeForce RTX 3070 GPUs (7982 MiB each), and we have some doubts, especially about the duration of each epoch during training.

python -m tf2.FasterRCNN --train --dataset-dir=./own_dataset/ --epochs=1 --learning-rate=1e-3 --save-best-to=fasterrcnn_tf2_tmp.h5 --no-augment --cache-images

we see epoch durations of almost 2 hours.

python -m tf2.FasterRCNN --train --dataset-dir=./own_dataset/ --epochs=1 --learning-rate=1e-3 --save-best-to=fasterrcnn_tf2_tmp.h5

we get the same times per epoch.

--debug-dir=/tmp/tf_debugger/

the duration increases to more than 8 hours per epoch.

Are we misconfiguring something, or is this simply due to the dataset we are using? Why do we see no time improvement from disabling data augmentation and enabling image caching?
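One way to narrow this down is to measure where the epoch time actually goes: if data loading dominates, the run is I/O bound and caching should help; if the model step dominates, it is compute bound and caching cannot. A minimal sketch (`load_batch` and `train_step` are hypothetical stand-ins for the real pipeline, not functions from this repo):

```python
import time

def profile_epoch(load_batch, train_step, num_batches):
    """Accumulate time spent loading data vs. running the model step.

    load_batch and train_step are placeholders for the real data
    pipeline and training step (hypothetical names, not this repo's API).
    """
    load_time = 0.0
    step_time = 0.0
    for _ in range(num_batches):
        t0 = time.perf_counter()
        batch = load_batch()          # fetch/prepare one batch
        t1 = time.perf_counter()
        train_step(batch)             # forward/backward pass
        t2 = time.perf_counter()
        load_time += t1 - t0
        step_time += t2 - t1
    return load_time, step_time
```

If `load_time` is small relative to `step_time`, removing augmentation or enabling image caching would not be expected to change the epoch duration much.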

Thank you very much!

trzy commented 2 years ago

This does indeed sound too long! I recall a similar (but not identical) issue long ago with my VGG-16 repo on Windows: the initial epoch was fast, but subsequent epochs became unusably slow. This sounds different.

A few questions:

  1. What OS?
  2. What version of TensorFlow? Can you do a pip freeze and post the results here?
  3. Do you have access to any other systems with a 30-series GPU that you can test on?
  4. How fast does the PyTorch version run?

I wonder whether this could be an issue with the version of TF you're using, or with CUDA. However, I just tried it in a new Conda environment using the latest version of TensorFlow (a fresh pip install -r requirements.txt) on my 3090 under Windows, and 2 epochs plus the final validation pass took 40 minutes: roughly 16 minutes for the first epoch, 8 for the second, 13 for the validation pass, and a few minutes at startup to parse the dataset.

If there is no obvious solution, I recommend filing a bug with TensorFlow. Unfortunately, the last time I did this, the bug was closed after a year, and by then some TF update had already fixed it.

mnaranjorion commented 2 years ago

Answering your questions:

Thank you very much!

trzy commented 2 years ago

That sounds very odd. I guess TF2 and PyTorch can be ruled out as the issue, leaving CUDA or something else to blame. I will give it a try in Ubuntu tonight (need to reboot into it when I'm done working).

Question: Are the repo and data on a network drive rather than a local drive? It seems like you might be I/O bound. Make absolutely sure that you are doing this on the local disk (in my case, I'm running on an SSD installed in an M.2 slot). For example, if you are in an academic or professional environment and are doing this in a home directory (e.g., /home/your_username) on both machines, and that directory happens to be served remotely, that could be the problem. Perhaps try making a subdirectory in /tmp (e.g., /tmp/fasterrcnn) and putting the repo and all data there.
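A quick way to verify this on Linux is to look up which filesystem actually backs the dataset directory; an `nfs`, `nfs4`, or `cifs` type means a network mount. A minimal sketch (the function name is mine, and it reads /proc/mounts, so it is Linux-only):

```python
import os

def filesystem_of(path):
    """Return (mount_point, fs_type) for the mount containing path.

    Linux-only sketch: scans /proc/mounts and picks the longest
    mount point that is a prefix of the resolved path.
    """
    path = os.path.realpath(path)
    best = ("/", "unknown")
    with open("/proc/mounts") as f:
        for line in f:
            fields = line.split()
            if len(fields) < 3:
                continue
            mount_point, fs_type = fields[1], fields[2]
            if path == mount_point or path.startswith(mount_point.rstrip("/") + "/"):
                # Keep the most specific (longest) matching mount point.
                if len(mount_point) >= len(best[0]):
                    best = (mount_point, fs_type)
    return best
```

For example, `filesystem_of("./own_dataset")` returning a type like `nfs4` or `cifs` would confirm the data is being read over the network, while `ext4` or `xfs` indicates a local disk.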

mnaranjorion commented 2 years ago

Hello,

sorry for taking so long to reply, but I've been away.

At first I thought it might be that too, as the data lived on a NAS and was accessed via a shared volume from the server where the code and GPUs reside. We copied all the data onto that server's SSD and the times still seem to be the same. In any case, I have to check again to make sure everything is being loaded correctly.

Have you been able to run the test on Ubuntu?

Thank you

trzy commented 2 years ago

I'm running CUDA 11.3 on Ubuntu and it still works. I did a fresh install of a PyTorch environment, and I also tried my old TensorFlow environment, which uses a Docker image for CUDA for some reason (I haven't dared to upgrade it since the beginning of this year).

When you run the TF2 version, is the GPU actually being used? You should see the following output at the beginning:

CUDA Available : yes
GPU Available  : yes
Eager Execution: yes

If "GPU Available" is "no", then you could conceivably see hours-long epoch times. PyTorch is the easiest to get running with GPU support: make sure to follow the last setup step and use the web site to obtain the exact package list to install. TensorFlow is trickier. I recall having to use an Nvidia Docker image for CUDA support; otherwise, if I run outside of that container, TF2 thinks CUDA is available but has no GPU access.
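The same check can be run outside the training script. A hedged sketch that mirrors the banner above (the function name is mine, not part of this repo, and it degrades gracefully if TensorFlow is not installed):

```python
def check_gpu():
    """Report whether TF2 was built with CUDA and can see a GPU.

    A rough stand-in for the startup banner printed by the training
    script; returns a plain dict so it works even without a GPU.
    """
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow not installed in this environment.
        return {"tensorflow": False, "cuda_built": False, "gpus": []}
    return {
        "tensorflow": True,
        "cuda_built": tf.test.is_built_with_cuda(),
        "gpus": [d.name for d in tf.config.list_physical_devices("GPU")],
    }

if __name__ == "__main__":
    info = check_gpu()
    print("CUDA Available :", "yes" if info["cuda_built"] else "no")
    print("GPU Available  :", "yes" if info["gpus"] else "no")
```

An empty `gpus` list with `cuda_built` true matches the failure mode described above: TF2 thinks CUDA is available but has no GPU access.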

So I think the potential culprits are:

These days I work in Windows, but hopefully we can get this issue resolved, because I know most people still torture themselves with Linux ;)

trzy commented 2 years ago

Closed due to inactivity.