matterport / Mask_RCNN

Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow

Training is stuck at the beginning of Epoch #1823

Closed suchiz closed 5 years ago

suchiz commented 5 years ago

Hello everyone, I'm opening this issue because I did not see a similar one yet. I did see one that was solved by modifying the dataset, but I checked and that's not my case.

When I'm running the training, it gets stuck here:

. . .
mrcnn_bbox_fc (TimeDistributed)
mrcnn_mask_deconv (TimeDistributed)
mrcnn_class_logits (TimeDistributed)
mrcnn_mask (TimeDistributed)
. . .
Epoch 1/5

I used the balloon example code and adapted it to my own dataset. As I said, I checked the generated labels and masks and they are all good, so I don't think the issue comes from the dataset.

NUM_CLASSES = 2 (mine + background)
GPU_COUNT = 1
IMAGES_PER_GPU = 1
(As I'm training on the CPU, I also set use_multiprocessing to False.)
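
For reference, my config overrides look roughly like this (a minimal sketch; the class and dataset names are just examples, and the use_multiprocessing change is made in the fit_generator call inside mrcnn/model.py rather than here):

import the Config class from the repo and override the fields mentioned above
from mrcnn.config import Config

class MyDatasetConfig(Config):
    NAME = "my_dataset"   # example name
    NUM_CLASSES = 1 + 1   # background + my single class
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1    # effective batch size of 1 for CPU training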

TensorFlow 1.13, Keras 2.0.8, Windows 10 64-bit, Python 3.6.8, CPU: i7-6700

The console gives no other warning or hint... it's just stuck, and my CPU isn't under load, so it's probably not training. Let me know if you need any additional information, and thanks for helping.

sainatarajan commented 5 years ago

Do you have a GPU?

suchiz commented 5 years ago

@sainatarajan I do have one, but I don't use it. As I said, I'm working on the CPU and set the use_multiprocessing parameter to False. Do you think that is the problem?

sainatarajan commented 5 years ago

It may be the issue: I once tried to run a U-Net model on a CPU and it took forever even to print the statement Epoch 1/xxx, and that was with a 2nd-gen i5. Considering the complexity of this model (Mask R-CNN), running on the CPU might well be the cause. Could you try with the GPU and check, just to find out what the problem is?

suchiz commented 5 years ago

@sainatarajan Actually, I thought it could be that, but I had no evidence to prove it, so I asked. I just saw another topic in this section where the author also said it was because of a weak GPU, and it's working fine on a 1050x now. So okay, I'll try on Monday (it's on my work computer). Thank you very much.

sainatarajan commented 5 years ago

@suchiz You're welcome!

suchiz commented 5 years ago

@sainatarajan Can you tell me approximately how long "forever" was for your U-Net training, please?

sainatarajan commented 5 years ago

@suchiz I ran it for about 20-30 minutes on my home system and then stopped it. But when I ran the same program on my work system (RTX 2080 Ti), it took far less time and executed successfully.

suchiz commented 5 years ago

@sainatarajan Alright, thank you very much, mate. I started it 3 hours ago and there is still no progress. Tomorrow I'll be able to launch it on a GPU (I don't have access today). I hope this is really the solution... but this CNN is crazy deep if that's the real deal.

sainatarajan commented 5 years ago

@suchiz Yes, this is one of the most complex CNN models, and that's also why it performs so well.

suchiz commented 5 years ago

@sainatarajan It was the real deal... I'm really shocked that an i7-6700 cannot get through one epoch in 5-6 hours, whereas the GTX 1080 does one epoch in about 7 minutes on average. Thank you again! Now closing this issue :)

Dayan-Zhanchi commented 5 years ago

@suchiz Did you have to change anything more in the Mask R-CNN code apart from setting IMAGES_PER_GPU = 1 instead of 2? I believe my teammate has exactly the same graphics card with the same memory as yours, but he still gets the same problem as initially presented in this issue: the training gets stuck on the very first epoch and doesn't finish at all (the GPU usage stays pretty much at 0-2%, which may indicate that it's not being used). We've checked that tensorflow-gpu is installed and that TensorFlow is able to use the GPU, and we changed IMAGES_PER_GPU to 1 so that less memory is used, but we still encounter the same problem.

Thanks in advance!
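
For reference, this is roughly how we checked that TensorFlow can see the GPU (a sketch using the TF 1.x API we have installed):

import tensorflow as tf
from tensorflow.python.client import device_lib

# Should print True and list a /device:GPU:0 entry if tensorflow-gpu is set up correctly
print(tf.test.is_gpu_available())
print([d.name for d in device_lib.list_local_devices()])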

init-22 commented 4 years ago

@Dayan-Zhanchi I am having the same problem: training gets stuck right at the start. I am using a Tesla T4 GPU.

sainatarajan commented 4 years ago

@IsaacPatole Check whether adding the following piece of code after the import statements works.

import tensorflow as tf

# Let TensorFlow allocate GPU memory incrementally instead of reserving it all up front
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
tf.compat.v1.keras.backend.set_session(tf.compat.v1.Session(config=config))

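If you are on TF 2.x instead, a rough equivalent of the same memory-growth setting (a sketch, assuming tensorflow>=2.1) would be:

import tensorflow as tf

# Enable memory growth on each visible GPU instead of reserving all memory up front
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)
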
init-22 commented 4 years ago

@sainatarajan Not working!

init-22 commented 4 years ago

@sainatarajan I have just 10 images, TF-GPU 1.15, Keras 2.3 (tried 2.1 too!). I changed IMAGE_MIN_DIM and IMAGE_MAX_DIM to 320 and 448 respectively, and set VALIDATION_STEPS and STEPS_PER_EPOCH to 10.
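
The overrides look roughly like this (a sketch; the class name is just an example, and other settings are omitted):

from mrcnn.config import Config

class TinyDatasetConfig(Config):
    NAME = "tiny_dataset"    # example name
    IMAGE_MIN_DIM = 320
    IMAGE_MAX_DIM = 448      # divisible by 64, which the default "square" resizing expects
    STEPS_PER_EPOCH = 10
    VALIDATION_STEPS = 10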

sainatarajan commented 4 years ago

@IsaacPatole Can you try TF 1.14.0? That was the version I had when I used this repo, along with Keras 2.2.4.

init-22 commented 4 years ago

@sainatarajan Just tried that, it still doesn't help!

Dayan-Zhanchi commented 4 years ago

@IsaacPatole Hi, it was a while ago that we did this, so I don't fully remember the details. But if I recall correctly, it had to do with us using Windows, which led to some multiprocessing issues that kept the GPU from being used at all. I wrote how we solved it here: https://github.com/matterport/Mask_RCNN/issues/1783#issuecomment-549629212

chinya07 commented 1 year ago

Hi all, I am still having this issue (training is stuck at Epoch 1) while training a MobileNet model with TensorFlow 2.9.1 in AWS SageMaker. :(