Stuck training at Epoch 1/1 Mask RCNN

mhasnat commented 6 years ago

Hi,

I would like to train Mask RCNN for a single object detection and segmentation. I followed the approach described in the train_shapes.ipynb file. Besides, I have verified the prepared data (bounding box and mask) using the inspect_data.ipynb file. However, when I run training with my data (using train_shapes.ipynb file) it does not progress and only displays: Epoch 1/1

At this point, it is difficult to know what went wrong as it does not show any other error message. Therefore, I would like to ask for help to resolve this situation ...

Thanks.

tonyzhao6 commented 6 years ago

Did you change the number of epochs to train for in the call to model.train()?

mhasnat commented 6 years ago

No, I did not change the number of epochs. I wanted to simply follow the training strategy of the shapes in the train_shapes.ipynb file.

Thanks.

tonyzhao6 commented 6 years ago

Are you saying that the code hangs after printing out the line "Epoch 1/1"?

mhasnat commented 6 years ago

Yes, it just displays: Epoch 1/1

There is no other message from which it could be possible to understand what happened!

tonyzhao6 commented 6 years ago

Yea, that's kinda strange. Can you post the outputs from each cell in the Jupyter Notebook up to and including the call to model.train()?

mhasnat commented 6 years ago

Please check the end of page 10 of this attached file. train_detection_2.pdf

tonyzhao6 commented 6 years ago

So it seems like each of your images will only have at most one ground-truth object---is this the case?

If so, you are still requiring 200 detection targets (config.TRAIN_ROIS_PER_IMAGE) per image which might be a lot considering that there's only one ground-truth instance. Try lowering from 200 to 20 and see if the training progresses. Also, check what is the ratio of positive to negative ROIs your network is generating.
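To check the positive-to-negative ROI ratio suggested above, a minimal sketch could look like this. It assumes you already have the per-ROI target class IDs for one image as a list of integers, and that class ID 0 marks a background (negative) ROI. The function name `roi_ratio` is hypothetical and not part of Mask_RCNN.

```python
def roi_ratio(class_ids):
    """Return (positives, negatives, positive_fraction) for one image's target ROIs."""
    positives = sum(1 for c in class_ids if c > 0)   # ROIs assigned to an object class
    negatives = sum(1 for c in class_ids if c == 0)  # ROIs assigned to background
    total = positives + negatives
    fraction = positives / total if total else 0.0
    return positives, negatives, fraction


# Hypothetical example: 200 target ROIs, of which only 6 overlap the single object.
example_ids = [1] * 6 + [0] * 194
pos, neg, frac = roi_ratio(example_ids)
print(f"positive={pos} negative={neg} positive fraction={frac:.2f}")
```

If the positive fraction is very small with TRAIN_ROIS_PER_IMAGE = 200, that would be consistent with the suggestion to lower it.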

sulaimanvesal commented 6 years ago

I have exactly the same problem. I even tried with just the shapes example from the code itself, but after showing Epoch 1/1 it gets stuck with no RAM or GPU usage, although the code is still running. I waited almost 3 hours and the program kept running without any output.

sulaimanvesal commented 6 years ago

SOLVED: My Keras version was 2.0.0. After upgrading it to 2.1.0, the code works fine and training starts.

mhasnat commented 6 years ago

Thank you for your valuable responses. Unfortunately, none of the above solutions worked for me.

moinnadeem commented 6 years ago

I'm having a similar bug.

moinnadeem commented 6 years ago

I removed the line "use_multiprocessing=True" from model.py, and it seems to have resolved it. @waleedka, are you able to provide any input as to why this is happening, and the ramifications of removing that line? Thanks!

moinnadeem commented 6 years ago

@waleedka I'm having different experiences on different machines, please advise. I'm unsure what's going on with this bug.

mluerig commented 6 years ago

having similar issues, stuck on epoch 1/1, Jupyter kernel keeps dying after ~ 5 seconds.

that is - using the training dataset provided with this repo

Ubuntu 16.04
python 3.6.3
keras 2.1.5 py36_0
tensorflow 1.6.0 0
tensorflow-base 1.6.0 py36hff88cb2_0

MagnIeeT commented 6 years ago

I am also having the same issue, using the train_shapes notebook. After loading the weights (code cell 8), it consumes almost all of the GPU memory. Then, when I try to train the model, the kernel keeps dying.

Please help me in resolving the issue.

yaojialuo commented 6 years ago

Try setting workers=1, use_multiprocessing=False in model.py:

self.keras_model.fit_generator(
    train_generator,
    initial_epoch=self.epoch,
    epochs=epochs,
    steps_per_epoch=self.config.STEPS_PER_EPOCH,
    callbacks=callbacks,
    validation_data=val_generator,
    validation_steps=self.config.VALIDATION_STEPS,
    max_queue_size=100,
    workers=1,
    use_multiprocessing=False,
)

This seems to work.

MagnIeeT commented 6 years ago

@yaojialuo It didn't work for me after setting use_multiprocessing=False. It works with tensorflow 1.3, but not with tensorflow 1.7.

YubinXie commented 6 years ago

After upgrading tensorflow to 1.7, I am having the same issue. I set gpu=1, but once the model is loaded it uses both GPUs and fills all of their memory with some 'trash'. It keeps outputting:

2018-05-03 21:06:27.171155: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 12.07M (12654848 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
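If TensorFlow reserves memory on every GPU even though only one is wanted, a common workaround is to hide the other devices from CUDA before TensorFlow is imported. This is only a sketch and not something this thread confirms as a fix for the error above; the device index "0" is an assumption about which GPU you want to use.

```python
import os

# This must run before `import tensorflow` or `import keras`, because CUDA
# reads the variable when the process first initializes the driver.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # assumption: expose only the first GPU

print("Visible GPUs:", os.environ["CUDA_VISIBLE_DEVICES"])
```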

HanYuanyuaner commented 6 years ago

@YubinXie I have the same problem, did you fix it?

MagnIeeT commented 6 years ago

@HanYuanyuaner try with tensorflow =1.3.

HanYuanyuaner commented 6 years ago

@MagnIeeT I tried uninstalling tensorflow-gpu 1.6 and installing 1.3, but tensorflow does not work.

MagnIeeT commented 6 years ago

I am using tensorflow-gpu =1.3, keras=2.1.5. It is working fine.

HanYuanyuaner commented 6 years ago

@MagnIeeT What are your CUDA and cuDNN versions?

MagnIeeT commented 6 years ago

@HanYuanyuaner cuda version 8.0 and cudnn 6

paulo-chagas commented 6 years ago

I'm having a similar issue. I can only run 128x128 images. When I load the weights, the memory is almost fully allocated. My GPU is a GTX 1060 (6GB). Do you think this is normal given my memory size, or should I be able to load the model and larger images?

YubinXie commented 6 years ago

@paulo-chagas My 8GB GPU only accepts two 128*128 images at a time.

Suvi-dha commented 6 years ago

I was facing a similar issue. I set workers=1 and use_multiprocessing=False in model.py and updated the config class in my training code to GPU=3. My code now uses all GPUs without any problem. You may also need to add these 3 lines at the top, after importing the libraries:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

ashuta03 commented 6 years ago

For the problem of training stuck in Epoch 1/1, this worked for me: https://github.com/keras-team/keras/issues/8595#issuecomment-416111215

zkailinzhang commented 5 years ago

Same error. My tensorflow version is v1.12 and keras is 2.2.4. Which keras version should I switch to in order to use multiprocessing?

pluviosilla commented 5 years ago

I have a similar problem when I pass a class to fit_generator(). When I use a regular generator I get a different set of problems. Has anyone gotten fit_generator() to work?

Post I made on StackOverflow about this problem.

chientranse commented 5 years ago

First of all, create a new Python environment and install the requirements. After that, edit model.py to set use_multiprocessing=False and workers=1, as in the comments above. Then install Mask-RCNN by running python setup.py install in the Mask-RCNN directory. Finally, in your code, add the path of the local version of the Mask-RCNN library:

sys.path.append(r"/home/chientm/loads/Mask_RCNN")  # To find local version of the library

Make sure STEPS_PER_EPOCH and VALIDATION_STEPS are correct, IMAGE_MAX_DIM is not too large, IMAGES_PER_GPU is small enough, and ANCHOR_SIZES are not too large. Any incompatible hyperparameter will leave your training stuck at Epoch 1/1 forever, and Mask-RCNN won't tell you that anything went wrong, because Keras handles this so poorly!

Mandar-Patil-651 commented 5 years ago

Consider using Google Colab. With its TPU, you won't need to set use_multiprocessing=False or workers to 1. It allots 12.72GB of RAM. I'm currently training on a 6-class dataset with a max image size of 1024*1024. The RAM in use is about 1.55GB, so it works pretty well!

abdou31 commented 5 years ago

Try setting workers=1, use_multiprocessing=False in model.py:

self.keras_model.fit_generator(
    train_generator,
    initial_epoch=self.epoch,
    epochs=epochs,
    steps_per_epoch=self.config.STEPS_PER_EPOCH,
    callbacks=callbacks,
    validation_data=val_generator,
    validation_steps=self.config.VALIDATION_STEPS,
    max_queue_size=100,
    workers=1,
    use_multiprocessing=False,
)

This seems to work.

I tried this, but I still get a warning about use_multiprocessing=True :(

b1xian commented 5 years ago

[cuda 9.0, cudnn 7.0.5, tensorflow-gpu 1.6.0, keras 2.2.4] I fixed this by adding the following code to model.py:

def get_session():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)

keras.backend.tensorflow_backend.set_session(get_session())

2696120622 commented 5 years ago

@Mandar-Patil-651 How to change the code for training with Google colab TPU? Thanks!

swapbagal1 commented 5 years ago

@YubinXie @abdou31 Change the 'layers' parameter while training. Previously I was using layers='3+'.

1) model.py: use_multiprocessing=False, max_queue_size=100

2) model.py: workers=1 instead of workers = multiprocessing.cpu_count()

3) config.py: Reduced TRAIN_ROIS_PER_IMAGE = 50 from 200

4) custom.py: model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE, epochs=2, layers='heads')

It works fine.
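The config.py change in step 3 can also be made without editing the library, by overriding the value in your own config subclass, in the same way train_shapes.ipynb defines its ShapesConfig. The sketch below is an illustration only: the class name, NAME, and NUM_CLASSES are assumptions for a single-object dataset, and the stand-in Config exists only so the snippet runs when the mrcnn package is not installed.

```python
try:
    from mrcnn.config import Config  # matterport Mask_RCNN, if it is installed
except ImportError:
    class Config:
        """Stand-in base class (not the real one) so this sketch runs anywhere."""
        NAME = None
        GPU_COUNT = 1
        IMAGES_PER_GPU = 2
        TRAIN_ROIS_PER_IMAGE = 200


class SingleObjectConfig(Config):
    """Hypothetical training config for a dataset with one object class."""
    NAME = "single_object"       # assumption: any descriptive name
    NUM_CLASSES = 1 + 1          # assumption: background + one object class
    GPU_COUNT = 1
    IMAGES_PER_GPU = 1
    TRAIN_ROIS_PER_IMAGE = 50    # reduced from 200, as in step 3 above


config = SingleObjectConfig()
print("TRAIN_ROIS_PER_IMAGE:", config.TRAIN_ROIS_PER_IMAGE)
print("IMAGES_PER_GPU:", config.IMAGES_PER_GPU)
```

Keeping the override in your own class means the setting survives a reinstall or upgrade of the library.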

fierval commented 4 years ago

Windows: tensorflow-gpu==1.6, keras==2.1.0, + in model.py: multiprocessing=False, workers=1 works fine.

osvadimos commented 4 years ago

In my case it was simply a problem with the data. I had written a parser for labelme JSON files and missed a detail, so the data was not good enough to begin training (#1873). So make sure you run your data inspection beforehand.
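Since bad data can leave training hanging silently, one way to run such a check is to pull a single batch with a timeout before calling model.train(). This is a minimal sketch using only the standard library; `first_batch_or_timeout` is a hypothetical helper, and passing it the generator you train with is an assumption about your setup.

```python
import queue
import threading


def first_batch_or_timeout(generator, timeout_s=60.0):
    """Try to pull one item from `generator`.

    Returns ("ok", item), ("error", exception) or ("timeout", None).
    """
    result = queue.Queue(maxsize=1)

    def worker():
        try:
            result.put(("ok", next(generator)))
        except Exception as exc:  # also catches StopIteration from an empty generator
            result.put(("error", exc))

    # Daemon thread: a generator that never returns will not block interpreter exit.
    threading.Thread(target=worker, daemon=True).start()
    try:
        return result.get(timeout=timeout_s)
    except queue.Empty:
        return ("timeout", None)


def toy_generator():
    """Stand-in for a real data generator, used only to demonstrate the helper."""
    while True:
        yield {"images": [0, 1], "masks": [0, 1]}


status, batch = first_batch_or_timeout(toy_generator(), timeout_s=5.0)
print(status)
```

A "timeout" or "error" status here points at the data pipeline rather than at Keras or the GPU.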

EricEntrup commented 4 years ago

Same problem:

Ubuntu: 18.04
Keras: 2.3.1
Tensorflow: 2.0.0
GPU: NVIDIA RTX 2070
CUDA Version: 10.1

I am using fit_generator(). This worked fine for me in the past, but I recently updated to the current versions of tensorflow and keras and now nothing works. It just hangs at Epoch 1/1.

qysnn commented 4 years ago

Maybe this will help: https://stackoverflow.com/questions/48038417/keras-stops-working-on-first-epoch

Set verbose=1 to get a progress bar.

tomgross commented 4 years ago

I have seen a similar problem when porting Mask RCNN to tensorflow 2.0. It actually happened when switching from standalone keras to tf.keras. If you use tensorflow 2.0, you might want to try #1896 and see if it works for you. At the moment, neither the latest released version (2.1) nor master of matterport on GitHub supports tensorflow 2.0 onward!

Gabrielsonnn commented 4 years ago

Downgrading my Keras from 2.2.4 to 2.1 fixed the issue for me.

NiksanJP commented 4 years ago

Try setting workers=1, use_multiprocessing=False in model.py:

self.keras_model.fit_generator(
    train_generator,
    initial_epoch=self.epoch,
    epochs=epochs,
    steps_per_epoch=self.config.STEPS_PER_EPOCH,
    callbacks=callbacks,
    validation_data=val_generator,
    validation_steps=self.config.VALIDATION_STEPS,
    max_queue_size=100,
    workers=1,
    use_multiprocessing=False,
)

This seems to work.

Yes but we are trying to train it on multiple GPUs

sqiprasanna commented 4 years ago

Downgrading Keras to 2.1 fixed the issue for me. I also used use_multiprocessing=True and workers=1, and it still works for me.

init-22 commented 4 years ago

Having the same issue. I downgraded keras to 2.1 and set workers=1... what could be the problem?

Rashmi-AnonymousNot commented 4 years ago

Downgrading Keras to 2.1 fixed the issue for me. I also used use_multiprocessing=True and workers=1, and it still works for me.

Thank you!!!! It's working for me too after installing keras 2.1

FerranRebollar commented 4 years ago

This is the configuration that is working for me (after trying a lot of combinations). Hope it helps:

Windows 10
Keras: 2.2.0
tensorflow: 1.13.1
tensorflow-gpu: 1.13.1
CUDA Version: 10.0
cuDNN v7.6.5
NVIDIA drivers v441.22
GPU: RTX 2070 (notebook)

In model.py/self.keras_model.fit_generator: workers=1, use_multiprocessing=False

GPU_COUNT = 1
IMAGES_PER_GPU = 1

parth-singh71 commented 4 years ago

I had the same issue, I downgraded to:

tensorflow=1.5.0
keras=2.1.5

And now everything is working fine. Note: use a GPU; it may take a long time to run on a CPU.
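Because most fixes in this thread come down to a specific combination of versions, it can help to print what is actually installed in the environment the notebook is using. This is a small sketch using only the standard library (Python 3.8 or newer); the package names listed are assumptions about how the libraries were installed, and `installed_versions` is a hypothetical helper.

```python
from importlib import metadata


def installed_versions(packages=("tensorflow", "tensorflow-gpu", "keras")):
    """Map each package name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

Running this inside the notebook kernel, rather than in a separate shell, avoids checking the wrong environment.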

nandita96 commented 4 years ago

Hello everyone, can anyone help me with this? I set steps per epoch = 200, which I understand to mean 200 steps to process my whole dataset per iteration, with epochs = 8 and a training dataset of 1000 images.

My question is: how many images from the training dataset does it take to train 1 epoch? @parth-singh71 @pluviosilla @tomgross @FerranRebollar @Gabrielsonnn @moinnadeem
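As a rough sketch of the arithmetic behind this question: in Mask_RCNN each training step consumes one batch, and the batch size is IMAGES_PER_GPU * GPU_COUNT, so one epoch sees STEPS_PER_EPOCH * BATCH_SIZE images regardless of the dataset size. The values below assume the library defaults of IMAGES_PER_GPU = 2 and GPU_COUNT = 1, which the question does not state; the function name is hypothetical.

```python
def images_per_epoch(steps_per_epoch, images_per_gpu, gpu_count):
    """Number of images consumed in one epoch (one batch per step)."""
    batch_size = images_per_gpu * gpu_count
    return steps_per_epoch * batch_size


dataset_size = 1000                              # from the question
seen = images_per_epoch(steps_per_epoch=200,     # from the question
                        images_per_gpu=2,        # assumption: library default
                        gpu_count=1)             # assumption: library default

print(seen)                 # 400 images per epoch
print(seen / dataset_size)  # 0.4 of the dataset per epoch
```

Under these assumptions, setting STEPS_PER_EPOCH to 500 would make one epoch cover roughly the whole 1000-image dataset.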

manasrda commented 3 years ago

Had a very similar issue; this is what worked for me:

1) Make sure IMAGES_PER_GPU = 1 instead of IMAGES_PER_GPU = 2.

2) In mrcnn/model.py, I changed line 2362 to workers = 1 and line 2374 to use_multiprocessing=False,

matterport / Mask_RCNN

Stuck training at Epoch 1/1 Mask RCNN #287