Open mhasnat opened 6 years ago
Did you change the number of epochs to train for in the call to model.train()?
No, I did not change the number of epochs. I wanted to simply follow the training strategy of the shapes in the train_shapes.ipynb file.
Thanks.
Are you saying that the code hangs after printing out the line "Epoch 1/1"?
Yes, it just displays:
Epoch 1/1
There is no other message from which it could be possible to understand what happened!
Yea, that's kinda strange. Can you post the outputs from each cell in the Jupyter Notebook up to and including the call to model.train()?
Please check the end of page 10 of this attached file. train_detection_2.pdf
So it seems like each of your images will only have at most one ground-truth object---is this the case?
If so, you are still requiring 200 detection targets (config.TRAIN_ROIS_PER_IMAGE) per image which might be a lot considering that there's only one ground-truth instance. Try lowering from 200 to 20 and see if the training progresses. Also, check what is the ratio of positive to negative ROIs your network is generating.
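To make the suggestion above concrete, here is a rough sketch of the ROI budget arithmetic. It assumes positives are capped at TRAIN_ROIS_PER_IMAGE * ROI_POSITIVE_RATIO and that ROI_POSITIVE_RATIO is 0.33 (which I believe is the default in config.py, but check your copy). The function name roi_budget is made up for illustration, and treating the remaining slots as negatives is a simplification.

```python
def roi_budget(train_rois_per_image, roi_positive_ratio=0.33):
    """Return (max_positive, max_negative) ROI targets per image.

    Assumption: positives are capped at a fixed fraction of the budget
    and the remaining slots are available for negatives.
    """
    max_positive = int(train_rois_per_image * roi_positive_ratio)
    max_negative = train_rois_per_image - max_positive
    return max_positive, max_negative


# Compare the default budget with the reduced one suggested above.
for n in (200, 20):
    pos, neg = roi_budget(n)
    print(f"TRAIN_ROIS_PER_IMAGE={n}: up to {pos} positive, {neg} negative")
```

With a single ground-truth object per image, the network may struggle to find anywhere near 66 positive ROIs, which is why a smaller budget is worth trying.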
I have exactly the same problem. I even tried the shapes example that ships with the code, but after showing Epoch 1/1 it just gets stuck: there is no RAM or GPU usage, yet the code is still running. I waited almost 3 hours and the program kept running without any output.
SOLVED: My Keras version was 2.0.0. After upgrading it to 2.1.0, the code works fine and training starts.
Thank you for your valuable responses. Unfortunately, none of the above solutions worked for me.
I'm having a similar bug.
I removed the line "use_multiprocessing=True" from model.py, and it seems to have resolved it. @waleedka, are you able to provide any input as to why this is happening, and the ramifications of removing that line? Thanks!
@waleedka I'm seeing different behavior on different machines, please advise. I'm unsure what's going on with this bug.
Having similar issues: stuck on Epoch 1/1, and the Jupyter kernel keeps dying after ~5 seconds. That is with the training dataset provided with this repo.
Ubuntu 16.04
python 3.6.3
keras 2.1.5 py36_0
tensorflow 1.6.0 0
tensorflow-base 1.6.0 py36hff88cb2_0
I am also having the same issue, using the train_shapes notebook. After loading the weights (code cell 8), it consumes almost all of the GPU memory. Then, when I try to train the model, the kernel keeps dying.
Please help me in resolving the issue.
Try setting workers=1 and use_multiprocessing=False in the fit_generator call in model.py:
self.keras_model.fit_generator(
    train_generator,
    initial_epoch=self.epoch,
    epochs=epochs,
    steps_per_epoch=self.config.STEPS_PER_EPOCH,
    callbacks=callbacks,
    validation_data=val_generator,
    validation_steps=self.config.VALIDATION_STEPS,
    max_queue_size=100,
    workers=1,
    use_multiprocessing=False,
)
This seems to work.
@yaojialuo It didn't work for me after setting use_multiprocessing=False. It works with tensorflow 1.3, but not with tensorflow 1.7.
After upgrading tensorflow to 1.7, I am having the same issue. I set gpu=1, but once the model is loaded it uses both GPUs and fills all of their memory with some 'trash'. It keeps outputting:
2018-05-03 21:06:27.171155: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 12.07M (12654848 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
@YubinXie I have the same problem, did you fix it?
@HanYuanyuaner try with tensorflow =1.3.
@MagnIeeT I tried uninstalling tensorflow-gpu 1.6 and installing 1.3, but then tensorflow does not work.
I am using tensorflow-gpu =1.3, keras=2.1.5. It is working fine.
@MagnIeeT What are your CUDA and cuDNN versions?
@HanYuanyuaner cuda version 8.0 and cudnn 6
I'm having a similar issue. I can only run 128x128 images. When I load the weights, the memory is almost fully allocated. My GPU is a GTX 1060 (6GB). Do you think this is normal given my memory size, or should I be able to load the model and larger images?
@paulo-chagas My 8GB GPU only accepts 2 128*128 images at one time.
I was facing a similar issue. I set workers=1 and use_multiprocessing=False in model.py and updated the config class in my training code to GPU=3. My code now uses all GPUs without any problem. You may also need to add these lines at the top, after importing the libraries (tf is tensorflow):
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory as needed instead of all up front
session = tf.Session(config=config)
For the problem of training stuck in Epoch 1/1, this worked for me: https://github.com/keras-team/keras/issues/8595#issuecomment-416111215
Same error. My tensorflow version is 1.12 and keras is 2.2.4. Which keras version should I switch to in order to use multiprocessing?
I have a similar problem when I pass a class to fit_generator(). When I use a regular generator I get a different set of problems. Has anyone gotten fit_generator() to work?
Post I made on StackOverflow about this problem.
First of all, create a new Python environment and install the requirements.
After that, edit model.py to set use_multiprocessing=False and workers=1, as in the comments above.
Then install Mask-RCNN by running python setup.py install in Mask-RCNN's directory.
Finally, in your code, add the path of your local copy of the Mask-RCNN library:
sys.path.append(r"/home/chientm/loads/Mask_RCNN") # To find local version of the library
Make sure STEPS_PER_EPOCH and VALIDATION_STEPS are correct, IMAGE_MAX_DIM is not too large, IMAGES_PER_GPU is small enough, and ANCHOR_SIZES are not too large. Any incompatible hyperparameter will leave your training stuck at Epoch 1/1 forever, and Mask-RCNN won't tell you that anything went wrong because Keras reports no error here.
Consider using Google Colab. With its TPU, you won't need to set use_multiprocessing=False or workers to 1. It allots 12.72GB of RAM. I'm currently training on a 6-class dataset with a maximum image size of 1024*1024. The RAM being used is about 1.55GB, so it works pretty well!
I tried setting workers=1 and use_multiprocessing=False in model.py as suggested above, but I still get a warning about use_multiprocessing=True :(
[cuda 9.0, cudnn 7.0.5, tensorflow-gpu 1.6.0, keras 2.2.4] I fixed this by adding the following code to model.py:
def get_session():
    config = tf.ConfigProto()
    config.gpu_options.allow_growth = True
    return tf.Session(config=config)

keras.backend.tensorflow_backend.set_session(get_session())
@Mandar-Patil-651 How do I change the code to train with the Google Colab TPU? Thanks!
@YubinXie @abdou31 Change the 'layers' parameter when training. Previously I was using layers='3+'. These are my changes:
1) model.py: use_multiprocessing=False, max_queue_size=100
2) model.py: workers=1 instead of workers=multiprocessing.cpu_count()
3) config.py: reduced TRAIN_ROIS_PER_IMAGE from 200 to 50
4) custom.py: model.train(dataset_train, dataset_val, learning_rate=config.LEARNING_RATE, epochs=2, layers='heads')
With these changes it works fine.
On Windows, tensorflow-gpu==1.6 and keras==2.1.0, plus use_multiprocessing=False and workers=1 in model.py, works fine.
In my case it was simply a problem with data. I've created a parser for labelme json files and missed a bit. Hence data was not good enough to begin training. #1873 So make sure you run your data inspection beforehand.
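For anyone parsing labelme files, a minimal sanity check along these lines could catch bad annotations before training. This is only a sketch: it assumes the usual labelme JSON keys ("shapes", "label", "points", "shape_type", "imageWidth", "imageHeight"), and check_labelme_annotation is a hypothetical helper, not part of Mask_RCNN or labelme.

```python
def check_labelme_annotation(annotation):
    """Return a list of human-readable problems found in one annotation dict."""
    problems = []
    width = annotation.get("imageWidth")
    height = annotation.get("imageHeight")
    shapes = annotation.get("shapes") or []
    if not shapes:
        problems.append("no shapes: image has no ground-truth instances")
    for i, shape in enumerate(shapes):
        if not shape.get("label"):
            problems.append(f"shape {i}: missing label")
        points = shape.get("points") or []
        if shape.get("shape_type", "polygon") == "polygon" and len(points) < 3:
            problems.append(f"shape {i}: polygon has fewer than 3 points")
        if width is not None and height is not None:
            for x, y in points:
                if not (0 <= x <= width and 0 <= y <= height):
                    # Report the first out-of-bounds point only.
                    problems.append(f"shape {i}: point ({x}, {y}) outside image")
                    break
    return problems


# Illustrative data only: one valid annotation and one broken one.
good = {
    "imageWidth": 100,
    "imageHeight": 80,
    "shapes": [{"label": "cat", "shape_type": "polygon",
                "points": [[10, 10], [50, 10], [30, 40]]}],
}
bad = {
    "imageWidth": 100,
    "imageHeight": 80,
    "shapes": [{"label": "", "shape_type": "polygon",
                "points": [[10, 10], [500, 10]]}],
}
print(check_labelme_annotation(good))
print(check_labelme_annotation(bad))
```

In practice you would load each file with json.load and run the check over the whole dataset before calling model.train().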
Same problem:
Ubuntu: 18.04
Keras: 2.3.1
Tensorflow: 2.0.0
GPU: NVIDIA RTX 2070
CUDA Version: 10.1
I am using fit_generator(). This worked fine for me in the past, but I recently updated to the current versions of tensorflow and keras and now nothing works. It just hangs at Epoch 1/1.
Maybe this will help https://stackoverflow.com/questions/48038417/keras-stops-working-on-first-epoch
Set verbose = 1 to have a progress bar.
I have seen a similar problem when porting Mask RCNN to TensorFlow 2.0. It actually happened when switching from standalone Keras to tf.keras. If you use TensorFlow 2.0, you might want to try #1896 and see if it works for you. The latest released version (2.1) and master of matterport on GitHub DO NOT support TensorFlow 2.0 onward at the moment!
Downgrading my Keras from 2.2.4 to 2.1 fixed the issue for me.
Yes, setting workers=1 and use_multiprocessing=False in model.py seems to work, but we are trying to train it on multiple GPUs.
Downgrading Keras to 2.1 fixed the issue for me. I also used use_multiprocessing=True and workers=1, and it still works for me.
Having same issue, downgraded keras to 2.1, worker = 1... what could be the problem?
Thank you!!!! It's working for me too after installing keras 2.1
This is the configuration that is working for me (after trying a lot of combinations). Hope it helps:
Windows 10
Keras: 2.2.0
tensorflow: 1.13.1
tensorflow-gpu: 1.13.1
CUDA Version: 10.0
cuDNN v7.6.5
NVIDIA drivers v441.22
GPU: RTX 2070 (notebook)
In model.py/self.keras_model.fit_generator: workers=1, use_multiprocessing=False
GPU_COUNT = 1
IMAGES_PER_GPU = 1
I had the same issue, I downgraded to:
tensorflow=1.5.0
keras=2.1.5
And now everything is working fine. Note: use a GPU, since it may take a long time to run on a CPU.
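Since so many of the fixes in this thread come down to version combinations, it may help to print what is actually installed before changing anything. A minimal sketch, assuming Python 3.8 or newer (importlib.metadata is not available in older versions); installed_versions is a made-up helper name and the default package list is just an example.

```python
from importlib import metadata


def installed_versions(packages=("keras", "tensorflow", "tensorflow-gpu")):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print one line per package so the output can be pasted into a bug report.
for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```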
Hello everyone, can anyone help me with this? I set steps per epoch = 200, which I take to mean 200 steps to process my whole dataset per iteration, with epochs = 8 and a training dataset of 1000 images.
My question is: how many images from the training dataset does it take to train 1 epoch? @parth-singh71 @pluviosilla @tomgross @FerranRebollar @Gabrielsonnn @moinnadeem
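A rough way to reason about this question, assuming the batch size is IMAGES_PER_GPU * GPU_COUNT (as I understand config.py computes it) and that every step consumes one batch. The IMAGES_PER_GPU value below is only an example since the post does not give one, and images_per_epoch is a hypothetical helper.

```python
def images_per_epoch(steps_per_epoch, images_per_gpu, gpu_count=1):
    """Number of images drawn from the generator in one epoch."""
    batch_size = images_per_gpu * gpu_count
    return steps_per_epoch * batch_size


# Example values: 200 steps, 2 images per GPU, 1 GPU, 1000 training images.
seen = images_per_epoch(steps_per_epoch=200, images_per_gpu=2, gpu_count=1)
print(seen)          # 400 images drawn per epoch
print(seen / 1000)   # fraction of a 1000-image dataset covered per epoch
```

Under these assumptions an epoch is defined by the step count, not by one full pass over the dataset, so 200 steps cover the 1000 images only if the batch size is 5.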
Had a very similar issue, this is what worked for me:
1 - Make sure IMAGES_PER_GPU = 1 instead of IMAGES_PER_GPU = 2.
2 - In file mrcnn/model.py, I changed line 2362 to workers = 1 and line 2374 to use_multiprocessing=False
Hi,
I would like to train Mask RCNN for a single object detection and segmentation. I followed the approach described in the train_shapes.ipynb file. Besides, I have verified the prepared data (bounding box and mask) using the inspect_data.ipynb file. However, when I run training with my data (using train_shapes.ipynb file) it does not progress and only displays:
Epoch 1/1
At this point, it is difficult to know what went wrong as it does not show any other error message. Therefore, I would like to ask for help to resolve this situation ...
Thanks.