mustardlove opened this issue 5 years ago
In DSOD_train.ipynb, the batch size is actually 6 and the gradients get accumulated with AdamAccumulate for 128//6 batches before a gradient update is performed. This results in a virtual batch size of 126, but the log is updated after each batch.
Setting the batch size to 4 or even 2 should solve the issue. How large is your GPU memory?
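To make the accounting explicit, here is a minimal sketch of the arithmetic; the AdamAccumulate arguments shown in the commented-out lines are an assumption for illustration and may not match the actual implementation in the repo:

```python
# Sketch of the virtual-batch-size arithmetic described above.
batch_size = 6                                   # mini-batch that fits in GPU memory
target_batch_size = 128                          # batch size we want to emulate
accum_iters = target_batch_size // batch_size    # 128 // 6 = 21 accumulation steps
virtual_batch_size = accum_iters * batch_size    # 21 * 6 = 126

print(accum_iters, virtual_batch_size)  # 21 126

# Hypothetical usage (argument names are an assumption, check the repo):
# gradients are summed over accum_iters mini-batches before one weight update.
# optim = AdamAccumulate(lr=1e-3, accum_iters=accum_iters)
# model.compile(optimizer=optim, loss=loss)
```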
Thank you so much for your kind help! I changed the batch size for the 512 model to 4 and the training code is running!
I'm using two Titan Xp GPUs and the memory specs are as follows: Memory Speed 11.4 Gbps, Standard Memory Config 12 GB GDDR5X, Memory Interface Width 384-bit, Memory Bandwidth 547.7 GB/s.
Currently the execution is using only one GPU... I don't know why.
I have one more question!
In your data_coco.py, there is a convert_to_voc function. I'm only using the COCO dataset, so in DSOD_train I commented out the code related to the VOC dataset and did gt_util_train = gt_util_coco.convert_to_voc() and gt_util_val = gt_util_coco_val.convert_to_voc(). Does this make DSOD_train train on only 21 categories? I figured you only have 21 initial weights.
I've always used 1 GPU for training a model, but it should work with multiple GPUs as well. The documentation of Model.fit_generator() explains how to do this.
convert_to_voc in the COCO case returns a new GTUtility with COCO data, but with the 20 (21 including background) VOC classes, leading to a model with 21 categories.
The weights you mentioned are not trainable parameters... See #14 for more details.
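For what it's worth, the flow looks roughly like this; the GTUtility constructor arguments and the num_classes attribute are assumptions used for illustration and may not match the repo exactly:

```python
# Sketch: restrict COCO ground truth to the 20 VOC classes (+ background).
# Constructor arguments and attribute names are illustrative assumptions.
from data_coco import GTUtility

gt_util_coco = GTUtility('data/COCO/', validation=False)
gt_util_coco_val = GTUtility('data/COCO/', validation=True)

# convert_to_voc() returns a new GTUtility that keeps only the boxes belonging
# to the 20 VOC classes, so the resulting model has 21 categories
# (20 classes + background).
gt_util_train = gt_util_coco.convert_to_voc()
gt_util_val = gt_util_coco_val.convert_to_voc()

print(gt_util_train.num_classes)  # expected: 21
```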
Thank you for the reply!
I played with some parameters in fit_generator() (use_multiprocessing=True, workers=2), but still only one GPU was active.
I also tried using multi_gpu_model from keras.utils, but it failed with _TfDeviceCaptureOp does not have method _set_device_from_string. I found that the class _TfDeviceCaptureOp in tensorflow/python/keras/backend.py does have _set_device_from_string, but the one in keras/backend/tensorflow_backend.py does not.
If anyone has solved this issue, please share your knowledge. Thank you!
Search for keras.utils.multi_gpu_model. use_multiprocessing=True and workers=2 refer to data loading, not to the number of GPUs used.
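A minimal sketch of what such a multi-GPU setup could look like, assuming a Keras/TensorFlow combination where multi_gpu_model works (the _set_device_from_string error above points to a version mismatch between standalone Keras and the bundled tf.keras backend); the dummy model and parameter values are placeholders, not the DSOD model from this repo:

```python
# Sketch: replicate a model on two GPUs with keras.utils.multi_gpu_model.
# Assumes a compatible Keras/TensorFlow pairing; the dummy model below is a
# placeholder for illustration only.
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import multi_gpu_model

model = Sequential([Dense(10, activation='relu', input_shape=(32,)),
                    Dense(1)])

parallel_model = multi_gpu_model(model, gpus=2)  # split each batch across 2 GPUs
parallel_model.compile(optimizer='adam', loss='mse')

# use_multiprocessing / workers in fit_generator only affect data loading,
# not how many GPUs are used.
x = np.random.rand(64, 32).astype('float32')
y = np.random.rand(64, 1).astype('float32')
parallel_model.fit(x, y, batch_size=16, epochs=1)
```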
Hello Mr. Volk, thank you very much for your nice code! I have one question for you.
I'm new to deep learning, have only a basic understanding of Keras code, and am currently trying to run your DSOD_train.py. The problem is, I keep getting OOM errors while executing the "Train" section of the code (error message below).
I tried using only one GPU out of the two I have, and using the 'allow_growth' option in TensorFlow, and neither worked. I believe I need to reduce the minibatch size (I guess your code uses a batch size of 128, am I right?), but I have no idea where to find the code to make this change. (Just changing batch_size = 26 to a lower number didn't solve the problem, so I searched your .py files but ended up with no clue.)
I'd really appreciate your help with my problem.
By the way, I'm using Ubuntu 16.04 and the latest TensorFlow/Keras.
------------------------------------------------ error message
ResourceExhaustedError Traceback (most recent call last)
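For reference, the 'allow_growth' option mentioned above can be enabled for a TF1-style Keras session roughly like this (a minimal sketch; it only stops TensorFlow from reserving all GPU memory up front and does not help if the model genuinely needs more memory than the GPU has, in which case the batch size must be reduced as discussed above):

```python
# Sketch: let TensorFlow allocate GPU memory on demand (TF 1.x / standalone Keras).
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```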