Closed: monajalal closed this issue 6 years ago
It was slow, but the training eventually finished in ~30 hrs with the following results:
exp: vqa_gt_layout, iter = 200000
loss (vqa) = 0.050071, loss (layout) = 0.000070, loss (rec) = 0.000000, loss (sharpen) = 0.000000, sharpen_scale = 1.000000
accuracy (cur) = 0.968750, accuracy (avg) = 0.977994
snapshot saved to ./exp_clevr_snmn/tfmodel/vqa_gt_layout/00200000
How can I solve this problem (i.e., speed up the training)?
This might sound weird, but for us, training it on a Tesla P100 reduced it to 7 hrs, which is pretty good. That said, depending on what you want out of snmn, 30 hrs on a 1080 Ti is not that big a deal unless you need to change code frequently and retrain.
How many GPUs and what prefetch-num did you use? I also use a Tesla P100.
Hello~ I'd like to learn more details. Did you just git clone this code and run it in single-GPU mode (P100), and it only took about 7 hrs?
During training I keep getting the following message: "data reader: waiting for data loading (IO slow)".
Also, at the beginning I got the following message: "imdb does not contain bounding boxes".
Do you get these messages too? Or do you know how I could improve this?
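For what it's worth, that "IO slow" warning usually just means the training loop is consuming batches faster than the background data reader can load them, so the GPU stalls waiting on the prefetch queue. I haven't checked snmn's exact reader code, so the sketch below is only an illustration of the general producer/consumer pattern (all names and timings are made up); the prefetch-num style option controls the depth of such a queue:

```python
import queue
import threading
import time

def reader_thread(q, n_batches, load_time):
    """Producer: simulates a data reader that needs `load_time` s per batch."""
    for i in range(n_batches):
        time.sleep(load_time)  # pretend to read/preprocess a batch from disk
        q.put(i)
    q.put(None)  # sentinel: no more data

def train_loop(q):
    """Consumer: pulls batches; q.get() blocks ("IO slow") when the queue is empty."""
    batches = []
    while (batch := q.get()) is not None:
        batches.append(batch)
    return batches

# A larger maxsize (the prefetch depth) lets the reader run ahead of training,
# hiding disk latency; if it is too small, the consumer stalls on every batch.
prefetch_queue = queue.Queue(maxsize=8)
t = threading.Thread(target=reader_thread, args=(prefetch_queue, 20, 0.001))
t.start()
processed = train_loop(prefetch_queue)
t.join()
print(len(processed))
```

If the warning appears constantly, moving the dataset to a faster disk (SSD) or increasing the number of reader threads / prefetch depth in the config is usually what helps.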
python exp_clevr_snmn/train_net_vqa.py --cfg exp_clevr_snmn/cfgs/vqa_gt_layout.yaml
Here is the output of nvidia-smi while the model is being trained.
Please let me know if you have any suggestions.