Training code seems to get stuck even for a very small set of input images

indranilsinharoy commented 4 years ago

Thanks very much for sharing the code.

To do a quick test of the training code, I downloaded a few of the YouTube clips from the RealEstate10K dataset and placed the extracted frames in the stereo-magnification\images directory. The corresponding camera files are in the stereo-magnification\train directory.

However, when I run train.py, the program doesn't seem to proceed past the session.run() call (I think). I'm pasting the log below (note that I've removed some warning messages about deprecated functions). I don't see any progress after the line INFO:tensorflow:parameter_count = 16892227, even after waiting for more than 10 hours. Since I placed just a few (around 25) low-resolution images in the images directory, I expected training to finish within a few hours.

INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Starting standard services.
INFO:tensorflow:Starting queue runners.
INFO:tensorflow:Trainable variables: 
INFO:tensorflow:net/conv1_1/weights:0
INFO:tensorflow:net/conv1_1/LayerNorm/beta:0
INFO:tensorflow:net/conv1_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv1_2/weights:0
INFO:tensorflow:net/conv1_2/LayerNorm/beta:0
INFO:tensorflow:net/conv1_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv2_1/weights:0
INFO:tensorflow:net/conv2_1/LayerNorm/beta:0
INFO:tensorflow:net/conv2_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv2_2/weights:0
INFO:tensorflow:net/conv2_2/LayerNorm/beta:0
INFO:tensorflow:net/conv2_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv3_1/weights:0
INFO:tensorflow:net/conv3_1/LayerNorm/beta:0
INFO:tensorflow:net/conv3_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv3_2/weights:0
INFO:tensorflow:net/conv3_2/LayerNorm/beta:0
INFO:tensorflow:net/conv3_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv3_3/weights:0
INFO:tensorflow:net/conv3_3/LayerNorm/beta:0
INFO:tensorflow:net/conv3_3/LayerNorm/gamma:0
INFO:tensorflow:net/conv4_1/weights:0
INFO:tensorflow:net/conv4_1/LayerNorm/beta:0
INFO:tensorflow:net/conv4_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv4_2/weights:0
INFO:tensorflow:net/conv4_2/LayerNorm/beta:0
INFO:tensorflow:net/conv4_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv4_3/weights:0
INFO:tensorflow:net/conv4_3/LayerNorm/beta:0
INFO:tensorflow:net/conv4_3/LayerNorm/gamma:0
INFO:tensorflow:net/conv6_1/weights:0
INFO:tensorflow:net/conv6_1/LayerNorm/beta:0
INFO:tensorflow:net/conv6_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv6_2/weights:0
INFO:tensorflow:net/conv6_2/LayerNorm/beta:0
INFO:tensorflow:net/conv6_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv6_3/weights:0
INFO:tensorflow:net/conv6_3/LayerNorm/beta:0
INFO:tensorflow:net/conv6_3/LayerNorm/gamma:0
INFO:tensorflow:net/conv7_1/weights:0
INFO:tensorflow:net/conv7_1/LayerNorm/beta:0
INFO:tensorflow:net/conv7_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv7_2/weights:0
INFO:tensorflow:net/conv7_2/LayerNorm/beta:0
INFO:tensorflow:net/conv7_2/LayerNorm/gamma:0
INFO:tensorflow:net/conv8_1/weights:0
INFO:tensorflow:net/conv8_1/LayerNorm/beta:0
INFO:tensorflow:net/conv8_1/LayerNorm/gamma:0
INFO:tensorflow:net/conv8_2/weights:0
INFO:tensorflow:net/conv8_2/LayerNorm/beta:0
INFO:tensorflow:net/conv8_2/LayerNorm/gamma:0
INFO:tensorflow:net/color_pred/weights:0
INFO:tensorflow:net/color_pred/biases:0
INFO:tensorflow:parameter_count = 16892227

My system's configuration is provided below:

OS: Ubuntu 19.04
Python: 2.7
TensorFlow version: 1.13.1
GPU information:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 4000     Off  | 00000000:02:00.0  On |                  N/A |
| 30%   39C    P8    12W / 125W |   7678MiB /  7977MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P4000        Off  | 00000000:03:00.0 Off |                  N/A |
| 46%   33C    P8     5W / 105W |     91MiB /  8119MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17066      G   /usr/lib/xorg/Xorg                           313MiB |
|    0     31117      C   python                                      7353MiB |
|    1     31117      C   python                                        79MiB |
+-----------------------------------------------------------------------------+

It would be great if you could provide some insight for solving this issue.

Thank you very much.

reyet commented 4 years ago

It sounds like the input pipeline might be running forever but not producing any data. Since it's not much data, can you show us exactly what your directory structure and files look like with "ls -R"?

indranilsinharoy commented 4 years ago

@reyet Thank you so much for your reply. Please see the directory tree structure below. I have removed some unnecessary parts of the tree to keep it concise. Please note that in this example I used just 5 camera specification files (in the train directory).

indranil@root3563:~/stereo_magnification$ tree
.
├── checkpoints
├── CONTRIBUTING.md
├── evaluate.py
├── examples
├── geometry
├── images
│   ├── Eh6a2OB-xAg
│   │   ├── Eh6a2OB-xAg_158992000.jpg
│   │   ├── Eh6a2OB-xAg_159025000.jpg
│   │   ├── Eh6a2OB-xAg_159059000.jpg
│   │   ├── Eh6a2OB-xAg_159092000.jpg
│   │   ├── Eh6a2OB-xAg_159125000.jpg
│   │   ├── Eh6a2OB-xAg_159159000.jpg
│   │   ├── Eh6a2OB-xAg_159192000.jpg
│   │   ├── Eh6a2OB-xAg_159225000.jpg
│   │   ├── Eh6a2OB-xAg_159259000.jpg
│   │   ├── Eh6a2OB-xAg_159292000.jpg
│   │   ├── Eh6a2OB-xAg_159326000.jpg
│   │   ├── Eh6a2OB-xAg_159359000.jpg
│   │   ├── Eh6a2OB-xAg_159392000.jpg
│   │   ├── Eh6a2OB-xAg_159426000.jpg
│   │   ├── Eh6a2OB-xAg_159459000.jpg
│   │   ├── Eh6a2OB-xAg_159492000.jpg
│   │   ├── Eh6a2OB-xAg_159526000.jpg
│   │   ├── Eh6a2OB-xAg_159559000.jpg
│   │   ├── Eh6a2OB-xAg_159592000.jpg
│   │   ├── Eh6a2OB-xAg_159626000.jpg
│   │   ├── Eh6a2OB-xAg_159659000.jpg
│   │   ├── Eh6a2OB-xAg_159693000.jpg
│   │   ├── Eh6a2OB-xAg_159726000.jpg
│   │   ├── Eh6a2OB-xAg_159759000.jpg
│   │   └── Eh6a2OB-xAg_159793000.jpg
│   ├── f7o82npo-Ww
│   │   ├── f7o82npo-Ww_42076000.jpg
│   │   ├── f7o82npo-Ww_42109000.jpg
│   │   ├── f7o82npo-Ww_42142000.jpg
│   │   ├── f7o82npo-Ww_42176000.jpg
│   │   ├── f7o82npo-Ww_42209000.jpg
│   │   ├── f7o82npo-Ww_42243000.jpg
│   │   ├── f7o82npo-Ww_42276000.jpg
│   │   ├── f7o82npo-Ww_42343000.jpg
│   │   ├── f7o82npo-Ww_42376000.jpg
│   │   ├── f7o82npo-Ww_42409000.jpg
│   │   ├── f7o82npo-Ww_42443000.jpg
│   │   ├── f7o82npo-Ww_42476000.jpg
│   │   ├── f7o82npo-Ww_42509000.jpg
│   │   ├── f7o82npo-Ww_42543000.jpg
│   │   ├── f7o82npo-Ww_42576000.jpg
│   │   ├── f7o82npo-Ww_42610000.jpg
│   │   ├── f7o82npo-Ww_42643000.jpg
│   │   ├── f7o82npo-Ww_42676000.jpg
│   │   ├── f7o82npo-Ww_42742000.jpg
│   │   ├── f7o82npo-Ww_42776000.jpg
│   │   ├── f7o82npo-Ww_42809000.jpg
│   │   ├── f7o82npo-Ww_42842000.jpg
│   │   ├── f7o82npo-Ww_42876000.jpg
│   │   ├── f7o82npo-Ww_42909000.jpg
│   │   ├── f7o82npo-Ww_42943000.jpg
│   │   ├── f7o82npo-Ww_42976000.jpg
│   │   ├── f7o82npo-Ww_43009000.jpg
│   │   ├── f7o82npo-Ww_43043000.jpg
│   │   ├── f7o82npo-Ww_43076000.jpg
│   │   ├── f7o82npo-Ww_43109000.jpg
│   │   ├── f7o82npo-Ww_43143000.jpg
│   │   ├── f7o82npo-Ww_43176000.jpg
│   │   ├── f7o82npo-Ww_43210000.jpg
│   │   ├── f7o82npo-Ww_43243000.jpg
│   │   ├── f7o82npo-Ww_43276000.jpg
│   │   ├── f7o82npo-Ww_43310000.jpg
│   │   ├── f7o82npo-Ww_43343000.jpg
│   │   └── f7o82npo-Ww_43376000.jpg
│   ├── GclE7CWkz1s
│   │   ├── GclE7CWkz1s_150300000.jpg
│   │   ├── GclE7CWkz1s_150333333.jpg
│   │   ├── GclE7CWkz1s_150366667.jpg
│   │   ├── GclE7CWkz1s_150400000.jpg
│   │   ├── GclE7CWkz1s_150433333.jpg
│   │   ├── GclE7CWkz1s_150466667.jpg
│   │   ├── GclE7CWkz1s_150500000.jpg
│   │   ├── GclE7CWkz1s_150533333.jpg
│   │   ├── GclE7CWkz1s_150566667.jpg
│   │   ├── GclE7CWkz1s_150600000.jpg
│   │   ├── GclE7CWkz1s_150633333.jpg
│   │   ├── GclE7CWkz1s_150666667.jpg
│   │   ├── GclE7CWkz1s_150700000.jpg
│   │   ├── GclE7CWkz1s_150733333.jpg
│   │   ├── GclE7CWkz1s_150766667.jpg
│   │   ├── GclE7CWkz1s_150800000.jpg
│   │   ├── GclE7CWkz1s_150833333.jpg
│   │   ├── GclE7CWkz1s_150866667.jpg
│   │   ├── GclE7CWkz1s_150900000.jpg
│   │   ├── GclE7CWkz1s_150933333.jpg
│   │   ├── GclE7CWkz1s_150966667.jpg
│   │   ├── GclE7CWkz1s_151000000.jpg
│   │   ├── GclE7CWkz1s_151033333.jpg
│   │   ├── GclE7CWkz1s_151066667.jpg
│   │   ├── GclE7CWkz1s_151100000.jpg
│   │   ├── GclE7CWkz1s_151133333.jpg
│   │   ├── GclE7CWkz1s_151166667.jpg
│   │   ├── GclE7CWkz1s_151200000.jpg
│   │   ├── GclE7CWkz1s_151233333.jpg
│   │   ├── GclE7CWkz1s_151266667.jpg
│   │   ├── GclE7CWkz1s_151300000.jpg
│   │   ├── GclE7CWkz1s_151333333.jpg
│   │   ├── GclE7CWkz1s_151366667.jpg
│   │   ├── GclE7CWkz1s_151400000.jpg
│   │   ├── GclE7CWkz1s_151433333.jpg
│   │   ├── GclE7CWkz1s_151466667.jpg
│   │   ├── GclE7CWkz1s_151500000.jpg
│   │   ├── GclE7CWkz1s_151533333.jpg
│   │   ├── GclE7CWkz1s_151566667.jpg
│   │   ├── GclE7CWkz1s_151600000.jpg
│   │   └── GclE7CWkz1s_151633333.jpg
│   ├── OT04jHhqYyw
│   │   ├── OT04jHhqYyw_110133333.jpg
│   │   ├── OT04jHhqYyw_110166667.jpg
│   │   ├── OT04jHhqYyw_110200000.jpg
│   │   ├── OT04jHhqYyw_110233333.jpg
│   │   ├── OT04jHhqYyw_110266667.jpg
│   │   ├── OT04jHhqYyw_110300000.jpg
│   │   ├── OT04jHhqYyw_110333333.jpg
│   │   ├── OT04jHhqYyw_110366667.jpg
│   │   ├── OT04jHhqYyw_110400000.jpg
│   │   ├── OT04jHhqYyw_110433333.jpg
│   │   ├── OT04jHhqYyw_110466667.jpg
│   │   ├── OT04jHhqYyw_110500000.jpg
│   │   ├── OT04jHhqYyw_110533333.jpg
│   │   ├── OT04jHhqYyw_110566667.jpg
│   │   ├── OT04jHhqYyw_110600000.jpg
│   │   ├── OT04jHhqYyw_110633333.jpg
│   │   └── OT04jHhqYyw_110666667.jpg
│   └── xTOs9uW6_bo
│       ├── xTOs9uW6_bo_86333333.jpg
│       ├── xTOs9uW6_bo_86366667.jpg
│       ├── xTOs9uW6_bo_86400000.jpg
│       ├── xTOs9uW6_bo_86433333.jpg
│       ├── xTOs9uW6_bo_86466667.jpg
│       ├── xTOs9uW6_bo_86500000.jpg
│       ├── xTOs9uW6_bo_86533333.jpg
│       ├── xTOs9uW6_bo_86566667.jpg
│       ├── xTOs9uW6_bo_86600000.jpg
│       ├── xTOs9uW6_bo_86633333.jpg
│       ├── xTOs9uW6_bo_86666667.jpg
│       ├── xTOs9uW6_bo_86700000.jpg
│       ├── xTOs9uW6_bo_86733333.jpg
│       ├── xTOs9uW6_bo_86766667.jpg
│       ├── xTOs9uW6_bo_86800000.jpg
│       ├── xTOs9uW6_bo_86833333.jpg
│       ├── xTOs9uW6_bo_86866667.jpg
│       ├── xTOs9uW6_bo_86900000.jpg
│       ├── xTOs9uW6_bo_86933333.jpg
│       ├── xTOs9uW6_bo_86966667.jpg
│       ├── xTOs9uW6_bo_87000000.jpg
│       ├── xTOs9uW6_bo_87033333.jpg
│       ├── xTOs9uW6_bo_87066667.jpg
│       ├── xTOs9uW6_bo_87100000.jpg
│       ├── xTOs9uW6_bo_87133333.jpg
│       └── xTOs9uW6_bo_87166667.jpg
├── __init__.py
├── LICENSE
├── models
├── mpi_from_images.py
├── README.md
├── scripts
├── stereomag
├── test.py
├── third_party
├── train
│   ├── 0a0a998c176713fd.txt
│   ├── 0a16a992457df4a8.txt
│   ├── 0a1a4430d0061081.txt
│   ├── 0a6d080826d442d9.txt
│   └── 0a7d51ed7990aefd.txt
├── train.py

Best regards, Indranil.
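
For anyone comparing their own setup, the number of frames in each sequence directory of a tree like the one above can be tallied with a short script. This is only a minimal sketch: the function name `count_frames`, the `images` root, and the `.jpg` extension are assumptions chosen to match the listing, not part of the repository.

```python
import os


def count_frames(images_root, extension=".jpg"):
    """Return {sequence_dir: number_of_frames} for every directory under
    images_root that contains at least one file with the given extension."""
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(images_root):
        n = sum(1 for f in filenames if f.lower().endswith(extension))
        if n:
            counts[os.path.relpath(dirpath, images_root)] = n
    return counts


if __name__ == "__main__":
    # "images" is assumed to be the frame directory shown in the tree above.
    for name, n in sorted(count_frames("images").items()):
        print("%s: %d frames" % (name, n))
```

Run from the repository root, this prints one line per clip, which makes it easy to spot sequences that contain only a few frames.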

indranilsinharoy commented 4 years ago

Hi @reyet, did you get a chance to take a look at the directory structure? Do you think there is a problem with it? Thanks very much in advance.

Hi @PuneetKohli, @olivertai, @lolz0r, if you have been able to run the training part, could you kindly suggest anything, based on your experience, that might help me resolve this issue? Thanks very much for your time and help.

bruce-wayne99 commented 4 years ago

@indranilsinharoy did you find a solution for this? I am also having the same issue while training the network.

indranilsinharoy commented 4 years ago

@bruce-wayne99 Unfortunately, I've not been able to solve it, and I have temporarily moved on to other things. If I had to try again with a different environment, I would try using Ubuntu 18.x instead of Ubuntu 19.04 (not sure if you are using the same OS or not) and also try a different CUDA version ... just some thoughts.

reyet commented 4 years ago

Sorry for not replying sooner. I did take a look at your directory structure, and it looked correct to me so I'm afraid I don't know what is going wrong.

indranilsinharoy commented 4 years ago

@reyet No problem at all. Thank you very much. I was guessing as much :-) Once I get back to it, I'll try some more things (mostly with the environment, I guess). If I do find the problem, I'll surely post it here.

bruce-wayne99 commented 4 years ago

@indranilsinharoy Thanks for the info. The issue was fixed after I changed the CUDA version to 9.0; I was using CUDA version 10.0 before.

OS: Ubuntu 16.04
CUDA: cudnn/7.6.4-cuda-9.0, cuda9.0
TensorFlow package: tensorflow-gpu==1.14.0

indranilsinharoy commented 4 years ago

@bruce-wayne99 Thanks very much for posting your solution here. I hope it will help several others if they face similar problems. At least I know that is the first thing I must do!

indranilsinharoy commented 4 years ago

@reyet, @bruce-wayne99 Please feel free to close the issue if you see fit.

reyet commented 4 years ago

Thanks @bruce-wayne99 for figuring that out!

bruce-wayne99 commented 4 years ago

@reyet @indranilsinharoy I just went through the code briefly, and I think the version is not the issue. If you look at the loader.py file, sequences are generated using a stride, with min_stride=3 and max_stride=10 by default. Each time, a random number between min_stride and max_stride is chosen as the stride, and a subsequence of length 10 (sequence_length=10) is then selected from the given sequence. Your sequence must therefore have a minimum length of (sequence_length - 1) * stride + 1; all sequences shorter than this are removed.

Since the stride is a random number, let us take it to be (min_stride + max_stride)//2, which is (3+10)//2 = 6. Each sequence in your data should then contain at least (10-1)*6 + 1 = 55 frames. If your sequences are shorter than this, they are not used as data, so the dataset may be ending up empty and the code getting stuck.

I was able to make it work by setting max_stride=min_stride=1. Other options are to reduce sequence_length, or to use a bigger dataset (more frames in each sequence).
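
The length requirement described above can be checked with a few lines of plain Python. This is a sketch of the arithmetic only, not the code in loader.py: the helper names `min_required_length` and `usable_strides` are hypothetical, the defaults are the values quoted in the comment, and the frame counts are simply the number of .jpg files per directory in the listing posted earlier in this thread. Whether the loader measures sequence length from the image files or from the camera files is not confirmed here.

```python
def min_required_length(sequence_length, stride):
    """Frames needed to pick `sequence_length` frames spaced `stride` apart."""
    return (sequence_length - 1) * stride + 1


def usable_strides(num_frames, sequence_length=10, min_stride=3, max_stride=10):
    """Strides in [min_stride, max_stride] that a clip of `num_frames` supports."""
    return [s for s in range(min_stride, max_stride + 1)
            if num_frames >= min_required_length(sequence_length, s)]


# Number of .jpg files per directory in the tree listing from this thread.
frame_counts = {
    "Eh6a2OB-xAg": 25,
    "f7o82npo-Ww": 38,
    "GclE7CWkz1s": 41,
    "OT04jHhqYyw": 17,
    "xTOs9uW6_bo": 26,
}

for name, count in sorted(frame_counts.items()):
    print("%s: %d frames, usable strides: %s"
          % (name, count, usable_strides(count)))
```

Under this formula, a stride of 3 already needs 28 frames and a stride of 10 needs 91, so with the default strides only the 38- and 41-frame clips would support any stride at all (3 or 4). With min_stride=max_stride=1 the requirement drops to 10 frames, which every clip in the listing meets.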

indranilsinharoy commented 4 years ago

@bruce-wayne99 Thanks very much. I'll check it out.

phongnhhn92 commented 4 years ago

Hi @indranilsinharoy, I just want to ask how you managed to prepare the RealEstate10K dataset for training. Are the txt files in the train folder the same as the txt files downloaded from the dataset? Also, the images in each folder are the extracted files for each scene ID, right? Did you perform any pre-processing on them?

google / stereo-magnification

Training code seems to get stuck even for a very small set of input images #25