google / lasr

Code for "LASR: Learning Articulated Shape Reconstruction from a Monocular Video". CVPR 2021.
https://lasr-google.github.io/
Apache License 2.0

Pdb mode #3

Closed Kana-alt closed 3 years ago

Kana-alt commented 3 years ago

Sorry for asking again and again.

After I run "bash scripts/spot3.sh", the terminal drops into Pdb mode. What should I enter here?

gengshan-y commented 3 years ago

Could you attach the full terminal output? Dropping into Pdb suggests the loss went to NaN.

Kana-alt commented 3 years ago

The output is as follows

$ bash scripts/spot3.sh
Jitting Chamfer 3D
Loaded JIT 3D CUDA chamfer distance
1/0

workers: 1

pairs: 1

init:0, end:-1 198 paris of images Only the mean shape is symmetric! found mean v found tex found ctl rotation found rest translation found ctl points found log ctl running k-means on cuda:0.. [running kmeans]: 0it [00:00, ?it/s, center_shift=12.755209, iteration=1, tol=0.[running kmeans]: 1it [00:00, 160.34it/s, center_shift=2.029201, iteration=2, to[running kmeans]: 2it [00:00, 216.21it/s, center_shift=0.611189, iteration=3, to[running kmeans]: 3it [00:00, 236.17it/s, center_shift=0.375465, iteration=4, to[running kmeans]: 4it [00:00, 256.50it/s, center_shift=0.147063, iteration=5, to[running kmeans]: 5it [00:00, 271.09it/s, center_shift=0.066212, iteration=6, to[running kmeans]: 6it [00:00, 285.10it/s, center_shift=0.023095, iteration=7, to[running kmeans]: 7it [00:00, 290.22it/s, center_shift=0.012533, iteration=8, to[running kmeans]: 8it [00:00, 294.34it/s, center_shift=0.013412, iteration=9, to[running kmeans]: 9it [00:00, 289.23it/s, center_shift=0.000769, iteration=10, t[running kmeans]: 10it [00:00, 289.80it/s, center_shift=0.000823, iteration=11, [running kmeans]: 11it [00:00, 296.20it/s, center_shift=0.002879, iteration=12, [running kmeans]: 12it [00:00, 297.93it/s, center_shift=0.003466, iteration=13, [running kmeans]: 13it [00:00, 300.24it/s, center_shift=0.002981, iteration=14, [running kmeans]: 14it [00:00, 300.13it/s, center_shift=0.001947, iteration=15, [running kmeans]: 15it [00:00, 304.10it/s, center_shift=0.002470, iteration=16, [running kmeans]: 16it [00:00, 308.03it/s, center_shift=0.000000, iteration=17, tol=0.000100]running k-means on cuda:0..

[running kmeans]: 0it [00:00, ?it/s, center_shift=22.510870, iteration=1, tol=0.[running kmeans]: 1it [00:00, 151.02it/s, center_shift=2.741510, iteration=2, to[running kmeans]: 2it [00:00, 200.07it/s, center_shift=1.220065, iteration=3, to[running kmeans]: 3it [00:00, 234.44it/s, center_shift=0.853157, iteration=4, to[running kmeans]: 4it [00:00, 258.50it/s, center_shift=0.540455, iteration=5, to[running kmeans]: 5it [00:00, 266.42it/s, center_shift=0.377577, iteration=6, to[running kmeans]: 6it [00:00, 268.40it/s, center_shift=0.331676, iteration=7, to[running kmeans]: 7it [00:00, 277.28it/s, center_shift=0.322707, iteration=8, to[running kmeans]: 8it [00:00, 282.64it/s, center_shift=0.175176, iteration=9, to[running kmeans]: 9it [00:00, 288.53it/s, center_shift=0.111375, iteration=10, t[running kmeans]: 10it [00:00, 295.35it/s, center_shift=0.088879, iteration=11, [running kmeans]: 11it [00:00, 289.55it/s, center_shift=0.018423, iteration=12, [running kmeans]: 12it [00:00, 294.77it/s, center_shift=0.000564, iteration=13, [running kmearunning k-means on cuda:0..s, center_shift=0.000000, iteration=14, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s, center_shift=16.797604, iteration=1, tol=0.[running kmeans]: 1it [00:00, 156.77it/s, center_shift=2.243906, iteration=2, to[running kmeans]: 2it [00:00, 219.65it/s, center_shift=1.096395, iteration=3, to[running kmeans]: 3it [00:00, 254.32it/s, center_shift=0.756447, iteration=4, to[running kmeans]: 4it [00:00, 263.37it/s, center_shift=0.374705, iteration=5, to[running kmeans]: 5it [00:00, 276.21it/s, center_shift=0.398951, iteration=6, to[running kmeans]: 6it [00:00, 281.59it/s, center_shift=0.259081, iteration=7, to[running kmeans]: 7it [00:00, 288.73it/s, center_shift=0.234184, iteration=8, to[running kmeans]: 8it [00:00, 289.75it/s, center_shift=0.113684, iteration=9, to[running kmeans]: 9it [00:00, 283.74it/s, center_shift=0.077018, iteration=10, t[running kmeans]: 10it [00:00, 288.88it/s, center_shift=0.045436, iteration=11, [running kmeans]: 11it [00:00, 294.71it/s, center_shift=0.025897, iteration=12, [running kmeans]: 12it [00:00, 297.20it/s, center_shift=0.016986, iteration=13, [running kmeans]: 13it [00:00, 301.32it/s, center_shift=0.004283, iteration=14, [running kmeans]: 14it [00:00, 305.79it/s, center_shift=0.006879, iteration=15, [running kmeans]: 15it [00:00, 306.58it/s, center_shift=0.000763, iteration=16, [running kmearunning k-means on cuda:0..s, center_shift=0.000664, iteration=17, [running kmeans]: 17it [00:00, 305.54it/s, center_shift=0.000000, iteration=18, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s, center_shift=16.565763, iteration=1, tol=0.[running kmeans]: 1it [00:00, 160.76it/s, center_shift=1.930030, iteration=2, to[running kmeans]: 2it [00:00, 217.14it/s, center_shift=1.246738, iteration=3, to[running kmeans]: 3it [00:00, 253.97it/s, center_shift=0.760291, iteration=4, to[running kmeans]: 4it [00:00, 275.72it/s, center_shift=0.279815, iteration=5, to[running kmeans]: 5it [00:00, 289.73it/s, center_shift=0.157661, iteration=6, to[running kmeans]: 6it [00:00, 301.82it/s, center_shift=0.155566, iteration=7, to[running kmeans]: 7it [00:00, 284.76it/s, center_shift=0.048134, iteration=8, to[running kmeans]: 8it [00:00, 287.96it/s, center_shift=0.032563, iteration=9, to[running kmeans]: 9it [00:00, 294.43it/s, center_shift=0.038816, iteration=10, t[running kmeans]: 10it [00:00, 298.07it/s, center_shift=0.030720, iteration=11, [running kmeans]: 11it [00:00, 304.17it/s, center_shift=0.008083, iteration=12, [running kmeans]: 12it [00:00, 304.28it/s, center_shift=0.008150, iteration=13, [running kmeans]: 13it [00:00, 306.31it/s, center_shift=0.004026, iteration=14, [running kmeans]: 14it [00:00, 311.10it/s, center_shift=0.005496, iteration=15, [running kmearunning k-means on cuda:0..s, center_shift=0.009976, iteration=16, [running kmeans]: 16it [00:00, 313.87it/s, center_shift=0.000856, iteration=17, [running kmeans]: 17it [00:00, 316.71it/s, center_shift=0.000000, iteration=18, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s, center_shift=16.706129, iteration=1, tol=0.[running kmeans]: 1it [00:00, 175.88it/s, center_shift=4.650821, iteration=2, to[running kmeans]: 2it [00:00, 236.38it/s, center_shift=1.953743, iteration=3, to[running kmeans]: 3it [00:00, 266.52it/s, center_shift=0.789582, iteration=4, to[running kmeans]: 4it [00:00, 289.04it/s, center_shift=0.689171, iteration=5, to[running kmeans]: 5it [00:00, 292.76it/s, center_shift=0.236589, iteration=6, to[running kmeans]: 6it [00:00, 293.24it/s, center_shift=0.145930, iteration=7, to[running kmeans]: 7it [00:00, 300.79it/s, center_shift=0.031621, iteration=8, to[running kmeans]: 8it [00:00, 305.67it/s, center_shift=0.017276, iteration=9, to[running kmeans]: 9it [00:00, 301.45it/s, center_shift=0.020877, iteration=10, t[running kmeans]: 10it [00:00, 300.23it/s, center_shift=0.005033, iteration=11, [running kmeans]: 11it [00:00, 296.60it/s, center_shift=0.019400, iteration=12, [running kmeans]: 12it [00:00, 301.02it/s, center_shift=0.011062, iteration=13, [running kmearunning k-means on cuda:0..s, center_shift=0.003279, iteration=14, [running kmeans]: 14it [00:00, 305.77it/s, center_shift=0.003767, iteration=15, [running kmeans]: 15it [00:00, 304.70it/s, center_shift=0.000738, iteration=16, [running kmeans]: 16it [00:00, 307.97it/s, center_shift=0.000000, iteration=17, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s, center_shift=14.261992, iteration=1, tol=0.[running kmeans]: 1it [00:00, 146.41it/s, center_shift=2.185611, iteration=2, to[running kmeans]: 2it [00:00, 199.73it/s, center_shift=0.917551, iteration=3, to[running kmeans]: 3it [00:00, 219.22it/s, center_shift=0.308779, iteration=4, to[running kmeans]: 4it [00:00, 225.86it/s, center_shift=0.263299, iteration=5, to[running kmeans]: 5it [00:00, 244.03it/s, center_shift=0.155596, iteration=6, to[running kmeans]: 6it [00:00, 254.71it/s, center_shift=0.107818, iteration=7, to[running kmeans]: 7it [00:00, 267.17it/s, center_shift=0.053746, iteration=8, to[running kmeans]: 8it [00:00, 270.68it/s, center_shift=0.025596, iteration=9, to[running kmearunning k-means on cuda:0.., center_shift=0.012926, iteration=10, t[running kmeans]: 10it [00:00, 277.81it/s, center_shift=0.000798, iteration=11, [running kmeans]: 11it [00:00, 277.94it/s, center_shift=0.000676, iteration=12, [running kmeans]: 12it [00:00, 281.96it/s, center_shift=0.000821, iteration=13, [running kmeans]: 13it [00:00, 283.29it/s, center_shift=0.000000, iteration=14, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s, center_shift=15.424303, iteration=1, tol=0.[running kmeans]: 1it [00:00, 153.88it/s, center_shift=1.842625, iteration=2, to[running kmeans]: 2it [00:00, 212.78it/s, center_shift=0.643740, iteration=3, to[running kmeans]: 3it [00:00, 248.36it/s, center_shift=0.337034, iteration=4, to[running kmeans]: 4it [00:00, 254.30it/s, center_shift=0.257219, iteration=5, to[running kmeans]: 5it [00:00, 252.33it/s, center_shift=0.063849, iteration=6, to[running kmeans]: 6it [00:00, 263.81it/s, center_shift=0.025937, iteration=7, to[running kmeans]: 7it [00:00, 274.17it/s, center_shift=0.014644, iteration=8, to[running kmeans]: 8it [00:00, 283.65it/s, center_shift=0.003181, iteration=9, to[running kmeans]: 9it [00:00, 285.06it/s, center_shift=0.006797, iteration=10, t[running kmeans]: 10it [00:00, 291.43it/s, center_shift=0.007008, iteration=11, [running kmeans]: 11it [00:00, 297.45it/s, center_shift=0.002124, iteration=12, [running kmearunning k-means on cuda:0..s, center_shift=0.002736, iteration=13, [running kmeans]: 13it [00:00, 298.72it/s, center_shift=0.003066, iteration=14, [running kmeans]: 14it [00:00, 296.97it/s, center_shift=0.000581, iteration=15, [running kmeans]: 15it [00:00, 292.12it/s, center_shift=0.003025, iteration=16, [running kmeans]: 16it [00:00, 294.59it/s, center_shift=0.002441, iteration=17, [running kmeans]: 17it [00:00, 297.42it/s, center_shift=0.000000, iteration=18, tol=0.000100]

[running kmeans]: 0it [00:00, ?it/s, center_shift=21.720541, iteration=1, tol=0.[running kmeans]: 1it [00:00, 171.91it/s, center_shift=2.742992, iteration=2, to[running kmeans]: 2it [00:00, 209.94it/s, center_shift=1.623287, iteration=3, to[running kmeans]: 3it [00:00, 233.26it/s, center_shift=0.645541, iteration=4, to[running kmeans]: 4it [00:00, 250.92it/s, center_shift=0.422107, iteration=5, to[running kmeans]: 5it [00:00, 268.94it/s, center_shift=0.206953, iteration=6, to[running kmeans]: 6it [00:00, 272.19it/s, center_shift=0.064662, iteration=7, to[running kmeanew bone locations65.05it/s, center_shift=0.025331, iteration=8, toscores:g kmeans]: 8it [00:00, 274.59it/s, center_shift=0.060801, iteration=9, totensor([0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')049627, iteration=10, tselecting 0eans]: 10it [00:00, 279.63it/s, center_shift=0.019994, iteration=11, [running kmeans]: 17it [00:00, 40.11it/s, center_shift=0.000000, iteration=17, tol=0.000100] ns]: 12it [00:00, 282.82it/s, center_shift=0.006282, iteration=13, [running kmeans]: 14it [00:00, 37.74it/s, center_shift=0.000000, iteration=14, tol=0.000100] [running kmeans]: 18it [00:00, 55.43it/s, center_shift=0.000000, iteration=18, tol=0.000100] [running kmeans]: 18it [00:00, 67.46it/s, center_shift=0.000000, iteration=18, tol=0.000100] [running kmeans]: 17it [00:00, 81.03it/s, center_shift=0.000000, iteration=17, tol=0.000100] [running kmeans]: 14it [00:00, 89.70it/s, center_shift=0.000000, iteration=14, tol=0.000100] [running kmeans]: 18it [00:00, 166.93it/s, center_shift=0.000000, iteration=18, tol=0.000100] [running kmeans]: 14it [00:00, 285.18it/s, center_shift=0.000000, iteration=14, tol=0.000100] /home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/kornia/geometry/conversions.py:506: UserWarning: XYZW quaternion coefficient order is deprecated and will be removed after > 0.6. Please use QuaternionCoeffOrder.WXYZ instead. 
  warnings.warn("XYZW quaternion coefficient order is deprecated and"
/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/nn/functional.py:3385: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn("Default grid_sample and affine_grid behavior has changed "
/home/kana/lasr/nnutils/train_utils.py:76: RuntimeWarning: invalid value encountered in true_divide
  timg = (timg-timg.min())/(timg.max()-timg.min())

/home/kana/lasr/nnutils/train_utils.py(295)train()
-> self.optimizer.step()
(Pdb)
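
Editor's note: the RuntimeWarning from train_utils.py:76 in the log above is the message NumPy prints when a division has no defined result. For a min-max normalization such as `(timg-timg.min())/(timg.max()-timg.min())`, one way to get it is a constant image, where the denominator is zero and the expression becomes 0/0. The thread does not confirm that this is what happened here. The following is only a pure-Python sketch of a guarded normalization; `safe_minmax` and its `eps` parameter are hypothetical names, not part of LASR.

```python
import math


def safe_minmax(values, eps=1e-12):
    """Min-max normalize a list of floats to the range [0, 1].

    Hypothetical helper, not part of LASR. Returns all zeros for a
    constant input and raises on NaN/inf input, instead of silently
    producing NaN.
    """
    if not values:
        return []
    if not all(math.isfinite(v) for v in values):
        raise ValueError("input contains NaN or inf")
    lo, hi = min(values), max(values)
    span = hi - lo
    if span < eps:
        # Constant input: the unguarded formula would compute 0/0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]
```

If a guard like this reports a constant or non-finite image, the input data would be the first thing to inspect, which matches the suggestion that the data may have been generated improperly.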

gengshan-y commented 3 years ago

Interesting, it stops at the first few iterations. This might be caused by improperly generated data. Were you able to run optimization for camel?

Update: I uploaded the pre-rendered spot data for you to compare.

Kana-alt commented 3 years ago

No, Camel's optimization also fails.

bash scripts/render_result.sh camel

Setting up model.. loading log/camel-5/pred_net_10.pth.. Traceback (most recent call last): File "extract.py", line 253, in app.run(main) File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/absl/app.py", line 303, in run _run_main(main, args) File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main sys.exit(main(argv)) File "extract.py", line 237, in main predictor = pred_util.MeshPredictor(opts) File "/home/kana/lasr/nnutils/predictor.py", line 60, in init self.load_network(self.model, 'pred', self.opts.num_train_epoch) File "/home/kana/lasr/nnutils/predictor.py", line 108, in load_network states = torch.load(save_path) File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/serialization.py", line 581, in load with _open_file_like(f, 'rb') as opened_file: File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/serialization.py", line 230, in _open_file_like return _open_file(name_or_buffer, mode) File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/serialization.py", line 211, in init super(_open_file, self).init(open(name, mode)) FileNotFoundError: [Errno 2] No such file or directory: 'log/camel-5/pred_net_10.pth' ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04) configuration: --prefix=/usr --extra-version=0ubuntu0.2 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine 
--enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared libavutil 55. 78.100 / 55. 78.100 libavcodec 57.107.100 / 57.107.100 libavformat 57. 83.100 / 57. 83.100 libavdevice 57. 10.100 / 57. 10.100 libavfilter 6.107.100 / 6.107.100 libavresample 3. 7. 0 / 3. 7. 0 libswscale 4. 8.100 / 4. 8.100 libswresample 2. 9.100 / 2. 9.100 libpostproc 54. 7.100 / 54. 7.100 [image2 @ 0x56294c2da8e0] Pattern type 'glob_sequence' is deprecated: use pattern_type 'glob' instead [image2 @ 0x56294c2da8e0] Could not open file : log/camel-5/render-.png [image2 @ 0x56294c2da8e0] Could not find codec parameters for stream 0 (Video: png, none(pc)): unspecified size Consider increasing the value for the 'analyzeduration' and 'probesize' options Input #0, image2, from 'log/camel-5/render-%.png': Duration: 00:00:00.04, start: 0.000000, bitrate: N/A Stream #0:0: Video: png, none(pc), 25 tbr, 25 tbn, 25 tbc Output #0, mp4, to 'log/camel-5/camel-camel-5-10.mp4': Output file #0 does not contain any stream ffmpeg version 3.4.8-0ubuntu0.2 Copyright (c) 2000-2020 the FFmpeg developers built with gcc 7 (Ubuntu 7.5.0-3ubuntu1~18.04) configuration: --prefix=/usr --extra-version=0ubuntu0.2 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --enable-gpl --disable-stripping --enable-avresample --enable-avisynth --enable-gnutls --enable-ladspa --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libmp3lame 
--enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librubberband --enable-librsvg --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvorbis --enable-libvpx --enable-libwavpack --enable-libwebp --enable-libx265 --enable-libxml2 --enable-libxvid --enable-libzmq --enable-libzvbi --enable-omx --enable-openal --enable-opengl --enable-sdl2 --enable-libdc1394 --enable-libdrm --enable-libiec61883 --enable-chromaprint --enable-frei0r --enable-libopencv --enable-libx264 --enable-shared libavutil 55. 78.100 / 55. 78.100 libavcodec 57.107.100 / 57.107.100 libavformat 57. 83.100 / 57. 83.100 libavdevice 57. 10.100 / 57. 10.100 libavfilter 6.107.100 / 6.107.100 libavresample 3. 7. 0 / 3. 7. 0 libswscale 4. 8.100 / 4. 8.100 libswresample 2. 9.100 / 2. 9.100 libpostproc 54. 7.100 / 54. 7.100 log/camel-5/camel-camel-5-10.mp4: No such file or directory log/camel-5/ camel/0 no mesh found camel/1 no mesh found camel/2 no mesh found camel/3 no mesh found camel/4 no mesh found camel/5 no mesh found camel/6 no mesh found camel/7 no mesh found camel/8 no mesh found camel/9 no mesh found camel/10 no mesh found camel/11 no mesh found camel/12 no mesh found camel/13 no mesh found camel/14 no mesh found camel/15 no mesh found camel/16 no mesh found camel/17 no mesh found camel/18 no mesh found camel/19 no mesh found camel/20 no mesh found camel/21 no mesh found camel/22 no mesh found camel/23 no mesh found camel/24 no mesh found camel/25 no mesh found camel/26 no mesh found camel/27 no mesh found camel/28 no mesh found camel/29 no mesh found camel/30 no mesh found camel/31 no mesh found camel/32 no mesh found camel/33 no mesh found camel/34 no mesh found camel/35 no mesh found camel/36 no mesh found camel/37 no mesh found camel/38 no mesh found camel/39 no mesh found camel/40 no mesh found camel/41 no mesh found camel/42 no mesh found camel/43 no mesh 
found camel/44 no mesh found camel/45 no mesh found camel/46 no mesh found camel/47 no mesh found camel/48 no mesh found camel/49 no mesh found camel/50 no mesh found camel/51 no mesh found camel/52 no mesh found camel/53 no mesh found camel/54 no mesh found camel/55 no mesh found camel/56 no mesh found camel/57 no mesh found camel/58 no mesh found camel/59 no mesh found camel/60 no mesh found camel/61 no mesh found camel/62 no mesh found camel/63 no mesh found camel/64 no mesh found camel/65 no mesh found camel/66 no mesh found camel/67 no mesh found camel/68 no mesh found camel/69 no mesh found camel/70 no mesh found camel/71 no mesh found camel/72 no mesh found camel/73 no mesh found camel/74 no mesh found camel/75 no mesh found camel/76 no mesh found camel/77 no mesh found camel/78 no mesh found camel/79 no mesh found camel/80 no mesh found camel/81 no mesh found camel/82 no mesh found camel/83 no mesh found camel/84 no mesh found camel/85 no mesh found camel/86 no mesh found camel/87 no mesh found camel/88 no mesh found database/DAVIS/JPEGImages/Full-Resolution/camel/00000.jpg Traceback (most recent call last): File "render_vis.py", line 292, in main() File "render_vis.py", line 207, in main refmesh = all_mesh[i] IndexError: list index out of range log/camel-5/ camel/0 no mesh found camel/1 no mesh found camel/2 no mesh found camel/3 no mesh found camel/4 no mesh found camel/5 no mesh found camel/6 no mesh found camel/7 no mesh found camel/8 no mesh found camel/9 no mesh found camel/10 no mesh found camel/11 no mesh found camel/12 no mesh found camel/13 no mesh found camel/14 no mesh found camel/15 no mesh found camel/16 no mesh found camel/17 no mesh found camel/18 no mesh found camel/19 no mesh found camel/20 no mesh found camel/21 no mesh found camel/22 no mesh found camel/23 no mesh found camel/24 no mesh found camel/25 no mesh found camel/26 no mesh found camel/27 no mesh found camel/28 no mesh found camel/29 no mesh found camel/30 no mesh found camel/31 
no mesh found camel/32 no mesh found camel/33 no mesh found camel/34 no mesh found camel/35 no mesh found camel/36 no mesh found camel/37 no mesh found camel/38 no mesh found camel/39 no mesh found camel/40 no mesh found camel/41 no mesh found camel/42 no mesh found camel/43 no mesh found camel/44 no mesh found camel/45 no mesh found camel/46 no mesh found camel/47 no mesh found camel/48 no mesh found camel/49 no mesh found camel/50 no mesh found camel/51 no mesh found camel/52 no mesh found camel/53 no mesh found camel/54 no mesh found camel/55 no mesh found camel/56 no mesh found camel/57 no mesh found camel/58 no mesh found camel/59 no mesh found camel/60 no mesh found camel/61 no mesh found camel/62 no mesh found camel/63 no mesh found camel/64 no mesh found camel/65 no mesh found camel/66 no mesh found camel/67 no mesh found camel/68 no mesh found camel/69 no mesh found camel/70 no mesh found camel/71 no mesh found camel/72 no mesh found camel/73 no mesh found camel/74 no mesh found camel/75 no mesh found camel/76 no mesh found camel/77 no mesh found camel/78 no mesh found camel/79 no mesh found camel/80 no mesh found camel/81 no mesh found camel/82 no mesh found camel/83 no mesh found camel/84 no mesh found camel/85 no mesh found camel/86 no mesh found camel/87 no mesh found camel/88 no mesh found 0 Traceback (most recent call last): File "render_vis.py", line 292, in main() File "render_vis.py", line 187, in main refmesh = all_mesh[0] IndexError: list index out of range log/camel-5/ camel/0 no mesh found camel/1 no mesh found camel/2 no mesh found camel/3 no mesh found camel/4 no mesh found camel/5 no mesh found camel/6 no mesh found camel/7 no mesh found camel/8 no mesh found camel/9 no mesh found camel/10 no mesh found camel/11 no mesh found camel/12 no mesh found camel/13 no mesh found camel/14 no mesh found camel/15 no mesh found camel/16 no mesh found camel/17 no mesh found camel/18 no mesh found camel/19 no mesh found camel/20 no mesh found camel/21 no 
mesh found camel/22 no mesh found camel/23 no mesh found camel/24 no mesh found camel/25 no mesh found camel/26 no mesh found camel/27 no mesh found camel/28 no mesh found camel/29 no mesh found camel/30 no mesh found camel/31 no mesh found camel/32 no mesh found camel/33 no mesh found camel/34 no mesh found camel/35 no mesh found camel/36 no mesh found camel/37 no mesh found camel/38 no mesh found camel/39 no mesh found camel/40 no mesh found camel/41 no mesh found camel/42 no mesh found camel/43 no mesh found camel/44 no mesh found camel/45 no mesh found camel/46 no mesh found camel/47 no mesh found camel/48 no mesh found camel/49 no mesh found camel/50 no mesh found camel/51 no mesh found camel/52 no mesh found camel/53 no mesh found camel/54 no mesh found camel/55 no mesh found camel/56 no mesh found camel/57 no mesh found camel/58 no mesh found camel/59 no mesh found camel/60 no mesh found camel/61 no mesh found camel/62 no mesh found camel/63 no mesh found camel/64 no mesh found camel/65 no mesh found camel/66 no mesh found camel/67 no mesh found camel/68 no mesh found camel/69 no mesh found camel/70 no mesh found camel/71 no mesh found camel/72 no mesh found camel/73 no mesh found camel/74 no mesh found camel/75 no mesh found camel/76 no mesh found camel/77 no mesh found camel/78 no mesh found camel/79 no mesh found camel/80 no mesh found camel/81 no mesh found camel/82 no mesh found camel/83 no mesh found camel/84 no mesh found camel/85 no mesh found camel/86 no mesh found camel/87 no mesh found camel/88 no mesh found database/DAVIS/JPEGImages/Full-Resolution/camel/00000.jpg Traceback (most recent call last): File "render_vis.py", line 292, in main() File "render_vis.py", line 207, in main refmesh = all_mesh[i] IndexError: list index out of range

gengshan-y commented 3 years ago

You attached the output of the rendering scripts. Do you have the output of the optimization script?

bash scripts/template.sh camel

Kana-alt commented 3 years ago

Failed to execute.

bash scripts/template.sh camel


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


Jitting Chamfer 3D
Jitting Chamfer 3D
Loaded JIT 3D CUDA chamfer distance
Loaded JIT 3D CUDA chamfer distance
Traceback (most recent call last):
  File "optimize.py", line 59, in <module>
    app.run(main)
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "optimize.py", line 42, in main
    torch.distributed.init_process_group(
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
  File "optimize.py", line 59, in <module>
    app.run(main)
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "optimize.py", line 40, in main
    torch.cuda.set_device(opts.local_rank)
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/cuda/__init__.py", line 263, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
Traceback (most recent call last):
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/kana/anaconda3/envs/lasr/bin/python', '-u', 'optimize.py', '--local_rank=1', '--name=camel-0', '--checkpoint_dir', 'log/', '--only_mean_sym', '--nouse_gtpose', '--subdivide', '3', '--n_bones', '21', '--n_hypo', '16', '--num_epochs', '20', '--dataname', 'rcamel', '--sil_path', 'none', '--ngpu', '2', '--batch_size', '2', '--opt_tex', 'yes']' returned non-zero exit status 1.

gengshan-y commented 3 years ago

This is related to PyTorch distributed data parallel. It seems a previous process hung and is still occupying the port (and likely GPU memory), so you'll need to force-kill it before launching a new process:

pkill -f xxx

where xxx is a substring of the command that hangs. See https://github.com/NVIDIA/tacotron2/issues/181#issuecomment-481607690
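
Editor's note: the optimization log contains two separate errors, "Address already in use" from one process and "CUDA error: invalid device ordinal" from the process launched with `--local_rank=1` under `--ngpu 2`. The second is the error PyTorch reports when `torch.cuda.set_device` receives an index that is not a visible GPU. Both conditions can be checked before launching. The sketch below is an assumption-laden illustration, not LASR code: `port_is_free` and `rank_is_valid` are hypothetical helpers, port 29500 is the default `--master_port` of `torch.distributed.launch` (the thread does not say whether the LASR scripts override it), and the device count is supplied by the caller, for example from `torch.cuda.device_count()`.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Usually EADDRINUSE: another process still holds the port.
            return False
    return True


def rank_is_valid(local_rank: int, device_count: int) -> bool:
    """Return True if local_rank indexes one of device_count GPUs."""
    return 0 <= local_rank < device_count


if __name__ == "__main__":
    # 29500 is an assumed port; use the one your launch script passes.
    print("port 29500 free:", port_is_free(29500))
    # device_count=1 is an assumed value for illustration only.
    print("rank 1 valid with 1 GPU:", rank_is_valid(1, 1))
```

If the port check fails, a leftover process is the likely holder, which is the case the `pkill -f` suggestion addresses. If the rank check fails, the GPU count passed to the launch script may not match the GPUs visible on the machine.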

gengshan-y commented 3 years ago

Overall, I'd suggest using the pre-rendered spot data and running the optimization again. The code was tested on different machines and I didn't observe similar issues. Let me know if it still produces a NaN loss.

Kana-alt commented 3 years ago

I ran the optimization again, using the pre-rendered data. However, I get the following.

bash scripts/spot3.sh
Jitting Chamfer 3D
Loaded JIT 3D CUDA chamfer distance
1/0

workers: 1

pairs: 1

init:0, end:-1 198 paris of images Only the mean shape is symmetric! found mean v found tex found ctl rotation found rest translation found ctl points found log ctl running k-means on cuda:0.. [running kmeans]: 17it [00:00, 341.07it/s, center_shift=0.000000, iteration=18, tol=0.000100]running k-means on cuda:0.. running k-means on cuda:0.. [running kmeans]: 13it [00:00, 337.02it/s, center_shift=0.000000, iteration=14, tol=0.000100]running k-means on cuda:0.. running k-means on cuda:0.. [running kmeans]: 13it [00:00, 335.13it/s, center_shift=0.000000, iteration=14, tol=0.000100]running k-means on cuda:0.. running k-means on cuda:0.. [running kmeans]: 20it [00:00, 340.19it/s, center_shift=0.000000, iteration=21, tol=0.000100]running k-means on cuda:0.. new bone locations scores:g kmeans]: 14it [00:00, 350.38it/s, center_shift=0.000000, iteration=15, tol=0.000100] tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0') selecting 0eans]: 13it [00:00, 340.97it/s, center_shift=0.000000, iteration=14, tol=0.000100] [running kmeans]: 18it [00:00, 44.70it/s, center_shift=0.000000, iteration=18, tol=0.000100] [running kmeans]: 14it [00:00, 39.79it/s, center_shift=0.000000, iteration=14, tol=0.000100] [running kmeans]: 14it [00:00, 44.92it/s, center_shift=0.000000, iteration=14, tol=0.000100] [running kmeans]: 21it [00:00, 77.39it/s, center_shift=0.000000, iteration=21, tol=0.000100] [running kmeans]: 15it [00:00, 71.10it/s, center_shift=0.000000, iteration=15, tol=0.000100] [running kmeans]: 14it [00:00, 82.74it/s, center_shift=0.000000, iteration=14, tol=0.000100] [running kmeans]: 19it [00:00, 146.67it/s, center_shift=0.000000, iteration=19, tol=0.000100] [running kmeans]: 26it [00:00, 347.56it/s, center_shift=0.000000, iteration=26, tol=0.000100] /home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/kornia/geometry/conversions.py:506: UserWarning: XYZW quaternion coefficient order is deprecated and will be removed after > 0.6. 
Please use QuaternionCoeffOrder.WXYZ instead.
  warnings.warn("XYZW quaternion coefficient order is deprecated and"
/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/nn/functional.py:3385: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn("Default grid_sample and affine_grid behavior has changed "

/home/kana/lasr3/nnutils/train_utils.py(295)train()
-> self.optimizer.step()
(Pdb)

Kana-alt commented 3 years ago

I'm sorry, I forgot to mention that I got the following error when building the Docker image. Could this be the cause?

Because of this error, I am using the "Build with conda" instructions to set up the environment.

sudo docker build --tag lasr:latest -f docker/Dockerfile ./
Sending build context to Docker daemon 20.41MB
Step 1/15 : FROM nvidia/cuda:11.0-devel-ubuntu18.04
 ---> d89f75c1799d
Step 2/15 : ENV CONDA_DIR /anaconda3
 ---> Using cache
 ---> 952b951d6c2c
Step 3/15 : COPY third_party/softras /workspace/softras
 ---> Using cache
 ---> a28e87572e82
Step 4/15 : COPY lasr.yml /workspace/lasr.yml
 ---> Using cache
 ---> 34095cc757ec
Step 5/15 : RUN apt-get update -q
 ---> Running in 6f3373585c02
Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease
Ign:2 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 InRelease
Err:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release
  Could not handshake: Error in the pull function. [IP: 152.199.39.144 443]
Err:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release
  Could not handshake: Error in the pull function. [IP: 152.199.39.144 443]
Err:5 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Connection failed [IP: 91.189.91.39 80]
Err:6 http://archive.ubuntu.com/ubuntu bionic InRelease
  Connection failed [IP: 91.189.88.142 80]
Err:7 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
  Connection failed [IP: 91.189.88.142 80]
Err:8 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
  Connection failed [IP: 91.189.88.142 80]
Reading package lists...
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release' does not have a Release file.
E: The repository 'https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release' does not have a Release file.
The command '/bin/sh -c apt-get update -q' returned a non-zero code: 100

gengshan-y commented 3 years ago

Since I'm not able to reproduce this issue, I can only give suggestions for debugging here.

To further debug why total_loss becomes nan, I would print out the values of the individual losses in the forward pass of the model and check where it becomes nan, specifically from mask_loss to the auxiliary losses.

Given these observations, my best guess is that some package versions are wrong. This is unlikely to happen with the conda environment. Please let me know if the conda build works for you.

Kana-alt commented 3 years ago

conda build is working properly.

Kana-alt commented 3 years ago

When I start again from "Data preparation", I see the following output after running auto_gen.py. Could this be related to the problem?

Traceback (most recent call last):
  File "preprocess/auto_gen.py", line 191, in <module>
    main()
  File "preprocess/auto_gen.py", line 155, in main
    print('%s/%s'%(test_left_img[inx],test_left_img[jnx]))
IndexError: list index out of range

gengshan-y commented 3 years ago

Were you able to run optimization on spot? You don't need to run auto_gen.py for optimization on spot.

Kana-alt commented 3 years ago

No, I get the same error as with camel.

gengshan-y commented 3 years ago

I'm still confused about how to reproduce your issue. If you plan to investigate further, please summarize the issue here or in a new thread. Please include your system hardware, your conda and pip package lists (if you installed with conda), and the steps to reproduce your issue.
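A minimal sketch of how part of such a summary could be collected with the Python standard library. The helper name `environment_report` and the package names in the example are only illustrations, not something the repository provides.

```python
import importlib.metadata
import platform
import sys


def environment_report(packages):
    """Return Python, OS, and installed versions for the given package names."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            report[name] = importlib.metadata.version(name)
        except importlib.metadata.PackageNotFoundError:
            # The package is not installed in the active environment.
            report[name] = "not installed"
    return report


if __name__ == "__main__":
    # Example package names only; adjust to the packages you care about.
    for key, value in environment_report(["torch", "torchvision", "pytorch3d"]).items():
        print(f"{key}: {value}")
```

Run it inside the activated conda environment so that it reports the packages the training script actually sees.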

Kana-alt commented 3 years ago

This is my experimental environment.

Nvidia Driver Version: 455.23.05
CUDA version: 11.0
Ubuntu 18.04.5

●conda list _libgcc_mutex 0.1 main conda-forge absl-py 0.12.0 pypi_0 pypi antlr4-python3-runtime 4.8 pypi_0 pypi appdirs 1.4.4 pypi_0 pypi argon2-cffi 20.1.0 py38h27cfd23_1
async_generator 1.10 pyhd3eb1b0_0
attrs 21.2.0 pyhd3eb1b0_0
backcall 0.2.0 pyhd3eb1b0_0
black 21.4b2 pypi_0 pypi blas 1.0 mkl conda-forge bleach 3.3.0 pyhd3eb1b0_0
ca-certificates 2021.5.25 h06a4308_1
cachetools 4.2.2 pypi_0 pypi certifi 2021.5.30 py38h06a4308_0
cffi 1.14.5 py38h261ae71_0
chardet 4.0.0 pypi_0 pypi click 8.0.1 pypi_0 pypi cloudpickle 1.6.0 pypi_0 pypi cudatoolkit 11.0.221 h6bb024c_0 anaconda cudatoolkit-dev 11.0.3 py38h7f98852_1 conda-forge cycler 0.10.0 pypi_0 pypi cython 0.29.23 pypi_0 pypi dbus 1.13.18 hb2f20db_0
decorator 5.0.9 pyhd3eb1b0_0
defusedxml 0.7.1 pyhd3eb1b0_0
entrypoints 0.3 py38_0
environment-kernels 1.1.1 pypi_0 pypi expat 2.4.1 h2531618_2
filelock 3.0.12 pypi_0 pypi fontconfig 2.13.1 h6c09931_0
freetype 2.10.4 h5ab3b9f_0 anaconda freetype-py 2.2.0 pypi_0 pypi future 0.18.2 pypi_0 pypi fvcore 0.1.5.post20210624 pypi_0 pypi gdown 3.13.0 pypi_0 pypi glib 2.68.2 h36276a3_0
google-auth 1.30.1 pypi_0 pypi google-auth-oauthlib 0.4.4 pypi_0 pypi grpcio 1.38.0 pypi_0 pypi gst-plugins-base 1.14.0 h8213a91_2
gstreamer 1.14.0 h28cd5cc_2
hydra-core 1.1.0 pypi_0 pypi icu 58.2 he6710b0_3
idna 2.10 pypi_0 pypi imageio 2.9.0 pypi_0 pypi importlib-metadata 3.10.0 py38h06a4308_0
importlib-resources 5.2.0 pypi_0 pypi importlib_metadata 3.10.0 hd3eb1b0_0
intel-openmp 2021.2.0 h06a4308_610
iopath 0.1.8 py38 iopath ipykernel 5.3.4 py38h5ca1d4c_0
ipython 7.22.0 py38hb070fc8_0
ipython_genutils 0.2.0 pyhd3eb1b0_1
ipywidgets 7.6.3 pyhd3eb1b0_1
jedi 0.17.0 py38_0
jinja2 3.0.1 pyhd3eb1b0_0
jpeg 9b h024ee3a_2
jsonschema 3.2.0 py_2
jupyter 1.0.0 py38_7
jupyter_client 6.1.12 pyhd3eb1b0_0
jupyter_console 6.4.0 pyhd3eb1b0_0
jupyter_core 4.7.1 py38h06a4308_0
jupyterlab_pygments 0.1.2 py_0
jupyterlab_widgets 1.0.0 pyhd3eb1b0_1
kiwisolver 1.3.1 pypi_0 pypi kmeans-pytorch 0.3 pypi_0 pypi kornia 0.5.3 pypi_0 pypi lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.33.1 h53a641e_7 conda-forge libffi 3.3 he6710b0_2 anaconda libgcc-ng 9.1.0 hdf63c60_0 anaconda libpng 1.6.37 hbc83047_0 anaconda libsodium 1.0.18 h7b6447c_0
libstdcxx-ng 9.1.0 hdf63c60_0 anaconda libtiff 4.2.0 h85742a9_0
libuuid 1.0.3 h1bed415_2
libuv 1.40.0 h7b6447c_0 anaconda libwebp-base 1.2.0 h27cfd23_0
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 hb55368b_3
lz4-c 1.9.3 h2531618_0
markdown 3.3.4 pypi_0 pypi markupsafe 2.0.1 py38h27cfd23_0
matplotlib 3.4.2 pypi_0 pypi mistune 0.8.4 py38h7b6447c_1000
mkl 2021.2.0 h06a4308_296
mkl-service 2.3.0 py38h27cfd23_1
mkl_fft 1.3.0 py38h42c9631_2
mkl_random 1.2.1 py38ha9443f7_2
mypy-extensions 0.4.3 pypi_0 pypi nbclient 0.5.3 pyhd3eb1b0_0
nbconvert 6.1.0 py38h06a4308_0
nbformat 5.1.3 pyhd3eb1b0_0
ncurses 6.2 he6710b0_1 anaconda nest-asyncio 1.5.1 pyhd3eb1b0_0
networkx 2.6rc1 pypi_0 pypi ninja 1.10.2 hff7bd54_1
notebook 6.4.0 py38h06a4308_0
numpy 1.20.2 py38h2d18471_0
numpy-base 1.20.2 py38hfae3a4d_0
oauthlib 3.1.1 pypi_0 pypi olefile 0.46 py_0 conda-forge omegaconf 2.1.0 pypi_0 pypi opencv-python 4.4.0.46 pypi_0 pypi openssl 1.1.1k h27cfd23_0
packaging 20.9 pyhd3eb1b0_0
pandas 1.2.4 pypi_0 pypi pandocfilters 1.4.3 py38h06a4308_1
parso 0.8.2 pyhd3eb1b0_0
pathspec 0.8.1 pypi_0 pypi pcre 8.45 h295c915_0
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 8.2.0 py38he98fc37_0
pip 21.1.1 py38h06a4308_0
portalocker 1.7.0 py38h578d9bd_1 conda-forge prometheus_client 0.11.0 pyhd3eb1b0_0
prompt-toolkit 3.0.17 pyh06a4308_0
prompt_toolkit 3.0.17 hd3eb1b0_0
protobuf 3.17.2 pypi_0 pypi ptyprocess 0.7.0 pyhd3eb1b0_2
pyasn1 0.4.8 pypi_0 pypi pyasn1-modules 0.2.8 pypi_0 pypi pycocotools 2.0.2 pypi_0 pypi pycparser 2.20 py_2
pydot 1.4.2 pypi_0 pypi pyglet 1.5.17 pypi_0 pypi pygments 2.9.0 pyhd3eb1b0_0
pyopengl 3.1.0 pypi_0 pypi pyparsing 3.0.0b2 pypi_0 pypi pypng 0.0.20 pypi_0 pypi pyqt 5.9.2 py38h05f1152_4
pyrender 0.1.45 pypi_0 pypi pyrsistent 0.17.3 py38h7b6447c_0
pysocks 1.7.1 pypi_0 pypi python 3.8.10 hdb3f193_7
python-dateutil 2.8.1 pyhd3eb1b0_0
python_abi 3.8 1_cp38 conda-forge pytorch 1.7.1 py3.8_cuda11.0.221_cudnn8.0.5_0 pytorch pytorch3d 0.4.0 py38_cu110_pyt171 pytorch3d pytz 2021.1 pypi_0 pypi pywavelets 1.1.1 pypi_0 pypi pyyaml 5.3.1 py38h8df0ef7_1 conda-forge pyzmq 20.0.0 py38h2531618_1
qt 5.9.7 h5867ecd_1
qtconsole 5.1.0 pyhd3eb1b0_0
qtpy 1.9.0 py_0
readline 8.1 h27cfd23_0
regex 2021.4.4 pypi_0 pypi requests 2.25.1 pypi_0 pypi requests-oauthlib 1.3.0 pypi_0 pypi rsa 4.7.2 pypi_0 pypi scikit-image 0.18.2rc2 pypi_0 pypi scipy 1.6.3 pypi_0 pypi send2trash 1.5.0 pyhd3eb1b0_1
setuptools 52.0.0 py38h06a4308_0
sip 4.19.13 py38he6710b0_0
six 1.15.0 py38h06a4308_0
soft-renderer 1.0.0 pypi_0 pypi sqlite 3.35.4 hdfb4753_0
tabulate 0.8.9 pyhd8ed1ab_0 conda-forge tensorboard 2.5.0 pypi_0 pypi tensorboard-data-server 0.6.1 pypi_0 pypi tensorboard-plugin-wit 1.8.0 pypi_0 pypi termcolor 1.1.0 py_2 conda-forge terminado 0.9.4 py38h06a4308_0
testpath 0.5.0 pyhd3eb1b0_0
tifffile 2021.4.8 pypi_0 pypi tk 8.6.10 hbc83047_0 anaconda toml 0.10.2 pypi_0 pypi torchvision 0.8.2 py38_cu110 pytorch tornado 6.1 py38h27cfd23_0
tqdm 4.61.0 pyhd8ed1ab_0 conda-forge traitlets 5.0.5 pyhd3eb1b0_0
trimesh 3.9.20 pypi_0 pypi typing_extensions 3.7.4.3 pyha847dfd_0
urllib3 1.26.5 pypi_0 pypi wcwidth 0.2.5 py_0
webencodings 0.5.1 py38_1
werkzeug 2.0.1 pypi_0 pypi wheel 0.36.2 pyhd3eb1b0_0
widgetsnbextension 3.5.1 py38_0
xz 5.2.5 h7b6447c_0 anaconda yacs 0.1.6 py_0 conda-forge yaml 0.2.5 h516909a_0 conda-forge zeromq 4.3.4 h2531618_0
zipp 3.4.1 pyhd3eb1b0_0
zlib 1.2.11 h7b6447c_3 anaconda zstd 1.4.9 haebb681_0

●pip list Package Version


absl-py 0.12.0 antlr4-python3-runtime 4.8 appdirs 1.4.4 argon2-cffi 20.1.0 async-generator 1.10 attrs 21.2.0 backcall 0.2.0 black 21.4b2 bleach 3.3.0 cachetools 4.2.2 certifi 2021.5.30 cffi 1.14.5 chardet 4.0.0 click 8.0.1 cloudpickle 1.6.0 cycler 0.10.0 Cython 0.29.23 decorator 5.0.9 defusedxml 0.7.1 entrypoints 0.3 environment-kernels 1.1.1 filelock 3.0.12 freetype-py 2.2.0 future 0.18.2 fvcore 0.1.5.post20210624 gdown 3.13.0 google-auth 1.30.1 google-auth-oauthlib 0.4.4 grpcio 1.38.0 hydra-core 1.1.0 idna 2.10 imageio 2.9.0 importlib-metadata 3.10.0 importlib-resources 5.2.0 iopath 0.1.8 ipykernel 5.3.4 ipython 7.22.0 ipython-genutils 0.2.0 ipywidgets 7.6.3 jedi 0.17.0 Jinja2 3.0.1 jsonschema 3.2.0 jupyter 1.0.0 jupyter-client 6.1.12 jupyter-console 6.4.0 jupyter-core 4.7.1 jupyterlab-pygments 0.1.2 jupyterlab-widgets 1.0.0 kiwisolver 1.3.1 kmeans-pytorch 0.3 kornia 0.5.3 Markdown 3.3.4 MarkupSafe 2.0.1 matplotlib 3.4.2 mistune 0.8.4 mkl-fft 1.3.0 mkl-random 1.2.1 mkl-service 2.3.0 mypy-extensions 0.4.3 nbclient 0.5.3 nbconvert 6.1.0 nbformat 5.1.3 nest-asyncio 1.5.1 networkx 2.6rc1 notebook 6.4.0 numpy 1.20.2 oauthlib 3.1.1 olefile 0.46 omegaconf 2.1.0 opencv-python 4.4.0.46 packaging 20.9 pandas 1.2.4 pandocfilters 1.4.3 parso 0.8.2 pathspec 0.8.1 pexpect 4.8.0 pickleshare 0.7.5 Pillow 8.2.0 pip 21.1.1 portalocker 1.7.0 prometheus-client 0.11.0 prompt-toolkit 3.0.17 protobuf 3.17.2 ptyprocess 0.7.0 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycocotools 2.0.2 pycparser 2.20 pydot 1.4.2 pyglet 1.5.17 Pygments 2.9.0 PyOpenGL 3.1.0 pyparsing 3.0.0b2 pypng 0.0.20 pyrender 0.1.45 pyrsistent 0.17.3 PySocks 1.7.1 python-dateutil 2.8.1 pytorch3d 0.4.0 pytz 2021.1 PyWavelets 1.1.1 PyYAML 5.3.1 pyzmq 20.0.0 qtconsole 5.1.0 QtPy 1.9.0 regex 2021.4.4 requests 2.25.1 requests-oauthlib 1.3.0 rsa 4.7.2 scikit-image 0.18.2rc2 scipy 1.6.3 Send2Trash 1.5.0 setuptools 52.0.0.post20210125 sip 4.19.13 six 1.15.0 soft-renderer 1.0.0 tabulate 0.8.9 tensorboard 2.5.0 tensorboard-data-server 
0.6.1 tensorboard-plugin-wit 1.8.0 termcolor 1.1.0 terminado 0.9.4 testpath 0.5.0 tifffile 2021.4.8 toml 0.10.2 torch 1.7.1 torchvision 0.8.2 tornado 6.1 tqdm 4.61.0 traitlets 5.0.5 trimesh 3.9.20 typing-extensions 3.7.4.3 urllib3 1.26.5 wcwidth 0.2.5 webencodings 0.5.1 Werkzeug 2.0.1 wheel 0.36.2 widgetsnbextension 3.5.1 yacs 0.1.6 zipp 3.4.1

gengshan-y commented 3 years ago

Those look correct. Which GPU were you using?

Kana-alt commented 3 years ago

I use TITAN X (Pascal)

gengshan-y commented 3 years ago

OK, I've validated the code on almost the same setup as yours. If the same error persists, the best suggestion I can give is to print out the variable self.total_loss in nnutils/mesh_net.py, as mentioned earlier, and see where it becomes nan. For example, add print(self.total_loss) after this line. It should be a valid number if everything is correct.

Kana-alt commented 3 years ago

I inserted print(self.total_loss) and ran bash scripts/spot3.sh. When I continue in PDB mode, I get the following output.

/home/shiori/lasr/nnutils/train_utils.py(295)train()
-> self.optimizer.step()
(Pdb) c
/home/shiori/lasr/nnutils/train_utils.py:76: RuntimeWarning: invalid value encountered in true_divide
  timg = (timg-timg.min())/(timg.max()-timg.min())
tensor(0.1620, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
/home/shiori/lasr/nnutils/train_utils.py(294)train()
-> pdb.set_trace()
(Pdb) c
tensor(0.1632, device='cuda:0', grad_fn=)
/home/shiori/lasr/nnutils/train_utils.py(295)train()
-> self.optimizer.step()
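
The RuntimeWarning in this output points at the min-max normalization `timg = (timg-timg.min())/(timg.max()-timg.min())`. NumPy typically reports `invalid value encountered in true_divide` for a 0/0 division, which happens here when every pixel of the image has the same value. Below is a minimal sketch of a guarded normalization in plain Python. The names `safe_minmax` and `eps` are hypothetical and not part of the repository, and this thread does not establish whether the warning is related to the nan loss.

```python
def safe_minmax(values, eps=1e-8):
    """Scale values to [0, 1]; return all zeros when the range is (almost) zero."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span < eps:
        # A constant input would otherwise produce 0/0 for every element.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


print(safe_minmax([0.0, 5.0, 10.0]))  # regular case
print(safe_minmax([2.0, 2.0, 2.0]))   # constant input, no division by zero
```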

gengshan-y commented 3 years ago

Thanks. The code block at lines 382-522 of nnutils/mesh_net.py sequentially computes and adds the following losses: (1) mask loss, (2) flow loss, (3) rgb loss, (4) shape loss, (5) deformation loss, (6) bone symmetry loss, (7) camera loss, (8) auxiliary losses.

Given the information you provided, self.total_loss is valid after (1) but becomes nan after (8). To figure out when it becomes nan, you could print out the value of self.total_loss after adding each loss.
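
As a rough illustration of that procedure (not the code in nnutils/mesh_net.py), the sketch below adds named loss terms in order and reports the first one after which the running total is nan. The helper name `first_nan_step` and the example values are hypothetical.

```python
import math


def first_nan_step(named_losses):
    """Add losses in order and return the name of the first term after which
    the running total is NaN, or None if the total stays valid."""
    total = 0.0
    for name, value in named_losses:
        total += float(value)  # float() also accepts single-element tensors
        if math.isnan(total):
            return name
    return None


# Illustrative values only; they are not taken from a real run.
losses = [("mask", 0.162), ("flow", 0.0), ("rgb", float("nan")), ("shape", 0.01)]
print(first_nan_step(losses))  # prints: rgb
```

Once the total is nan it stays nan, so only the first reported term is informative.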

Kana-alt commented 3 years ago

Thank you. It becomes nan starting from the output of (3).

/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/nn/functional.py:3385: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn("Default grid_sample and affine_grid behavior has changed "
tensor(nan, device='cuda:0', grad_fn=)

/home/kana/lasr/nnutils/train_utils.py(295)train()
-> self.optimizer.step()
(Pdb) c
/home/kana/lasr/nnutils/train_utils.py:76: RuntimeWarning: invalid value encountered in true_divide
  timg = (timg-timg.min())/(timg.max()-timg.min())
tensor(0.1620, device='cuda:0', grad_fn=)
tensor(0.1620, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
tensor(nan, device='cuda:0', grad_fn=)
/home/kana/lasr/nnutils/train_utils.py(294)train()
-> pdb.set_trace()

gengshan-y commented 3 years ago

Hi, are you able to localize further which line produces the invalid value that makes self.total_loss become nan?

Kana-alt commented 3 years ago

The output of self.texture_loss_sub and self.texture_loss is as follows.

self.texture_loss_sub : tensor([[   nan,    nan,    nan,    nan,    nan,    nan, 0.1660, 0.1662],
        [0.1692, 0.1693, 0.1693, 0.1695, 0.1692, 0.1693, 0.1694, 0.1695]], device='cuda:0', grad_fn=)
self.texture_loss : tensor(nan, device='cuda:0', grad_fn=)
texture_loss: tensor(nan, device='cuda:0', grad_fn=)

gengshan-y commented 3 years ago

Thanks. self.texture_loss_sub is the sum of (1) the rgb loss and (2) the perceptual loss. Printing both terms would help to localize the problem further.
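
One way to compare the two terms is to list which entries of each are nan, which shows whether both fail on the same samples. This is a minimal sketch in plain Python; `nan_indices` is a hypothetical helper and the values are only illustrative.

```python
import math


def nan_indices(values):
    """Return the positions of NaN entries in a flat sequence of numbers."""
    return [i for i, v in enumerate(values) if math.isnan(float(v))]


# Illustrative per-sample values; replace them with the printed loss terms.
rgb = [float("nan"), float("nan"), 0.2586, 0.2601]
percept = [float("nan"), float("nan"), 2.1181, 2.1905]
print(nan_indices(rgb), nan_indices(percept))
```

For a PyTorch tensor, passing `tensor.flatten().tolist()` to the helper gives the same kind of listing.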

Kana-alt commented 3 years ago

Thank you. I got the following output.

rgb_loss: tensor(nan, device='cuda:0', grad_fn=)
rgb_loss: tensor(nan, device='cuda:0', grad_fn=)
rgb_loss: tensor(nan, device='cuda:0', grad_fn=)
rgb_loss: tensor(nan, device='cuda:0', grad_fn=)
rgb_loss: tensor(nan, device='cuda:0', grad_fn=)
rgb_loss: tensor(nan, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2586, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2586, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2601, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2601, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2601, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2601, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2601, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2601, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2601, device='cuda:0', grad_fn=)
rgb_loss: tensor(0.2601, device='cuda:0', grad_fn=)

percept_loss: tensor([   nan,    nan,    nan,    nan,    nan,    nan, 2.1181, 2.1905,
        2.1126, 2.1569, 2.1562, 2.1891, 2.1188, 2.1328, 2.2336, 2.2318,
           nan,    nan,    nan,    nan,    nan,    nan, 2.0471, 2.0863,
        1.9113, 1.9823, 1.9965, 2.0729, 1.9108, 1.9763, 1.9714, 2.0456], device='cuda:0', grad_fn=)

gengshan-y commented 3 years ago

I'm not sure what happens. Could you add these lines before computing perceptual loss

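# assumes numpy is already imported as np in this file
data_save = {}
data_save['obspair'] = obspair.detach().cpu().numpy()
data_save['rndpair'] = rndpair.detach().cpu().numpy()
data_save['verts_pre'] = verts_pre.detach().cpu().numpy()
data_save['faces'] = faces.detach().cpu().numpy()
data_save['tex'] = tex.detach().cpu().numpy()
# np.save pickles the dict; load it back with np.load('./data.npy', allow_pickle=True).item()
np.save('./data.npy', data_save)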

and share the saved data.npy file in the current folder with me? I will check whether it's a rendering problem or data loading problem.

Kana-alt commented 3 years ago

Since I don't know how to share npy files, the output is given below.

{'obspair': array([[[[0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], ..., [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.], [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],

   [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],

   [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],

   ...,

   [[[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]]],

   [[[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]]],

   [[[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]]]], dtype=float32), 'rndpair': array([[[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],

   [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],

   [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]],

    [[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],

   ...,

   [[[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]]],

   [[[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]]],

   [[[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]],

    [[1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     ...,
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.],
     [1., 1., 1., ..., 1., 1., 1.]]]], dtype=float32), 'verts_pre': array([[[  0.03669876,  -0.53712124,  19.928108  ],
    [  0.11251941,  -0.13512413,   9.785472  ],
    [ 11.271721  ,  -0.02641002,   8.839373  ],
    ...,
    [  1.0225617 , -11.714436  ,   8.412083  ],
    [ -0.10530925,  -0.42463654,  20.043804  ],
    [ -0.16680032,   0.17549148,   9.568409  ]],

   [[ 12.358375  ,   0.38589263,   8.524963  ],
    [  0.8437464 , -12.515985  ,   8.428447  ],
    [ -0.23052686,  -0.41224524,  20.325195  ],
    ...,
    [ -0.4429301 ,   0.1580073 ,   9.713904  ],
    [ 12.616431  ,   0.42210066,   8.628665  ],
    [  0.9909441 , -12.903456  ,   8.391498  ]],

   [[ -0.18003827,  -0.4161254 ,  19.86953   ],
    [ -0.44757855,   0.15658699,   9.562622  ],
    [ 11.271862  ,   0.41678354,   8.536382  ],
    ...,
    [  1.0186372 , -11.691986  ,   8.67606   ],
    [ -0.1390848 ,  -0.5128202 ,  20.368404  ],
    [ -0.08979135,   0.0807905 ,   9.777859  ]],

   ...,

   [[ 13.212509  , -14.096129  ,   8.077927  ],
    [ -0.12603344, -23.501102  ,   9.248735  ],
    [ -0.13886106,  -6.121924  ,  19.605312  ],
    ...,
    [  1.4911414 , -12.682963  ,   8.095443  ],
    [ 13.268127  , -13.988748  ,   8.102279  ],
    [  0.13536483, -23.306433  ,   9.242769  ]],

   [[ -0.31965658,  -5.2719135 ,  19.70742   ],
    [  0.8037931 , -10.754548  ,   8.323112  ],
    [ 10.802569  , -11.863282  ,   8.329947  ],
    ...,
    [ -0.17284162, -19.642536  ,   9.535197  ],
    [ -0.3779588 ,  -5.50803   ,  19.628155  ],
    [  0.5156373 , -10.347673  ,   8.8016    ]],

   [[ 10.569969  , -11.977585  ,   8.30729   ],
    [ -0.1654361 , -19.899996  ,   9.459713  ],
    [ -0.3769737 ,  -5.5397005 ,  19.55267   ],
    ...,
    [  0.62245774, -11.1361065 ,   8.21687   ],
    [ 10.590989  , -11.898343  ,   8.455078  ],
    [ -0.59070694, -20.045033  ,   9.4779825 ]]], dtype=float32), 'faces': array([[[105,  16, 410],
    [  8, 105, 410],
    [ 16, 413, 108],
    ...,
    [352, 546, 565],
    [391, 561, 546],
    [395, 561, 565]],

   [[105,  16, 410],
    [  8, 105, 410],
    [ 16, 413, 108],
    ...,
    [352, 546, 565],
    [391, 561, 546],
    [395, 561, 565]]]), 'tex': array([[[0.28400886, 0.39523867, 0.06636933],
    [0.53623235, 0.4698201 , 0.3589957 ],
    [0.34913328, 0.41863316, 0.2545403 ],
    ...,
    [0.75019354, 0.47404566, 0.6776529 ],
    [0.68552405, 0.5935845 , 0.2544528 ],
    [0.20894824, 0.70933825, 0.37567273]],

   [[0.28400886, 0.39523867, 0.06636933],
    [0.53623235, 0.4698201 , 0.3589957 ],
    [0.34913328, 0.41863316, 0.2545403 ],
    ...,
    [0.75019354, 0.47404566, 0.6776529 ],
    [0.68552405, 0.5935845 , 0.2544528 ],
    [0.20894824, 0.70933825, 0.37567273]],

   [[0.28400886, 0.39523867, 0.06636933],
    [0.53623235, 0.4698201 , 0.3589957 ],
    [0.34913328, 0.41863316, 0.2545403 ],
    ...,
    [0.75019354, 0.47404566, 0.6776529 ],
    [0.68552405, 0.5935845 , 0.2544528 ],
    [0.20894824, 0.70933825, 0.37567273]],

   ...,

   [[0.28400886, 0.39523867, 0.06636933],
    [0.53623235, 0.4698201 , 0.3589957 ],
    [0.34913328, 0.41863316, 0.2545403 ],
    ...,
    [0.75019354, 0.47404566, 0.6776529 ],
    [0.68552405, 0.5935845 , 0.2544528 ],
    [0.20894824, 0.70933825, 0.37567273]],

   [[0.28400886, 0.39523867, 0.06636933],
    [0.53623235, 0.4698201 , 0.3589957 ],
    [0.34913328, 0.41863316, 0.2545403 ],
    ...,
    [0.75019354, 0.47404566, 0.6776529 ],
    [0.68552405, 0.5935845 , 0.2544528 ],
    [0.20894824, 0.70933825, 0.37567273]],

   [[0.28400886, 0.39523867, 0.06636933],
    [0.53623235, 0.4698201 , 0.3589957 ],
    [0.34913328, 0.41863316, 0.2545403 ],
    ...,
    [0.75019354, 0.47404566, 0.6776529 ],
    [0.68552405, 0.5935845 , 0.2544528 ],
    [0.20894824, 0.70933825, 0.37567273]]], dtype=float32)}
gengshan-y commented 3 years ago

Can you share through google drive or email (gengshany@cmu.edu)?

Kana-alt commented 3 years ago

OK. The email has been sent.

gengshan-y commented 3 years ago

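Hi, the problem I found is that the variable "verts_pre" (https://github.com/google/lasr/blob/a22780b533079befd29d0bc5e0f9ca8b95f43873/nnutils/mesh_net.py#L347) contains identical values. To be more specific, verts_pre is the Nx3 matrix where each row is the view space (x,y,z) coordinate of a mesh vertex.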
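
One way to check a vertex array for this kind of collapse is to count its distinct rows. This is a minimal sketch in plain Python, assuming the vertices are available as a list of (x, y, z) rows; the helper name `count_distinct_vertices` and the rounding tolerance are hypothetical.

```python
def count_distinct_vertices(verts, decimals=6):
    """Count distinct (x, y, z) rows after rounding to `decimals` places."""
    return len({tuple(round(float(c), decimals) for c in row) for row in verts})


# Illustrative data: three rows, two of which coincide.
verts = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
print(count_distinct_vertices(verts), "distinct rows out of", len(verts))
```

A count far below the number of vertices would suggest that many vertices were mapped to the same position.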

I don't know which part went wrong during the transformation from rest shape coordinate to view space. Could you share more information as follows?

import json
data_save = {}
data_save['pred_v'] = pred_v.detach().cpu().numpy().tolist()  # rest
data_save['verts_tex'] = verts_tex.detach().cpu().numpy().tolist()  # camera
data_save['verts_pre'] = verts_pre.detach().cpu().numpy().tolist()  # view
data_save['offset'] = offset.detach().cpu().numpy().tolist()
with open('data.json', 'w') as f:
    json.dump(data_save, f)

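
Before sharing, the saved file can be sanity-checked with a short script. The sketch below reads a JSON file written this way and reports, for each key, the nested list shape and whether any nan is present. The function names are hypothetical. It relies on Python's json module writing and reading `NaN` by default.

```python
import json
import math


def nested_shape(value):
    """Return the shape of a regularly nested list, as NumPy would report it."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        value = value[0]
    return tuple(shape)


def has_nan(value):
    """Return True if any number inside a nested list is NaN."""
    if isinstance(value, list):
        return any(has_nan(item) for item in value)
    return isinstance(value, float) and math.isnan(value)


def summarize(path):
    """Map each key of the JSON file to its (shape, contains-NaN) pair."""
    with open(path) as f:
        data = json.load(f)
    return {key: (nested_shape(val), has_nan(val)) for key, val in data.items()}
```

Calling `summarize('data.json')` returns a dict that can be printed and compared against the variables listed above.
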
Kana-alt commented 3 years ago

Thank you. I will share the information you mentioned.

https://drive.google.com/file/d/1fZXvDXO_QE6A2yGhdLH9h4AvWkHj3MoU/view?usp=sharing

gengshan-y commented 3 years ago

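Thanks, we can narrow the problem down to this block: https://github.com/google/lasr/blob/a22780b533079befd29d0bc5e0f9ca8b95f43873/nnutils/mesh_net.py#L343-L345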

The output after blend skinning and camera projection was wrong. I guess that's due to wrong transformation and camera parameters.

I will need the following information to further look into it.

import json
data_save = {}
data_save['Rmat_tex'] = Rmat_tex.detach().cpu().numpy().tolist()  # rotations
data_save['quat'] = quat.detach().cpu().numpy().tolist()  # rotations
data_save['Tmat'] = Tmat.detach().cpu().numpy().tolist()  # translation xyz
data_save['trans'] = trans.detach().cpu().numpy().tolist()  # translation xy
data_save['depth'] = depth.detach().cpu().numpy().tolist()  # translation z
data_save['ppoint'] = ppoint.detach().cpu().numpy().tolist()  # principal point
data_save['scale'] = scale.detach().cpu().numpy().tolist()  # focal length
data_save['cams'] = self.cams.detach().cpu().numpy().tolist()  # camera calibration
data_save['pp'] = self.pp.detach().cpu().numpy().tolist()  # camera calibration
with open('data.json', 'w') as f:
    json.dump(data_save, f)

Can you also share the checkpoint file log/spot3-0/pred_net_0.pth?

Kana-alt commented 3 years ago

Thank you.

I will share the two files below.

https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing

gengshan-y commented 3 years ago

Thanks, the rotation, translation, focal length, and principal points all look correct. So it must be a problem with the skinning weights. However, I'm still not able to reproduce the problem, even with the checkpoint you provided.

Could you also provide skinning weights?

import json
data_save = {}
data_save['skin'] =  skin.detach().cpu().numpy().tolist() # skinning weights
data_save['ctl_ts'] = self.ctl_ts.detach().cpu().numpy().tolist() # bone centroid
data_save['ctl_rs'] = self.ctl_rs.detach().cpu().numpy().tolist() # bone orientation
data_save['log_ctl'] = self.log_ctl.detach().cpu().numpy().tolist() # bone scale
with open('data.json', 'w') as f: json.dump(data_save, f)
Kana-alt commented 3 years ago

Thank you. I have shared the file.

https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing

gengshan-y commented 3 years ago

The file does not contain all the variables listed. Could you check the updated code and share again?

Kana-alt commented 3 years ago

I'm sorry. I've updated it.

https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing

gengshan-y commented 3 years ago

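The rest bone locations, orientations, and scales all look correct. I guess it is a numerical issue. Could you try replacing this line https://github.com/google/lasr/blob/a22780b533079befd29d0bc5e0f9ca8b95f43873/nnutils/mesh_net.py#L260 with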

skin = (-10 * dis_norm.sum(3)).double().softmax(1).float()[:,:,:,None] # h,j,n,1
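
For readers following along, here is a minimal NumPy sketch of what the replacement line computes: a softmax over the bone axis of -10 times the summed distance, evaluated in double precision and cast back to single precision. The function name `skin_weights`, the `temperature` argument, and the random input are made up for illustration; the actual code operates on PyTorch tensors inside mesh_net.py.

```python
import numpy as np

def skin_weights(dis_norm, temperature=10.0):
    """Hypothetical NumPy stand-in for the suggested line.

    dis_norm: array of shape (H, J, N, 3); returns weights of shape (H, J, N, 1).
    """
    # Sum over the last axis, negate and scale, then switch to float64.
    logits = (-temperature * dis_norm.sum(axis=3)).astype(np.float64)  # H,J,N
    # Subtract the per-vertex maximum so that exp() cannot overflow.
    logits -= logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    # Normalize over the bone axis (axis 1), as softmax(1) does.
    weights /= weights.sum(axis=1, keepdims=True)
    return weights.astype(np.float32)[:, :, :, None]  # H,J,N,1

rng = np.random.default_rng(0)
dis_norm = rng.random((1, 4, 6, 3)).astype(np.float32)  # illustrative input
skin = skin_weights(dis_norm)
skin_zero = skin_weights(np.zeros((1, 4, 6, 3), dtype=np.float32))
print(skin.shape, float(skin.sum(axis=1).max()))
```

In this sketch the weights for each vertex sum to one over the bones, and an all-zero input gives uniform weights of 1/J.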

If it still does not fix the issue, can you share with me the following?

import json
data_save = {}
data_save['skin'] =  skin.detach().cpu().numpy().tolist() # skinning weights
data_save['ctl_ts'] = self.ctl_ts.detach().cpu().numpy().tolist() # bone centroid
data_save['ctl_rs'] = self.ctl_rs.detach().cpu().numpy().tolist() # bone orientation
data_save['log_ctl'] = self.log_ctl.detach().cpu().numpy().tolist() # bone scale
data_save['dis_norm'] = dis_norm.detach().cpu().numpy().tolist() # mahalanobis distance
with open('data.json', 'w') as f: json.dump(data_save, f)
Kana-alt commented 3 years ago

Thank you.

I changed the code and ran it. However, it did not solve the problem.

https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing

gengshan-y commented 3 years ago

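What about replacing this line https://github.com/google/lasr/blob/a22780b533079befd29d0bc5e0f9ca8b95f43873/nnutils/mesh_net.py#L258 with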

            from pytorch3d import transforms
            ctl_rs = torch.cat([self.ctl_rs[:,3:4], self.ctl_rs[:,:3]],-1)
            dis_norm = dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3)) # h,j,n,3
Kana-alt commented 3 years ago

Thank you. The result did not change. https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing

gengshan-y commented 3 years ago

Something is apparently wrong in these lines that compute the skinning weights, because you are getting dis_norm = 0.

dis_norm is the Mahalanobis distance, computed by (1) subtracting the vertex position from the bone position, (2) rotation, and (3) scaling.

Can you debug a little and let me know at which step dis_norm becomes zero?
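
The three steps above can be sketched in NumPy as follows, returning each intermediate so it is easy to see where the values turn to zero. This is only an illustration under assumed shapes (H hypotheses, J bones, N vertices); the function name `bone_vertex_distance` and the random inputs are hypothetical, and the real code uses PyTorch tensors.

```python
import numpy as np

def bone_vertex_distance(ctl_ts, verts, rotmats, log_ctl):
    """Hypothetical helper mirroring steps (1)-(3).

    ctl_ts: (H, J, 3) bone centers, verts: (H, N, 3) vertices,
    rotmats: (H, J, 3, 3) bone rotations, log_ctl: (H, J, 3) log scales.
    """
    # (1) bone position minus vertex position, broadcast to (H, J, N, 3)
    step1 = ctl_ts[:, :, None, :] - verts[:, None, :, :]
    # (2) rotate each (N, 3) block by its bone's 3x3 matrix
    step2 = step1 @ rotmats
    # (3) scale the squared, rotated offsets per bone and per axis
    step3 = np.exp(log_ctl)[:, :, None, :] * step2 ** 2
    return step1, step2, step3

H, J, N = 1, 2, 5
rng = np.random.default_rng(0)
ctl_ts = rng.standard_normal((H, J, 3))
verts = rng.standard_normal((H, N, 3))
rotmats = np.tile(np.eye(3), (H, J, 1, 1))  # identity rotations
log_ctl = np.zeros((H, J, 3))               # unit scales

step1, step2, step3 = bone_vertex_distance(ctl_ts, verts, rotmats, log_ctl)
for name, arr in zip(("(1)", "(2)", "(3)"), (step1, step2, step3)):
    print(name, "all zero:", not arr.any())
```

With identity rotations and unit scales, step (2) should equal step (1) and step (3) should equal its element-wise square, so a zero result at (2) would indicate a problem in the multiplication itself.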

Kana-alt commented 3 years ago

I ran the following code to print dis_norm at each step. The output of (1) was not zero. Does that mean that all outputs need to be zero?

        dis_norm = (self.ctl_ts.view(opts.n_hypo,-1,1,3) - pred_v.view(2*local_batch_size,opts.n_hypo,-1,3)[0,:,None].detach()) # p-v, H,J,1,3 - H,1,N,3
        print('(1)',dis_norm)
        #dis_norm = dis_norm.matmul(kornia.quaternion_to_rotation_matrix(self.ctl_rs).view(opts.n_hypo,-1,3,3)) # h,j,n,3
        from pytorch3d import transforms
        ctl_rs = torch.cat([self.ctl_rs[:,3:4], self.ctl_rs[:,:3]],-1)
        dis_norm = dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3)) # h,j,n,3
        print('(2)',dis_norm)
        dis_norm = self.log_ctl.exp().view(opts.n_hypo,-1,1,3) * dis_norm.pow(2) # (p-v)^TS(p-v)
        print('(3)',dis_norm)
gengshan-y commented 3 years ago

The output of dis_norm should not be zero in any case. Does the output of transforms.quaternion_to_matrix(ctl_rs) look reasonable? They should be identity matrices of shape 3x3.
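
As a rough way to check this without PyTorch3D, the sketch below converts real-part-first quaternions to rotation matrices in NumPy and tests whether they are identity. PyTorch3D's quaternion_to_matrix expects the real part first, which is why the snippet above moves the fourth component of ctl_rs to the front. The helper name `quat_wxyz_to_matrix` and the sample values are hypothetical, and treating ctl_rs as scalar-last is an assumption based on that reordering.

```python
import numpy as np

def quat_wxyz_to_matrix(q):
    """Convert quaternions of shape (..., 4), real part first, to (..., 3, 3)."""
    q = q / np.linalg.norm(q, axis=-1, keepdims=True)  # normalize to unit length
    w, x, y, z = np.moveaxis(q, -1, 0)
    m = np.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w),
        2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w),
        2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y),
    ], axis=-1)
    return m.reshape(q.shape[:-1] + (3, 3))

# Assumed scalar-last (x, y, z, w) identity quaternions, one per bone.
ctl_rs_xyzw = np.tile(np.array([0.0, 0.0, 0.0, 1.0]), (3, 1))
# Same reordering as in the thread: move the last component to the front.
ctl_rs_wxyz = np.concatenate([ctl_rs_xyzw[:, 3:4], ctl_rs_xyzw[:, :3]], axis=-1)

rotmats = quat_wxyz_to_matrix(ctl_rs_wxyz)
is_identity = np.allclose(rotmats, np.eye(3))

# A 90 degree rotation about z, as a sanity check of the formula.
rot_z90 = quat_wxyz_to_matrix(np.array([np.sqrt(0.5), 0.0, 0.0, np.sqrt(0.5)]))
print(is_identity)
```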

Kana-alt commented 3 years ago

The outputs of (2) and (3) are both zero. The output continues like the following.

tensor([[[[0., 0., 0.], [0., 0., 0.], [0., 0., 0.], ..., [0., 0., 0.], [0., 0., 0.], [0., 0., 0.]],

transforms.quaternion_to_matrix(ctl_rs) gives the following output, which I think is reasonable.

[[1., 0., 0.], [0., 1., 0.], [0., 0., 1.]],

Kana-alt commented 3 years ago

I used the following code to save the output of (2) to a JSON file. The saved values were all zero. https://drive.google.com/file/d/1LqO9NTBZGmc9Qt0jppJqDVbkXc9JxSmb/view?usp=sharing

        from pytorch3d import transforms
        ctl_rs = torch.cat([self.ctl_rs[:,3:4], self.ctl_rs[:,:3]],-1)
        dis_norm = dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3)) # h,j,n,3

        import json
        data_save = {}
        data_save['dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3))'] = dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3)).detach().cpu().numpy().tolist() # mahalanobis distance
        with open('data_2.json', 'w') as f: json.dump(data_save, f)

However, when I checked by saving dis_norm to a json file with the following code, it was not zero. https://drive.google.com/file/d/1yXH6COK_wgcOjejPtXe3rYXob-XRqRwU/view?usp=sharing

        dis_norm = self.log_ctl.exp().view(opts.n_hypo,-1,1,3) * dis_norm.pow(2) # (p-v)^TS(p-v)
        import json
        data_save = {}
        data_save['dis_norm'] =dis_norm.detach().cpu().numpy().tolist() 
        with open('data_3.json', 'w') as f: json.dump(data_save, f)
gengshan-y commented 3 years ago

OK, multiplying by an identity matrix should not change the values from non-zero to zero. Can you verify this? Something went wrong going from (1) to (2).

gengshan-y commented 3 years ago

The shape of dis_norm should be HxBxNx3 before multiplication with the HxBx3x3 rotation matrix transforms.quaternion_to_matrix(ctl_rs). Maybe you could verify this by doing the matrix multiplication element by element.

rotmat = transforms.quaternion_to_matrix(ctl_rs)
print(dis_norm[0,0].matmul(rotmat[0,0]))
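
A fuller version of this check, sketched in NumPy with made-up shapes and random data, compares the batched product against a loop over the H and B indices. The variable names and sizes are assumptions for illustration only; in the real code the same comparison would be done on the PyTorch tensors.

```python
import numpy as np

H, B, N = 1, 3, 4  # hypotheses, bones, vertices (illustrative sizes)
rng = np.random.default_rng(0)
dis_norm = rng.standard_normal((H, B, N, 3))
rotmat = np.tile(np.eye(3), (H, B, 1, 1))  # identity rotations, H x B x 3 x 3

# Batched product: (H, B, N, 3) @ (H, B, 3, 3) -> (H, B, N, 3)
batched = dis_norm @ rotmat

# The same product computed one (h, b) pair at a time.
manual = np.empty_like(dis_norm)
for h in range(H):
    for b in range(B):
        manual[h, b] = dis_norm[h, b] @ rotmat[h, b]

print("shapes:", dis_norm.shape, rotmat.shape, batched.shape)
print("batched matches manual:", np.allclose(batched, manual))
print("unchanged by identity:", np.allclose(batched, dis_norm))
```

If the batched and element-level results disagree, or the batched result is zero while the manual one is not, that would point at the batched matmul rather than at the inputs.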