Closed Kana-alt closed 3 years ago
Could you attach the full terminal outputs? This suggests loss goes to nan.
The output is as follows
$ bash scripts/spot3.sh
Jitting Chamfer 3D
Loaded JIT 3D CUDA chamfer distance
1/0
init:0, end:-1
198 pairs of images
Only the mean shape is symmetric!
found mean v
found tex
found ctl rotation
found rest translation
found ctl points
found log ctl
running k-means on cuda:0..
[running kmeans]: 16it [00:00, 308.03it/s, center_shift=0.000000, iteration=17, tol=0.000100]
(several further k-means runs follow; their tqdm progress lines, overwritten in place by carriage returns, are omitted here; every run converged to center_shift=0.000000)
new bone locations
scores:
tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')
selecting 0
[running kmeans]: 18it [00:00, 55.43it/s, center_shift=0.000000, iteration=18, tol=0.000100]
[running kmeans]: 18it [00:00, 67.46it/s, center_shift=0.000000, iteration=18, tol=0.000100]
[running kmeans]: 17it [00:00, 81.03it/s, center_shift=0.000000, iteration=17, tol=0.000100]
[running kmeans]: 14it [00:00, 89.70it/s, center_shift=0.000000, iteration=14, tol=0.000100]
[running kmeans]: 18it [00:00, 166.93it/s, center_shift=0.000000, iteration=18, tol=0.000100]
[running kmeans]: 14it [00:00, 285.18it/s, center_shift=0.000000, iteration=14, tol=0.000100]
/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/kornia/geometry/conversions.py:506: UserWarning: XYZW quaternion coefficient order is deprecated and will be removed after > 0.6. Please use QuaternionCoeffOrder.WXYZ instead.
  warnings.warn("XYZW quaternion coefficient order is deprecated and"
/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/nn/functional.py:3385: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
warnings.warn("Default grid_sample and affine_grid behavior has changed "
/home/kana/lasr/nnutils/train_utils.py:76: RuntimeWarning: invalid value encountered in true_divide
timg = (timg-timg.min())/(timg.max()-timg.min())
/home/kana/lasr/nnutils/train_utils.py(295)train() -> self.optimizer.step() (Pdb)
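As an aside, the RuntimeWarning above comes from min-max normalization dividing by zero when `timg` is constant, which produces nan. A minimal sketch of a guarded version, assuming `timg` is a NumPy array (`normalize_safe` is a hypothetical name, not part of the LASR code):

```python
import numpy as np

def normalize_safe(timg, eps=1e-8):
    # Min-max normalize an image; the original expression
    # (timg - timg.min()) / (timg.max() - timg.min())
    # yields nan when the image is constant, since max - min == 0.
    rng = timg.max() - timg.min()
    if rng < eps:
        # Constant image: return zeros instead of dividing by zero.
        return np.zeros_like(timg)
    return (timg - timg.min()) / rng
```

The warning itself is for a visualization tensor, so it may be harmless, but it is a hint that some input images are entirely constant.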
Interesting, it stops at the first few iterations. This might be caused by improperly generated data. Were you able to run optimization for camel?
Update: I uploaded the pre-rendered spot data for you to compare.
No, Camel's optimization also fails.
bash scripts/render_result.sh camel
Setting up model..
loading log/camel-5/pred_net_10.pth..
Traceback (most recent call last):
File "extract.py", line 253, in
You attached the output of the rendering script. Do you have the output of the optimization script?
bash scripts/template.sh camel
Failed to execute.
bash scripts/template.sh camel
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
Jitting Chamfer 3D
Jitting Chamfer 3D
Loaded JIT 3D CUDA chamfer distance
Loaded JIT 3D CUDA chamfer distance
Traceback (most recent call last):
File "optimize.py", line 59, in
This is related to PyTorch DistributedDataParallel: it seems the previous process hung and is still occupying the port (and likely GPU memory), so you'll need to force-kill it before launching a new process:
pkill -f xxx
where xxx is a substring of the hanging command. See https://github.com/NVIDIA/tacotron2/issues/181#issuecomment-481607690
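For reference, the effect of `pkill -f` can also be sketched from Python. This is a hedged illustration rather than part of the repo; `kill_by_cmdline` is a hypothetical helper, and it assumes the standard `pgrep` utility is installed:

```python
import os
import signal
import subprocess

def kill_by_cmdline(substring):
    # Emulate `pkill -f <substring>`: find PIDs whose full command line
    # contains the substring, then send each one SIGTERM.
    result = subprocess.run(["pgrep", "-f", substring],
                            capture_output=True, text=True)
    for pid in result.stdout.split():
        os.kill(int(pid), signal.SIGTERM)
```

Sending SIGTERM first (rather than SIGKILL) gives the process a chance to release the port and GPU memory cleanly.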
Overall, I'd suggest using the pre-rendered spot data and running the optimization again. The code was tested on several machines and I didn't observe similar issues. Let me know if it still produces a nan loss.
I ran the optimization again using the pre-rendered data. However, I get the following.
bash scripts/spot3.sh
Jitting Chamfer 3D
Loaded JIT 3D CUDA chamfer distance
1/0
init:0, end:-1
198 pairs of images
Only the mean shape is symmetric!
found mean v
found tex
found ctl rotation
found rest translation
found ctl points
found log ctl
running k-means on cuda:0..
[running kmeans]: 17it [00:00, 341.07it/s, center_shift=0.000000, iteration=18, tol=0.000100]
running k-means on cuda:0..
running k-means on cuda:0..
[running kmeans]: 13it [00:00, 337.02it/s, center_shift=0.000000, iteration=14, tol=0.000100]
running k-means on cuda:0..
running k-means on cuda:0..
[running kmeans]: 13it [00:00, 335.13it/s, center_shift=0.000000, iteration=14, tol=0.000100]
running k-means on cuda:0..
running k-means on cuda:0..
[running kmeans]: 20it [00:00, 340.19it/s, center_shift=0.000000, iteration=21, tol=0.000100]
running k-means on cuda:0..
running k-means on cuda:0..
new bone locations
scores:
[running kmeans]: 14it [00:00, 350.38it/s, center_shift=0.000000, iteration=15, tol=0.000100]
tensor([0., 0., 0., 0., 0., 0., 0., 0.], device='cuda:0')
selecting 0
[running kmeans]: 13it [00:00, 340.97it/s, center_shift=0.000000, iteration=14, tol=0.000100]
[running kmeans]: 18it [00:00, 44.70it/s, center_shift=0.000000, iteration=18, tol=0.000100]
[running kmeans]: 14it [00:00, 39.79it/s, center_shift=0.000000, iteration=14, tol=0.000100]
[running kmeans]: 14it [00:00, 44.92it/s, center_shift=0.000000, iteration=14, tol=0.000100]
[running kmeans]: 21it [00:00, 77.39it/s, center_shift=0.000000, iteration=21, tol=0.000100]
[running kmeans]: 15it [00:00, 71.10it/s, center_shift=0.000000, iteration=15, tol=0.000100]
[running kmeans]: 14it [00:00, 82.74it/s, center_shift=0.000000, iteration=14, tol=0.000100]
[running kmeans]: 19it [00:00, 146.67it/s, center_shift=0.000000, iteration=19, tol=0.000100]
[running kmeans]: 26it [00:00, 347.56it/s, center_shift=0.000000, iteration=26, tol=0.000100]
/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/kornia/geometry/conversions.py:506: UserWarning: XYZW quaternion coefficient order is deprecated and will be removed after > 0.6. Please use QuaternionCoeffOrder.WXYZ instead.
  warnings.warn("XYZW quaternion coefficient order is deprecated and"
/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/nn/functional.py:3385: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
warnings.warn("Default grid_sample and affine_grid behavior has changed "
/home/kana/lasr3/nnutils/train_utils.py(295)train() -> self.optimizer.step() (Pdb)
I'm sorry. I forgot that I was getting the following error when building the Docker image. Is it possible that this is the cause? Because of this error, I used the "Build with conda" instructions to set up the environment instead.
sudo docker build --tag lasr:latest -f docker/Dockerfile ./
Sending build context to Docker daemon  20.41MB
Step 1/15 : FROM nvidia/cuda:11.0-devel-ubuntu18.04
 ---> d89f75c1799d
Step 2/15 : ENV CONDA_DIR /anaconda3
 ---> Using cache
 ---> 952b951d6c2c
Step 3/15 : COPY third_party/softras /workspace/softras
 ---> Using cache
 ---> a28e87572e82
Step 4/15 : COPY lasr.yml /workspace/lasr.yml
 ---> Using cache
 ---> 34095cc757ec
Step 5/15 : RUN apt-get update -q
 ---> Running in 6f3373585c02
Ign:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 InRelease
Ign:2 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 InRelease
Err:3 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release
  Could not handshake: Error in the pull function. [IP: 152.199.39.144 443]
Err:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release
  Could not handshake: Error in the pull function. [IP: 152.199.39.144 443]
Err:5 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Connection failed [IP: 91.189.91.39 80]
Err:6 http://archive.ubuntu.com/ubuntu bionic InRelease
  Connection failed [IP: 91.189.88.142 80]
Err:7 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
  Connection failed [IP: 91.189.88.142 80]
Err:8 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
  Connection failed [IP: 91.189.88.142 80]
Reading package lists...
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64 Release' does not have a Release file.
E: The repository 'https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Release' does not have a Release file.
The command '/bin/sh -c apt-get update -q' returned a non-zero code: 100
Since I'm not able to reproduce this issue, I can only give suggestions for debugging here.
To further debug the cause of total_loss being nan, I would print out the values of the individual losses in the forward pass of the model and check where it first becomes nan, specifically from mask_loss through the auxiliary losses.
Given the observations, my best guess is that some package versions are wrong. This is unlikely to happen with the conda environment. Please let me know whether the conda build works for you.
conda build is working properly.
When I retry from the "Data preparation" step, I see the following output after running auto_gen.py. Could this be related to the problem?
Traceback (most recent call last):
File "preprocess/auto_gen.py", line 191, in
Were you able to run optimization on spot? You don't need to run auto_gen.py for optimization on spot.
No, I get the same error as camel.
I'm still confused about how to reproduce your issue. If you plan to investigate further, please post a summary of the issue here or in a new thread. Please include your system hardware, your conda & pip package lists (if you used conda install), and the steps to reproduce the issue.
This is my experimental environment.
Nvidia Driver Version: 455.23.05
CUDA version: 11.0
Ubuntu 18.04.5
●conda list
_libgcc_mutex 0.1 main conda-forge
absl-py 0.12.0 pypi_0 pypi
antlr4-python3-runtime 4.8 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
argon2-cffi 20.1.0 py38h27cfd23_1
async_generator 1.10 pyhd3eb1b0_0
attrs 21.2.0 pyhd3eb1b0_0
backcall 0.2.0 pyhd3eb1b0_0
black 21.4b2 pypi_0 pypi
blas 1.0 mkl conda-forge
bleach 3.3.0 pyhd3eb1b0_0
ca-certificates 2021.5.25 h06a4308_1
cachetools 4.2.2 pypi_0 pypi
certifi 2021.5.30 py38h06a4308_0
cffi 1.14.5 py38h261ae71_0
chardet 4.0.0 pypi_0 pypi
click 8.0.1 pypi_0 pypi
cloudpickle 1.6.0 pypi_0 pypi
cudatoolkit 11.0.221 h6bb024c_0 anaconda
cudatoolkit-dev 11.0.3 py38h7f98852_1 conda-forge
cycler 0.10.0 pypi_0 pypi
cython 0.29.23 pypi_0 pypi
dbus 1.13.18 hb2f20db_0
decorator 5.0.9 pyhd3eb1b0_0
defusedxml 0.7.1 pyhd3eb1b0_0
entrypoints 0.3 py38_0
environment-kernels 1.1.1 pypi_0 pypi
expat 2.4.1 h2531618_2
filelock 3.0.12 pypi_0 pypi
fontconfig 2.13.1 h6c09931_0
freetype 2.10.4 h5ab3b9f_0 anaconda
freetype-py 2.2.0 pypi_0 pypi
future 0.18.2 pypi_0 pypi
fvcore 0.1.5.post20210624 pypi_0 pypi
gdown 3.13.0 pypi_0 pypi
glib 2.68.2 h36276a3_0
google-auth 1.30.1 pypi_0 pypi
google-auth-oauthlib 0.4.4 pypi_0 pypi
grpcio 1.38.0 pypi_0 pypi
gst-plugins-base 1.14.0 h8213a91_2
gstreamer 1.14.0 h28cd5cc_2
hydra-core 1.1.0 pypi_0 pypi
icu 58.2 he6710b0_3
idna 2.10 pypi_0 pypi
imageio 2.9.0 pypi_0 pypi
importlib-metadata 3.10.0 py38h06a4308_0
importlib-resources 5.2.0 pypi_0 pypi
importlib_metadata 3.10.0 hd3eb1b0_0
intel-openmp 2021.2.0 h06a4308_610
iopath 0.1.8 py38 iopath
ipykernel 5.3.4 py38h5ca1d4c_0
ipython 7.22.0 py38hb070fc8_0
ipython_genutils 0.2.0 pyhd3eb1b0_1
ipywidgets 7.6.3 pyhd3eb1b0_1
jedi 0.17.0 py38_0
jinja2 3.0.1 pyhd3eb1b0_0
jpeg 9b h024ee3a_2
jsonschema 3.2.0 py_2
jupyter 1.0.0 py38_7
jupyter_client 6.1.12 pyhd3eb1b0_0
jupyter_console 6.4.0 pyhd3eb1b0_0
jupyter_core 4.7.1 py38h06a4308_0
jupyterlab_pygments 0.1.2 py_0
jupyterlab_widgets 1.0.0 pyhd3eb1b0_1
kiwisolver 1.3.1 pypi_0 pypi
kmeans-pytorch 0.3 pypi_0 pypi
kornia 0.5.3 pypi_0 pypi
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.33.1 h53a641e_7 conda-forge
libffi 3.3 he6710b0_2 anaconda
libgcc-ng 9.1.0 hdf63c60_0 anaconda
libpng 1.6.37 hbc83047_0 anaconda
libsodium 1.0.18 h7b6447c_0
libstdcxx-ng 9.1.0 hdf63c60_0 anaconda
libtiff 4.2.0 h85742a9_0
libuuid 1.0.3 h1bed415_2
libuv 1.40.0 h7b6447c_0 anaconda
libwebp-base 1.2.0 h27cfd23_0
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 hb55368b_3
lz4-c 1.9.3 h2531618_0
markdown 3.3.4 pypi_0 pypi
markupsafe 2.0.1 py38h27cfd23_0
matplotlib 3.4.2 pypi_0 pypi
mistune 0.8.4 py38h7b6447c_1000
mkl 2021.2.0 h06a4308_296
mkl-service 2.3.0 py38h27cfd23_1
mkl_fft 1.3.0 py38h42c9631_2
mkl_random 1.2.1 py38ha9443f7_2
mypy-extensions 0.4.3 pypi_0 pypi
nbclient 0.5.3 pyhd3eb1b0_0
nbconvert 6.1.0 py38h06a4308_0
nbformat 5.1.3 pyhd3eb1b0_0
ncurses 6.2 he6710b0_1 anaconda
nest-asyncio 1.5.1 pyhd3eb1b0_0
networkx 2.6rc1 pypi_0 pypi
ninja 1.10.2 hff7bd54_1
notebook 6.4.0 py38h06a4308_0
numpy 1.20.2 py38h2d18471_0
numpy-base 1.20.2 py38hfae3a4d_0
oauthlib 3.1.1 pypi_0 pypi
olefile 0.46 py_0 conda-forge
omegaconf 2.1.0 pypi_0 pypi
opencv-python 4.4.0.46 pypi_0 pypi
openssl 1.1.1k h27cfd23_0
packaging 20.9 pyhd3eb1b0_0
pandas 1.2.4 pypi_0 pypi
pandocfilters 1.4.3 py38h06a4308_1
parso 0.8.2 pyhd3eb1b0_0
pathspec 0.8.1 pypi_0 pypi
pcre 8.45 h295c915_0
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 8.2.0 py38he98fc37_0
pip 21.1.1 py38h06a4308_0
portalocker 1.7.0 py38h578d9bd_1 conda-forge
prometheus_client 0.11.0 pyhd3eb1b0_0
prompt-toolkit 3.0.17 pyh06a4308_0
prompt_toolkit 3.0.17 hd3eb1b0_0
protobuf 3.17.2 pypi_0 pypi
ptyprocess 0.7.0 pyhd3eb1b0_2
pyasn1 0.4.8 pypi_0 pypi
pyasn1-modules 0.2.8 pypi_0 pypi
pycocotools 2.0.2 pypi_0 pypi
pycparser 2.20 py_2
pydot 1.4.2 pypi_0 pypi
pyglet 1.5.17 pypi_0 pypi
pygments 2.9.0 pyhd3eb1b0_0
pyopengl 3.1.0 pypi_0 pypi
pyparsing 3.0.0b2 pypi_0 pypi
pypng 0.0.20 pypi_0 pypi
pyqt 5.9.2 py38h05f1152_4
pyrender 0.1.45 pypi_0 pypi
pyrsistent 0.17.3 py38h7b6447c_0
pysocks 1.7.1 pypi_0 pypi
python 3.8.10 hdb3f193_7
python-dateutil 2.8.1 pyhd3eb1b0_0
python_abi 3.8 1_cp38 conda-forge
pytorch 1.7.1 py3.8_cuda11.0.221_cudnn8.0.5_0 pytorch
pytorch3d 0.4.0 py38_cu110_pyt171 pytorch3d
pytz 2021.1 pypi_0 pypi
pywavelets 1.1.1 pypi_0 pypi
pyyaml 5.3.1 py38h8df0ef7_1 conda-forge
pyzmq 20.0.0 py38h2531618_1
qt 5.9.7 h5867ecd_1
qtconsole 5.1.0 pyhd3eb1b0_0
qtpy 1.9.0 py_0
readline 8.1 h27cfd23_0
regex 2021.4.4 pypi_0 pypi
requests 2.25.1 pypi_0 pypi
requests-oauthlib 1.3.0 pypi_0 pypi
rsa 4.7.2 pypi_0 pypi
scikit-image 0.18.2rc2 pypi_0 pypi
scipy 1.6.3 pypi_0 pypi
send2trash 1.5.0 pyhd3eb1b0_1
setuptools 52.0.0 py38h06a4308_0
sip 4.19.13 py38he6710b0_0
six 1.15.0 py38h06a4308_0
soft-renderer 1.0.0 pypi_0 pypi
sqlite 3.35.4 hdfb4753_0
tabulate 0.8.9 pyhd8ed1ab_0 conda-forge
tensorboard 2.5.0 pypi_0 pypi
tensorboard-data-server 0.6.1 pypi_0 pypi
tensorboard-plugin-wit 1.8.0 pypi_0 pypi
termcolor 1.1.0 py_2 conda-forge
terminado 0.9.4 py38h06a4308_0
testpath 0.5.0 pyhd3eb1b0_0
tifffile 2021.4.8 pypi_0 pypi
tk 8.6.10 hbc83047_0 anaconda
toml 0.10.2 pypi_0 pypi
torchvision 0.8.2 py38_cu110 pytorch
tornado 6.1 py38h27cfd23_0
tqdm 4.61.0 pyhd8ed1ab_0 conda-forge
traitlets 5.0.5 pyhd3eb1b0_0
trimesh 3.9.20 pypi_0 pypi
typing_extensions 3.7.4.3 pyha847dfd_0
urllib3 1.26.5 pypi_0 pypi
wcwidth 0.2.5 py_0
webencodings 0.5.1 py38_1
werkzeug 2.0.1 pypi_0 pypi
wheel 0.36.2 pyhd3eb1b0_0
widgetsnbextension 3.5.1 py38_0
xz 5.2.5 h7b6447c_0 anaconda
yacs 0.1.6 py_0 conda-forge
yaml 0.2.5 h516909a_0 conda-forge
zeromq 4.3.4 h2531618_0
zipp 3.4.1 pyhd3eb1b0_0
zlib 1.2.11 h7b6447c_3 anaconda
zstd 1.4.9 haebb681_0
●pip list
Package Version
absl-py 0.12.0
antlr4-python3-runtime 4.8
appdirs 1.4.4
argon2-cffi 20.1.0
async-generator 1.10
attrs 21.2.0
backcall 0.2.0
black 21.4b2
bleach 3.3.0
cachetools 4.2.2
certifi 2021.5.30
cffi 1.14.5
chardet 4.0.0
click 8.0.1
cloudpickle 1.6.0
cycler 0.10.0
Cython 0.29.23
decorator 5.0.9
defusedxml 0.7.1
entrypoints 0.3
environment-kernels 1.1.1
filelock 3.0.12
freetype-py 2.2.0
future 0.18.2
fvcore 0.1.5.post20210624
gdown 3.13.0
google-auth 1.30.1
google-auth-oauthlib 0.4.4
grpcio 1.38.0
hydra-core 1.1.0
idna 2.10
imageio 2.9.0
importlib-metadata 3.10.0
importlib-resources 5.2.0
iopath 0.1.8
ipykernel 5.3.4
ipython 7.22.0
ipython-genutils 0.2.0
ipywidgets 7.6.3
jedi 0.17.0
Jinja2 3.0.1
jsonschema 3.2.0
jupyter 1.0.0
jupyter-client 6.1.12
jupyter-console 6.4.0
jupyter-core 4.7.1
jupyterlab-pygments 0.1.2
jupyterlab-widgets 1.0.0
kiwisolver 1.3.1
kmeans-pytorch 0.3
kornia 0.5.3
Markdown 3.3.4
MarkupSafe 2.0.1
matplotlib 3.4.2
mistune 0.8.4
mkl-fft 1.3.0
mkl-random 1.2.1
mkl-service 2.3.0
mypy-extensions 0.4.3
nbclient 0.5.3
nbconvert 6.1.0
nbformat 5.1.3
nest-asyncio 1.5.1
networkx 2.6rc1
notebook 6.4.0
numpy 1.20.2
oauthlib 3.1.1
olefile 0.46
omegaconf 2.1.0
opencv-python 4.4.0.46
packaging 20.9
pandas 1.2.4
pandocfilters 1.4.3
parso 0.8.2
pathspec 0.8.1
pexpect 4.8.0
pickleshare 0.7.5
Pillow 8.2.0
pip 21.1.1
portalocker 1.7.0
prometheus-client 0.11.0
prompt-toolkit 3.0.17
protobuf 3.17.2
ptyprocess 0.7.0
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycocotools 2.0.2
pycparser 2.20
pydot 1.4.2
pyglet 1.5.17
Pygments 2.9.0
PyOpenGL 3.1.0
pyparsing 3.0.0b2
pypng 0.0.20
pyrender 0.1.45
pyrsistent 0.17.3
PySocks 1.7.1
python-dateutil 2.8.1
pytorch3d 0.4.0
pytz 2021.1
PyWavelets 1.1.1
PyYAML 5.3.1
pyzmq 20.0.0
qtconsole 5.1.0
QtPy 1.9.0
regex 2021.4.4
requests 2.25.1
requests-oauthlib 1.3.0
rsa 4.7.2
scikit-image 0.18.2rc2
scipy 1.6.3
Send2Trash 1.5.0
setuptools 52.0.0.post20210125
sip 4.19.13
six 1.15.0
soft-renderer 1.0.0
tabulate 0.8.9
tensorboard 2.5.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.0
termcolor 1.1.0
terminado 0.9.4
testpath 0.5.0
tifffile 2021.4.8
toml 0.10.2
torch 1.7.1
torchvision 0.8.2
tornado 6.1
tqdm 4.61.0
traitlets 5.0.5
trimesh 3.9.20
typing-extensions 3.7.4.3
urllib3 1.26.5
wcwidth 0.2.5
webencodings 0.5.1
Werkzeug 2.0.1
wheel 0.36.2
widgetsnbextension 3.5.1
yacs 0.1.6
zipp 3.4.1
Those look correct. What GPU card are you using?
I use a TITAN X (Pascal).
Ok, I've validated on almost the same setting as yours.
If the same error persists, the best suggestion I can give is to print out the variable self.total_loss in nnutils/mesh_net.py as mentioned earlier and see where it becomes nan. For example, add print(self.total_loss)
after this line. It should be a valid number if everything is correct.
I inserted print(self.total_loss) and ran bash scripts/spot3.sh. When I continue in PDB mode, I get the following output.
/home/shiori/lasr/nnutils/train_utils.py(295)train() -> self.optimizer.step()
(Pdb) c
/home/shiori/lasr/nnutils/train_utils.py:76: RuntimeWarning: invalid value encountered in true_divide
  timg = (timg-timg.min())/(timg.max()-timg.min())
tensor(0.1620, device='cuda:0', grad_fn=<...>)
tensor(nan, device='cuda:0', grad_fn=<...>)
/home/shiori/lasr/nnutils/train_utils.py(294)train() -> pdb.set_trace()
(Pdb) c
tensor(0.1632, device='cuda:0', grad_fn=<...>)
/home/shiori/lasr/nnutils/train_utils.py(295)train() -> self.optimizer.step()
Thanks. The code block l382-l522 in nnutils/mesh_net.py sequentially computes and adds the following losses: (1) mask loss, (2) flow loss, (3) rgb loss, (4) shape loss (5) deformation loss (6) bone symmetry loss (7) camera loss (8) auxiliary losses.
Given the information you provided, self.total_loss is valid after (1) but becomes nan after (8). To figure out when it becomes nan, you could print out the value of self.total_loss after adding each loss.
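One way to follow this suggestion is a small checking helper. This is a sketch of my own rather than code from the repo (the name `add_loss` is hypothetical, and plain floats stand in for the torch tensors used in the real code):

```python
import numpy as np

def add_loss(total, name, value):
    # Accumulate one loss term; report the first term that turns the
    # running total non-finite, so the culprit among losses (1)-(8)
    # can be identified.
    if np.isfinite(total) and not np.isfinite(value):
        print(f"{name} introduced a non-finite value: {value}")
    return total + value
```

Calling this once per loss term, in the order the code adds them, prints exactly the first offender instead of eight raw tensor values.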
Thank you. It becomes nan at the output of (3).
/home/kana/anaconda3/envs/lasr/lib/python3.8/site-packages/torch/nn/functional.py:3385: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
warnings.warn("Default grid_sample and affine_grid behavior has changed "
tensor(nan, device='cuda:0', grad_fn=<...>)
/home/kana/lasr/nnutils/train_utils.py(295)train() -> self.optimizer.step()
(Pdb) c
/home/kana/lasr/nnutils/train_utils.py:76: RuntimeWarning: invalid value encountered in true_divide
  timg = (timg-timg.min())/(timg.max()-timg.min())
tensor(0.1620, device='cuda:0', grad_fn=<...>)
tensor(0.1620, device='cuda:0', grad_fn=<...>)
tensor(nan, device='cuda:0', grad_fn=<...>)
tensor(nan, device='cuda:0', grad_fn=<...>)
tensor(nan, device='cuda:0', grad_fn=<...>)
tensor(nan, device='cuda:0', grad_fn=<...>)
tensor(nan, device='cuda:0', grad_fn=<...>)
tensor(nan, device='cuda:0', grad_fn=<...>)
/home/kana/lasr/nnutils/train_utils.py(294)train() -> pdb.set_trace()
Hi, are you able to further localize which line produces the invalid value that makes self.total_loss become nan?
The output of self.texture_loss_sub and self.texture_loss is as follows.
self.texture_loss_sub: tensor([[   nan,    nan,    nan,    nan,    nan,    nan, 0.1660, 0.1662],
        [0.1692, 0.1693, 0.1693, 0.1695, 0.1692, 0.1693, 0.1694, 0.1695]],
       device='cuda:0', grad_fn=<...>)
Thanks. self.texture_loss_sub is the sum of (1) the rgb loss and (2) the perceptual loss; printing each of them would help to localize further.
Thank you. I got the following output.
rgb_loss: tensor(nan, device='cuda:0', grad_fn=<...>)
percept_loss: tensor([   nan,    nan,    nan,    nan,    nan,    nan, 2.1181, 2.1905, 2.1126,
        2.1569, 2.1562, 2.1891, 2.1188, 2.1328, 2.2336, 2.2318,    nan,    nan,
           nan,    nan,    nan,    nan, 2.0471, 2.0863, 1.9113, 1.9823, 1.9965,
        2.0729, 1.9108, 1.9763, 1.9714, 2.0456], device='cuda:0',
       grad_fn=<...>)
I'm not sure what is happening. Could you add these lines (with numpy imported as np) before computing the perceptual loss
import numpy as np
data_save = {}
data_save['obspair'] = obspair.detach().cpu().numpy()
data_save['rndpair'] = rndpair.detach().cpu().numpy()
data_save['verts_pre'] = verts_pre.detach().cpu().numpy()
data_save['faces'] = faces.detach().cpu().numpy()
data_save['tex'] = tex.detach().cpu().numpy()
np.save('./data.npy', data_save)
and share the saved data.npy file in the current folder with me? I will check whether it's a rendering problem or data loading problem.
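If sharing the file is inconvenient, a compact per-array summary is easier to paste than raw values. Here is a sketch, under the assumption that data.npy was written with np.save as above (`summarize_npy` is a hypothetical name, not part of the repo):

```python
import numpy as np

def summarize_npy(path='./data.npy'):
    # Reload the dict saved with np.save (stored as a 0-d object array,
    # hence allow_pickle + .item()) and print a short summary per array.
    data = np.load(path, allow_pickle=True).item()
    for key, arr in data.items():
        status = 'all finite' if np.isfinite(arr).all() else 'contains nan/inf'
        print(key, arr.shape, arr.dtype, status)
    return data
```

One line per array (name, shape, dtype, finiteness) is usually enough to tell a rendering problem from a data-loading problem.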
Since I don't know how to share .npy files, the printed output is given below.
```
{'obspair': array([[[[0., 0., 0., ..., 0., 0., 0.],
         [0., 0., 0., ..., 0., 0., 0.],
         ...,
         [0., 0., 0., ..., 0., 0., 0.]],
        ...],
       ...,
       [[[1., 1., 1., ..., 1., 1., 1.],
         [1., 1., 1., ..., 1., 1., 1.],
         ...,
         [1., 1., 1., ..., 1., 1., 1.]],
        ...]], dtype=float32),
 'rndpair': array(...),  (same pattern as obspair)
 'verts_pre': array([[[  0.03669876,  -0.53712124,  19.928108  ],
        [  0.11251941,  -0.13512413,   9.785472  ],
        [ 11.271721  ,  -0.02641002,   8.839373  ],
        ...,
        [  1.0225617 , -11.714436  ,   8.412083  ],
        [ -0.10530925,  -0.42463654,  20.043804  ],
        [ -0.16680032,   0.17549148,   9.568409  ]],
       ...,
       [[ 10.569969  , -11.977585  ,   8.30729   ],
        [ -0.1654361 , -19.899996  ,   9.459713  ],
        [ -0.3769737 ,  -5.5397005 ,  19.55267   ],
        ...,
        [  0.62245774, -11.1361065 ,   8.21687   ],
        [ 10.590989  , -11.898343  ,   8.455078  ],
        [ -0.59070694, -20.045033  ,   9.4779825 ]]], dtype=float32),
 'faces': array([[[105,  16, 410],
        [  8, 105, 410],
        [ 16, 413, 108],
        ...,
        [352, 546, 565],
        [391, 561, 546],
        [395, 561, 565]],
       [[105,  16, 410],
        ...]]),
 'tex': array([[[0.28400886, 0.39523867, 0.06636933],
        [0.53623235, 0.4698201 , 0.3589957 ],
        [0.34913328, 0.41863316, 0.2545403 ],
        ...,
        [0.75019354, 0.47404566, 0.6776529 ],
        [0.68552405, 0.5935845 , 0.2544528 ],
        [0.20894824, 0.70933825, 0.37567273]],
       ...], dtype=float32)}
```

(Abridged: in the full printout, every obspair and rndpair image is either all zeros or all ones, the second faces entry repeats the first, and every tex block repeats the same rows.)
Can you share through google drive or email (gengshany@cmu.edu)?
OK. The email has been sent.
Hi, the problem I found is that the variable "verts_pre" (https://github.com/google/lasr/blob/a22780b533079befd29d0bc5e0f9ca8b95f43873/nnutils/mesh_net.py#L347) contains identical values. To be more specific, verts_pre is an Nx3 matrix where each row is the view-space (x,y,z) coordinate of a mesh vertex.
I don't know which part went wrong during the transformation from the rest-shape coordinates to view space. Could you share more information as follows?
```python
import json
data_save = {}
data_save['pred_v'] = pred_v.detach().cpu().numpy().tolist()        # rest
data_save['verts_tex'] = verts_tex.detach().cpu().numpy().tolist()  # camera
data_save['verts_pre'] = verts_pre.detach().cpu().numpy().tolist()  # view
data_save['offset'] = offset.detach().cpu().numpy().tolist()
with open('data.json', 'w') as f:
    json.dump(data_save, f)
```
Thank you. I will share the information you mentioned.
https://drive.google.com/file/d/1fZXvDXO_QE6A2yGhdLH9h4AvWkHj3MoU/view?usp=sharing
Thanks, we can nail down the problem to this block: https://github.com/google/lasr/blob/a22780b533079befd29d0bc5e0f9ca8b95f43873/nnutils/mesh_net.py#L343-L345
The output after blend skinning and camera projection was wrong. I guess that is due to a wrong transformation or wrong camera parameters.
I will need the following information to look into it further.
```python
import json
data_save = {}
data_save['Rmat_tex'] = Rmat_tex.detach().cpu().numpy().tolist()  # rotations
data_save['quat'] = quat.detach().cpu().numpy().tolist()          # rotations
data_save['Tmat'] = Tmat.detach().cpu().numpy().tolist()          # translation xyz
data_save['trans'] = trans.detach().cpu().numpy().tolist()        # translation xy
data_save['depth'] = depth.detach().cpu().numpy().tolist()        # translation z
data_save['ppoint'] = ppoint.detach().cpu().numpy().tolist()      # principal points
data_save['scale'] = scale.detach().cpu().numpy().tolist()        # focal length
data_save['cams'] = self.cams.detach().cpu().numpy().tolist()     # camera calibration
data_save['pp'] = self.pp.detach().cpu().numpy().tolist()         # camera calibration
with open('data.json', 'w') as f:
    json.dump(data_save, f)
```
Can you also share the checkpoint file log/spot3-0/pred_net_0.pth?
Thank you.
I will share the two files below.
https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing
Thanks, the rotation, translation, focal length, and principal points all look correct. It has to be a problem with the skinning weights. However, I'm still not able to reproduce the problem, even with the checkpoint you provided.
Could you also provide the skinning weights?
```python
import json
data_save = {}
data_save['skin'] = skin.detach().cpu().numpy().tolist()             # skinning weights
data_save['ctl_ts'] = self.ctl_ts.detach().cpu().numpy().tolist()    # bone centroid
data_save['ctl_rs'] = self.ctl_rs.detach().cpu().numpy().tolist()    # bone orientation
data_save['log_ctl'] = self.log_ctl.detach().cpu().numpy().tolist()  # bone scale
with open('data.json', 'w') as f:
    json.dump(data_save, f)
```
Thank you. I have shared the file.
https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing
The file does not contain all the variables listed. Could you check the updated code and share again?
I'm sorry. I've updated it.
https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing
The rest bone locations, orientations, and scales all look correct. I guess it is a numerical issue. Could you try replacing this line (https://github.com/google/lasr/blob/a22780b533079befd29d0bc5e0f9ca8b95f43873/nnutils/mesh_net.py#L260) with
```python
skin = (-10 * dis_norm.sum(3)).double().softmax(1).float()[:,:,:,None] # h,j,n,1
```
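The double cast is presumably meant to guard against float32 underflow when the scaled distances become large; a toy illustration of the difference:

```python
import torch

logits = torch.tensor([0.0, -200.0])  # e.g. -10 * squared distances
w32 = logits.softmax(0)               # float32: exp(-200) underflows to exactly 0
w64 = logits.double().softmax(0)      # float64 keeps a tiny but nonzero weight
print(w32[1].item(), w64[1].item())
```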
If it still does not fix the issue, can you share with me the following?
```python
import json
data_save = {}
data_save['skin'] = skin.detach().cpu().numpy().tolist()             # skinning weights
data_save['ctl_ts'] = self.ctl_ts.detach().cpu().numpy().tolist()    # bone centroid
data_save['ctl_rs'] = self.ctl_rs.detach().cpu().numpy().tolist()    # bone orientation
data_save['log_ctl'] = self.log_ctl.detach().cpu().numpy().tolist()  # bone scale
data_save['dis_norm'] = dis_norm.detach().cpu().numpy().tolist()     # mahalanobis distance
with open('data.json', 'w') as f:
    json.dump(data_save, f)
```
Thank you.
I fixed the code and ran it. However, it did not solve the problem.
https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing
What about replacing this line (https://github.com/google/lasr/blob/a22780b533079befd29d0bc5e0f9ca8b95f43873/nnutils/mesh_net.py#L258) with
```python
from pytorch3d import transforms
ctl_rs = torch.cat([self.ctl_rs[:,3:4], self.ctl_rs[:,:3]], -1)
dis_norm = dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3)) # h,j,n,3
```
Thank you. The result did not change. https://drive.google.com/drive/folders/123NpMzh_4ZJnsTO0TQ_DXGdzf0ViTvSK?usp=sharing
There is something apparently wrong in these lines that compute the skinning weights, because you are getting dis_norm = 0.
dis_norm is the Mahalanobis distance, computed by (1) subtracting the vertex position from the bone position, (2) rotating, and (3) scaling.
Can you debug a little bit and let me know where dis_norm becomes zero?
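The three steps can be sketched with toy tensors as follows (shapes follow the h,j,n comments in the code; the names and sizes here are illustrative, not LASR's actual ones):

```python
import torch

H, J, N = 1, 2, 5                          # hypotheses, bones, vertices (toy sizes)
ctl_ts  = torch.randn(H, J, 1, 3)          # bone centroids p
verts   = torch.randn(H, 1, N, 3)          # vertex positions v
rot     = torch.eye(3).expand(H, J, 3, 3)  # bone orientations (identity here)
log_ctl = torch.zeros(H, J, 1, 3)          # log bone scales S

d = ctl_ts - verts                         # (1) p - v, broadcast to H,J,N,3
d = d.matmul(rot)                          # (2) rotate into each bone's frame
d = log_ctl.exp() * d.pow(2)               # (3) scale: the (p-v)^T S (p-v) terms
skin = (-10 * d.sum(3)).softmax(1)[:, :, :, None]  # H,J,N,1; sums to 1 over bones
```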
I ran the following code to print dis_norm. The output of (1) was not zero; does that mean that none of the outputs should be zero?

```python
dis_norm = (self.ctl_ts.view(opts.n_hypo,-1,1,3) - pred_v.view(2*local_batch_size,opts.n_hypo,-1,3)[0,:,None].detach()) # p-v, H,J,1,3 - H,1,N,3
print('(1)', dis_norm)
#dis_norm = dis_norm.matmul(kornia.quaternion_to_rotation_matrix(self.ctl_rs).view(opts.n_hypo,-1,3,3)) # h,j,n,3
from pytorch3d import transforms
ctl_rs = torch.cat([self.ctl_rs[:,3:4], self.ctl_rs[:,:3]], -1)
dis_norm = dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3)) # h,j,n,3
print('(2)', dis_norm)
dis_norm = self.log_ctl.exp().view(opts.n_hypo,-1,1,3) * dis_norm.pow(2) # (p-v)^T S (p-v)
print('(3)', dis_norm)
```
The output of dis_norm should not be zero in any case. Does the output of transforms.quaternion_to_matrix(ctl_rs) look reasonable? It should be a stack of 3x3 identity matrices.
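For reference, the identity quaternion (w, x, y, z) = (1, 0, 0, 0) should map to the 3x3 identity. A minimal, self-contained version of the textbook conversion (not pytorch3d's implementation) can be used to cross-check:

```python
import torch

def quat_to_matrix(q):
    """Convert a unit quaternion in (w, x, y, z) order to a 3x3 rotation matrix."""
    w, x, y, z = q.tolist()
    return torch.tensor([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

R = quat_to_matrix(torch.tensor([1.0, 0.0, 0.0, 0.0]))  # identity quaternion
```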
The outputs of (2) and (3) are both zero. The output continues to look like the following.

```
tensor([[[[0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.],
          ...,
          [0., 0., 0.],
          [0., 0., 0.],
          [0., 0., 0.]],
```

transforms.quaternion_to_matrix(ctl_rs) gives the following output, which I think is reasonable.

```
[[1., 0., 0.],
 [0., 1., 0.],
 [0., 0., 1.]],
```
The following code was used to save the output of (2) to a json file. It turned out that zeros were output. https://drive.google.com/file/d/1LqO9NTBZGmc9Qt0jppJqDVbkXc9JxSmb/view?usp=sharing

```python
from pytorch3d import transforms
ctl_rs = torch.cat([self.ctl_rs[:,3:4], self.ctl_rs[:,:3]], -1)
dis_norm = dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3)) # h,j,n,3

import json
data_save = {}
data_save['dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3))'] = \
    dis_norm.matmul(transforms.quaternion_to_matrix(ctl_rs).view(opts.n_hypo,-1,3,3)).detach().cpu().numpy().tolist() # mahalanobis distance
with open('data_2.json', 'w') as f:
    json.dump(data_save, f)
```
However, when I checked by saving dis_norm to a json file with the following code, it was not zero. https://drive.google.com/file/d/1yXH6COK_wgcOjejPtXe3rYXob-XRqRwU/view?usp=sharing

```python
dis_norm = self.log_ctl.exp().view(opts.n_hypo,-1,1,3) * dis_norm.pow(2) # (p-v)^T S (p-v)

import json
data_save = {}
data_save['dis_norm'] = dis_norm.detach().cpu().numpy().tolist()
with open('data_3.json', 'w') as f:
    json.dump(data_save, f)
```
OK, multiplying by an identity matrix should not change the values from nonzero to zero. Can you verify this? Something went wrong going from (1) to (2).
The dimension of dis_norm should be HxBxNx3 before multiplication with the HxBx3x3 rotation matrix transforms.quaternion_to_matrix(ctl_rs). Maybe you could verify this by doing the matrix multiplication at the element level.
```python
rotmat = transforms.quaternion_to_matrix(ctl_rs)
print(dis_norm[0,0].matmul(rotmat[0,0]))
```
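As a sanity check of the operation itself (with stand-in tensors, not the real dis_norm), multiplying by the identity should be a no-op:

```python
import torch

x = torch.randn(4, 3)  # stand-in for one hypothesis/bone slice of dis_norm
I = torch.eye(3)       # what quaternion_to_matrix gives for the identity rotation
y = x.matmul(I)        # identity rotation must leave the values unchanged
```

If the real computation still produces zeros here, the shapes going into matmul are the first thing to suspect.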
Excuse me for asking again and again.
After running "bash scripts/spot3.sh", the terminal goes into pdb mode. What should I enter here?