loss is nan when training

qinghuannn commented 1 year ago

Hi, thanks for your nice work! The loss becomes NaN when I set the batch size to 16 and train the model on 8 RTX 2080 Ti GPUs with the command python train.py. How can I solve this problem?

jihoonerd commented 1 year ago

@qinghuannn Thanks for your interest in this work! Could you let me know the configs used for training and the versions of your installed packages?

qinghuannn commented 1 year ago

> @qinghuannn Thanks for your interest in this work! Could you let me know the configs used for training and the versions of your installed packages?

I train the model with the command python train.py and do not pass any other parameters, so the default config configs/train.yml appears to be used. The versions of the installed packages are listed at the end of this comment. Could incorrect package versions cause this problem?

I also get an overflow when I preprocess the data with python scripts/prepare_humanml3d.py. Does this matter?

Package                 Version            Editable project location
----------------------- ------------------ --------------------------------------------------
absl-py                 1.3.0
aiohttp                 3.8.3
aiosignal               1.3.1
alembic                 1.9.0
antlr4-python3-runtime  4.9.3
async-timeout           4.0.2
attrs                   22.1.0
autopage                0.5.1
black                   22.12.0
body-visualizer         1.1.0
brotlipy                0.7.0
cachetools              5.2.0
certifi                 2022.12.7
cffi                    1.15.0
cfgv                    3.3.1
charset-normalizer      2.1.1
click                   8.1.3
cliff                   4.1.0
cmaes                   0.9.0
cmd2                    2.4.2
colorlog                6.7.0
colour                  0.1.5
commonmark              0.9.1
configparser            5.3.0
contourpy               1.0.6
cryptography            37.0.2
cycler                  0.11.0
decorator               4.4.2
distlib                 0.3.6
dotmap                  1.3.30
exceptiongroup          1.0.4
fastjsonschema          2.16.2
filelock                3.8.2
flake8                  6.0.0
fonttools               4.38.0
freetype-py             2.3.0
frozenlist              1.3.3
fsspec                  2022.11.0
ftfy                    6.1.1
fvcore                  0.1.5.post20221213
google-auth             2.15.0
google-auth-oauthlib    0.4.6
greenlet                2.0.1
grpcio                  1.51.1
huggingface-hub         0.11.1
human-body-prior        2.2.2.0            /home/xxx/workspace/tools/human_body_prior/src
hydra-colorlog          1.2.0
hydra-core              1.3.0
hydra-optuna-sweeper    1.2.0
identify                2.5.10
idna                    3.4
imageio                 2.23.0
imageio-ffmpeg          0.4.7
importlib-metadata      5.2.0
importlib-resources     5.10.1
iniconfig               1.1.1
iopath                  0.1.10
isort                   5.11.3
jedi                    0.18.2
jsonschema              4.17.3
jupyter_core            5.1.0
kiwisolver              1.4.4
lightning-utilities     0.4.2
loguru                  0.6.0
Mako                    1.2.4
Markdown                3.4.1
MarkupSafe              2.1.1
matplotlib              3.2.2
mccabe                  0.7.0
mkl-fft                 1.3.1
mkl-random              1.2.2
mkl-service             2.4.0
moviepy                 1.0.3
multidict               6.0.3
mypy-extensions         0.4.3
nbformat                5.7.0
nbstripout              0.6.1
networkx                2.8.8
nodeenv                 1.7.0
numpy                   1.24.0
oauthlib                3.2.2
olefile                 0.46
omegaconf               2.3.0
opencv-python           4.5.1.48
optuna                  2.10.1
packaging               22.0
pandas                  1.5.2
parso                   0.8.3
pathspec                0.10.3
pbr                     5.11.0
Pillow                  9.3.0
pip                     22.3.1
pkgutil_resolve_name    1.3.10
platformdirs            2.6.0
pluggy                  1.0.0
portalocker             2.6.0
pre-commit              2.20.0
prettytable             3.5.0
proglog                 0.1.10
protobuf                3.20.1
psbody-mesh             0.4
pudb                    2022.1.3
pyasn1                  0.4.8
pyasn1-modules          0.2.8
pycodestyle             2.10.0
pycparser               2.21
pyflakes                3.0.1
pyglet                  2.0.2.1
Pygments                2.13.0
PyOpenGL                3.1.0
PyOpenGL-accelerate     3.1.5
pyOpenSSL               22.0.0
pyparsing               3.0.9
pyperclip               1.8.2
pyrender                0.1.43
pyrsistent              0.19.2
PySocks                 1.7.1
pytest                  7.2.0
python-dateutil         2.8.2
python-dotenv           0.21.0
pytorch-lightning       1.8.5.post0
pytorch3d               0.7.2
pytz                    2022.7
PyYAML                  6.0
pyzmq                   24.0.1
regex                   2022.10.31
requests                2.28.1
requests-oauthlib       1.3.1
rich                    12.6.0
rsa                     4.9
scipy                   1.9.3
setuptools              65.6.3
sh                      1.14.3
six                     1.16.0
SQLAlchemy              1.4.45
stevedore               4.1.1
tabulate                0.9.0
tensorboard             2.11.0
tensorboard-data-server 0.6.1
tensorboard-plugin-wit  1.8.1
tensorboardX            2.5.1
termcolor               2.1.1
tokenizers              0.12.1
toml                    0.10.2
tomli                   2.0.1
torch                   1.12.0
torchaudio              0.12.0
torchmetrics            0.11.0
torchvision             0.13.0
tqdm                    4.64.1
traitlets               5.7.1
transformers            4.21.1
transforms3d            0.3.1
trimesh                 3.9.5
typing_extensions       4.4.0
urllib3                 1.26.13
urwid                   2.1.2
urwid-readline          0.13
virtualenv              20.17.1
wcwidth                 0.2.5
Werkzeug                2.2.2
wheel                   0.37.1
yacs                    0.1.8
yarl                    1.8.2
zipp                    3.11.0

qinghuannn commented 1 year ago

While tracking down the overflow in preprocessing, I found that the sample P04G01R03F0343T0437A0501.npy from the HumanAct12 data you provide causes this error, as shown in the following code.

>>> import numpy as np
>>> data = np.load('humanact12_processed.pkl', allow_pickle=True)
>>> tmp = data['P04G01R03F0343T0437A0501.npy']
>>> tmp['joints3D'].min(), tmp['joints3D'].max()
(-0.806142492005127, 4.837814713475272e+276)
>>> tmp['joints3D'].mean(), tmp['joints3D'].std()
(7.07282852847262e+272, inf)
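
The inf standard deviation is consistent with float64 overflow: every value can be finite, yet squaring a deviation on the order of 1e276 exceeds the float64 maximum (about 1.8e308). A minimal sketch of that effect, using made-up values rather than the actual sample:

```python
import numpy as np

# Synthetic stand-in for a corrupted joints array (illustrative values only,
# not taken from the dataset).
joints = np.array([-0.8, 0.1, 0.5, 4.8e276], dtype=np.float64)

# std() squares the deviations from the mean; suppress the overflow warning.
with np.errstate(over="ignore"):
    std = joints.std()

print(np.isfinite(joints).all())  # all entries are finite float64 values
print(np.isinf(std))              # the squared deviations overflowed to inf
```

This is why a plain NaN/inf check on the raw values does not catch the sample; a magnitude check is needed as well.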

jihoonerd commented 1 year ago

@qinghuannn Thanks for finding this! I think it would be good to exclude samples with abnormally large values here. I will update the code soon to make it fail-safe.

jihoonerd commented 1 year ago

@qinghuannn However, some long motion clips (especially in HumanAct12) may still consume a large amount of memory, which can cause a runtime overflow error unless your machine has plenty of memory.

qinghuannn commented 1 year ago

@jihoonerd I'm sure it is not caused by limited memory, since I run all the code on a machine with plenty of memory (256 GB RAM). After excluding the abnormal samples P07G01R02F0401T0607A0201.npy, P10G01R01F1418T1500A0604.npy and P04G01R03F0343T0437A0501.npy, the training loss looks normal.
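
One way to find such samples automatically is to scan every entry for non-finite or implausibly large joint values. The sketch below is only an illustration: the helper name find_abnormal_samples and the max_abs bound are assumptions, not part of the repository, and the data layout (file name mapped to a dict with a joints3D array) is inferred from the snippet earlier in this thread.

```python
import numpy as np

def find_abnormal_samples(data, key="joints3D", max_abs=1e3):
    """Return names of samples whose `key` array is non-finite or too large.

    `max_abs` is an assumed sanity bound, not a value from the repository.
    """
    bad = []
    for name, sample in data.items():
        arr = np.asarray(sample[key], dtype=np.float64)
        if arr.size == 0:
            continue  # nothing to check in an empty array
        if not np.isfinite(arr).all() or np.abs(arr).max() > max_abs:
            bad.append(name)
    return bad

# Tiny synthetic dataset mimicking the {file name: {"joints3D": array}} layout.
demo = {
    "ok.npy": {"joints3D": np.zeros((2, 24, 3))},
    "huge.npy": {"joints3D": np.full((2, 24, 3), 4.8e276)},
    "nan.npy": {"joints3D": np.full((2, 24, 3), np.nan)},
}
bad = find_abnormal_samples(demo)
clean = {k: v for k, v in demo.items() if k not in bad}
print(bad)
```

On the real file, the same function could be applied to the object returned by np.load('humanact12_processed.pkl', allow_pickle=True), with max_abs tuned to the expected coordinate range.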

qinghuannn commented 1 year ago

@jihoonerd The test pipeline code has many bugs. In eval_util.py, some non-existent keys are accessed, such as meta["gt_translation"], meta["clip_score_norm"], meta["mm_distance_norm"], and so on.
I hope the author can check and update it. Thank you very much!

kakaobrain / flame

loss is nan when training #2