txie-93 / cdvae

An SE(3)-invariant autoencoder for generating the periodic structure of materials [ICLR 2022]
MIT License

Code Hanging At Start of Training #46

Open AseemGill opened 11 months ago

AseemGill commented 11 months ago

Hi, I am running the CDVAE carbon experiment and have run into a strange problem: the code just hangs after completing three iterations of the first epoch.

I run **python cdvae/run.py data=carbon expname=carbon model.predict_property=True**

The output I see is this:

[2023-07-13 16:57:36,190][hydra.utils][INFO] - Instantiating <cdvae.pl_data.datamodule.CrystDataModule>
[2023-07-13 16:57:37,161][numexpr.utils][INFO] - Note: detected 128 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
[2023-07-13 16:57:37,161][numexpr.utils][INFO] - Note: NumExpr detected 128 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
 25%|█████████████████████▍                                                                | 1521/6091 [00:25<01:29, 50.81it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 34%|█████████████████████████████▎                                                        | 2080/6091 [00:34<01:05, 61.51it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 46%|███████████████████████████████████████▊                                              | 2820/6091 [00:46<01:02, 52.70it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 50%|██████████████████████████████████████████▋                                           | 3021/6091 [00:49<00:52, 58.26it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 51%|███████████████████████████████████████████▍                                          | 3079/6091 [00:50<00:54, 55.41it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 51%|████████████████████████████████████████████▏                                         | 3132/6091 [00:51<00:40, 72.78it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 52%|████████████████████████████████████████████▎                                         | 3140/6091 [00:51<00:49, 59.77it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 60%|███████████████████████████████████████████████████▊                                  | 3673/6091 [00:59<00:38, 63.39it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 66%|████████████████████████████████████████████████████████▋                             | 4018/6091 [01:05<00:32, 63.75it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 67%|█████████████████████████████████████████████████████████▌                            | 4077/6091 [01:06<00:33, 60.74it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 67%|█████████████████████████████████████████████████████████▊                            | 4098/6091 [01:06<00:29, 67.92it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 69%|███████████████████████████████████████████████████████████▊                          | 4233/6091 [01:08<00:29, 63.53it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 74%|████████████████████████████████████████████████████████████████                      | 4536/6091 [01:13<00:23, 67.50it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 80%|████████████████████████████████████████████████████████████████████▋                 | 4869/6091 [01:18<00:16, 72.17it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 84%|████████████████████████████████████████████████████████████████████████              | 5106/6091 [01:22<00:18, 53.65it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 91%|██████████████████████████████████████████████████████████████████████████████▌       | 5566/6091 [01:29<00:08, 63.96it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 95%|█████████████████████████████████████████████████████████████████████████████████▋    | 5786/6091 [01:33<00:05, 59.95it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 96%|██████████████████████████████████████████████████████████████████████████████████▏   | 5822/6091 [01:33<00:04, 64.72it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
 98%|████████████████████████████████████████████████████████████████████████████████████▎ | 5974/6091 [01:36<00:01, 66.91it/s]/home/.conda/envs/cdvae/lib/python3.8/site-packages/pymatgen/io/cif.py:1120: UserWarning: Issues encountered while parsing CIF: Some fractional coordinates rounded to ideal values to avoid issues with finite precision.
  warnings.warn("Issues encountered while parsing CIF: " + "\n".join(self.warnings))
100%|██████████████████████████████████████████████████████████████████████████████████████| 6091/6091 [01:39<00:00, 61.48it/s]
/gpfs/fs1/home/cdvae-old/cdvae/cdvae/common/data_utils.py:644: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at  /data/miniconda3/envs/opence-1.7/conda-bld/pytorch-base_1663986328871/work/torch/csrc/utils/tensor_new.cpp:201.)
  targets = torch.tensor([d[key] for d in data_list])
/gpfs/fs1/home/cdvae-old/cdvae/cdvae/common/data_utils.py:612: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  X = torch.tensor(X, dtype=torch.float)
[2023-07-13 16:59:20,540][hydra.utils][INFO] - Instantiating <cdvae.pl_modules.model.CDVAE>
[2023-07-13 16:59:20,615][torch.distributed.nn.jit.instantiator][INFO] - Created a temporary directory at /tmp/tmpwv1glt9u
[2023-07-13 16:59:20,615][torch.distributed.nn.jit.instantiator][INFO] - Writing /tmp/tmpwv1glt9u/_remote_module_non_scriptable.py
[2023-07-13 16:59:53,346][hydra.utils][INFO] - Passing scaler from datamodule to model <StandardScalerTorch(means: -154.2510223388672, stds: 0.13738815486431122)>
[2023-07-13 16:59:53,348][hydra.utils][INFO] - Adding callback <LearningRateMonitor>
[2023-07-13 16:59:53,349][hydra.utils][INFO] - Adding callback <EarlyStopping>
[2023-07-13 16:59:53,350][hydra.utils][INFO] - Adding callback <ModelCheckpoint>
[2023-07-13 16:59:53,354][hydra.utils][INFO] - Instantiating <WandbLogger>
wandb: Currently logged in as: _. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.15.5
wandb: Run data is saved locally in /home/cdvae-old/cdvae/wabdb/wandb/run-20230713_165954-u04zv43g
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run carbon
wandb: ⭐️ View project at https://wandb.ai/_/crystal_generation_mit
wandb: 🚀 View run at https://wandb.ai/_/crystal_generation_mit/runs/u04zv43g
[2023-07-13 17:00:07,550][hydra.utils][INFO] - W&B is now watching <{cfg.logging.wandb_watch.log}>!
wandb: logging graph, to disable use `wandb.watch(log_graph=False)`
[2023-07-13 17:00:07,588][hydra.utils][INFO] - Instantiating the Trainer
/home/.conda/envs/cdvae/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/callback_connector.py:96: LightningDeprecationWarning: Setting `Trainer(progress_bar_refresh_rate=20)` is deprecated in v1.5 and will be removedin v1.7. Please pass `pytorch_lightning.callbacks.progress.TQDMProgressBar` with `refresh_rate` directly to the Trainer's `callbacks` argument instead. Or, to disable the progress bar pass `enable_progress_bar = False` to the Trainer.
  rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
[2023-07-13 17:00:07,650][hydra.utils][INFO] - Starting training!
  0%|                                                                                         | 2/6091 [00:00<32:19,  3.14it/s

I am running on MIST HPC, so I have turned off WandB logging.

Environment

Package                  Version           Editable project location
------------------------ ----------------- --------------------------------------------------
absl-py                  1.4.0
aiofiles                 22.1.0
aiohttp                  3.8.4
aiosignal                1.3.1
aiosqlite                0.18.0
altair                   5.0.1
antlr4-python3-runtime   4.8
anyio                    3.5.0
appdirs                  1.4.4
argon2-cffi              21.3.0
argon2-cffi-bindings     21.2.0
ase                      3.22.0
astor                    0.8.1
astroid                  2.14.2
asttokens                2.0.5
async-timeout            4.0.2
attrs                    22.1.0
autopep8                 2.0.2
av                       9.2.0
Babel                    2.11.0
backcall                 0.2.0
backports.zoneinfo       0.2.1
base58                   2.1.1
beautifulsoup4           4.12.2
bleach                   4.1.0
blinker                  1.6.2
Bottleneck               1.3.5
brotlipy                 0.7.0
cachetools               5.3.1
cdvae                    0.0.1             
certifi                  2023.5.7
cffi                     1.15.1
charset-normalizer       2.0.4
click                    8.0.4
colorama                 0.4.6
comm                     0.1.2
configparser             6.0.0
contourpy                1.0.5
coverage                 7.2.2
cryptography             39.0.1
cycler                   0.11.0
debugpy                  1.5.1
decorator                5.1.1
defusedxml               0.7.1
dill                     0.3.6
distlib                  0.3.6
dnspython                2.3.0
docker-pycreds           0.4.0
emmet-core               0.60.1
entrypoints              0.4
exceptiongroup           1.0.4
executing                0.8.3
fastjsonschema           2.16.2
filelock                 3.12.0
fonttools                4.25.0
frozenlist               1.3.3
fsspec                   2023.4.0
future                   0.18.3
gitdb                    4.0.10
GitPython                3.1.32
google-auth              2.22.0
google-auth-oauthlib     1.0.0
googledrivedownloader    0.4
grpcio                   1.48.2
higher                   0.2.1
html5lib                 1.1
hydra-core               1.1.0
hydra-joblib-launcher    1.1.5
idna                     3.4
importlib-metadata       6.0.0
importlib-resources      5.12.0
iniconfig                1.1.1
ipykernel                6.19.2
ipython                  8.12.0
ipython-genutils         0.2.0
ipywidgets               8.0.4
isodate                  0.6.1
isort                    5.9.3
jedi                     0.18.1
Jinja2                   3.1.2
joblib                   1.2.0
json5                    0.9.6
jsonschema               4.17.3
jupyter_client           8.1.0
jupyter_core             5.3.0
jupyter-events           0.6.3
jupyter_server           2.5.0
jupyter_server_fileid    0.9.0
jupyter_server_terminals 0.4.4
jupyter_server_ydoc      0.8.0
jupyter-ydoc             0.2.4
jupyterlab               3.6.3
jupyterlab-pygments      0.1.2
jupyterlab_server        2.22.0
jupyterlab-widgets       3.0.5
kiwisolver               1.4.4
latexcodec               2.0.1
lazy-object-proxy        1.6.0
lightning-utilities      0.7.1
lxml                     4.9.2
Markdown                 3.4.3
MarkupSafe               2.1.1
matminer                 0.7.3
matplotlib               3.7.1
matplotlib-inline        0.1.6
mccabe                   0.7.0
mistune                  0.8.4
monty                    2023.5.8
mp-api                   0.33.3
mpmath                   1.3.0
msgpack                  1.0.5
multidict                6.0.4
multiprocess             0.70.14
munkres                  1.1.4
nbclassic                0.5.5
nbclient                 0.5.13
nbconvert                6.5.4
nbformat                 5.7.0
nest-asyncio             1.5.6
networkx                 2.8.4
nglview                  3.0.6
notebook                 6.5.4
notebook_shim            0.2.2
numexpr                  2.8.4
numpy                    1.23.5
oauthlib                 3.2.2
omegaconf                2.1.2
p-tqdm                   1.3.3
packaging                23.0
palettable               3.3.3
pandas                   1.5.3
pandocfilters            1.5.0
parso                    0.8.3
pathos                   0.3.0
pathtools                0.1.2
pexpect                  4.8.0
pickleshare              0.7.5
Pillow                   9.4.0
Pint                     0.21.1
pip                      23.1.2
pkgutil_resolve_name     1.3.10
platformdirs             3.2.0
plotly                   5.15.0
pluggy                   1.0.0
pox                      0.3.2
ppft                     1.7.6.6
prometheus-client        0.14.1
promise                  2.3
prompt-toolkit           3.0.36
protobuf                 3.19.6
psutil                   5.9.0
ptyprocess               0.7.0
pure-eval                0.2.2
py                       1.11.0
pyarrow                  8.0.0
pyasn1                   0.5.0
pyasn1-modules           0.3.0
pybtex                   0.24.0
pycodestyle              2.10.0
pycparser                2.21
pydantic                 1.10.11
pydeck                   0.8.1b0
pyDeprecate              0.3.1
pyg-nightly              2.4.0.dev20230711
Pygments                 2.15.1
pylint                   2.16.2
pymatgen                 2023.7.11
pymongo                  4.4.0
pyOpenSSL                23.0.0
pyparsing                3.0.9
pyrsistent               0.18.0
PySocks                  1.7.1
pytest                   7.3.1
pytest-cov               4.0.0
python-dateutil          2.8.2
python-dotenv            1.0.0
python-json-logger       2.0.7
python-louvain           0.15
pytorch-lightning        1.6.5
pytz                     2022.7
PyYAML                   5.4.1
pyzmq                    25.1.0
rdflib                   6.1.1
requests                 2.29.0
requests-oauthlib        1.3.1
rfc3339-validator        0.1.4
rfc3986-validator        0.1.1
rsa                      4.9
ruamel.yaml              0.17.32
ruamel.yaml.clib         0.2.7
scikit-learn             1.2.2
scipy                    1.8.1
Send2Trash               1.8.0
sentencepiece            0.1.96
sentry-sdk               1.28.0
setproctitle             1.3.2
setuptools               67.8.0
shortuuid                1.0.11
six                      1.16.0
SMACT                    2.2.1
smmap                    5.0.0
sniffio                  1.2.0
soupsieve                2.4
spglib                   2.0.2
stack-data               0.2.0
streamlit                0.79.0
subprocess32             3.5.4
sympy                    1.12
tabulate                 0.8.10
tenacity                 8.2.2
tensorboard              2.13.0
tensorboard-data-server  0.7.1
terminado                0.17.1
threadpoolctl            2.2.0
tinycss2                 1.2.1
toml                     0.10.2
tomli                    2.0.1
tomlkit                  0.11.1
toolz                    0.12.0
torch                    1.12.1
torch-cluster            1.6.1
torch-geometric          1.7.2
torch-scatter            2.0.8
torch-sparse             0.6.10
torch-spline-conv        1.2.2
torchdiffeq              0.0.1
torchmetrics             1.0.0
torchtext                0.13.1a0+35066f2
torchvision              0.13.1
tornado                  6.2
tqdm                     4.65.0
traitlets                5.7.1
typing_extensions        4.6.3
tzlocal                  5.0.1
uncertainties            3.1.7
urllib3                  1.26.16
validators               0.20.0
virtualenv               20.22.0
wandb                    0.15.5
watchdog                 3.0.0
wcwidth                  0.2.5
webencodings             0.5.1
websocket-client         0.58.0
Werkzeug                 2.3.6
wheel                    0.38.4
widgetsnbextension       4.0.5
wrapt                    1.14.1
y-py                     0.5.9
yacs                     0.1.6
yarl                     1.9.2
ypy-websocket            0.8.2
zipp                     3.11.0

Any suggestions on how to resolve this? I am not very familiar with Hydra and PyTorch Lightning.
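Since the log above stops without any traceback, one generic way to see where the process is stuck is Python's standard `faulthandler` module, which can dump the stack of every thread once a timeout expires. The snippet below is only a minimal sketch and not part of CDVAE: `hang_watchdog` and `fake_training_step` are hypothetical names, and the timeout is an arbitrary assumption.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s=300.0, file=None):
    """Dump all thread stack traces if the wrapped block runs longer than timeout_s.

    Hypothetical helper for illustration; `file` must be a real file object
    (one with a fileno()), and defaults to sys.stderr.
    """
    target = file if file is not None else sys.stderr
    # repeat=True prints the traces again every timeout_s seconds while stuck.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        yield
    finally:
        # Always disarm the timer so nothing is printed after a normal exit.
        faulthandler.cancel_dump_traceback_later()


def fake_training_step():
    """Stand-in for the call that hangs (for example, the trainer's fit call)."""
    return sum(range(1000))


# Usage sketch:
#     with hang_watchdog(timeout_s=60.0):
#         fake_training_step()
```

If the wrapped call finishes before the timeout, nothing is printed. If it hangs, the dumped traces show which line each thread is waiting on, which may help tell a stuck data-loading worker apart from a stuck GPU call or logger.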