Closed gshs12051 closed 1 year ago
Hi @gshs12051 ,
Thanks for your interest in our work!
What is your PyTorch version? This looks like a familiar bug in PyTorch that should be resolved by updating to the latest supported stable version (1.11).
Thanks. I was using PyTorch 1.10, and after updating to 1.11 the problem was solved. I have two more questions. The first concerns MD simulation with MPI. LAMMPS did not proceed past this stage: mpirun -np 8 lmp -sf omp -pk omp 4 -in in.lammps
run 10
No /omp style for force computation currently active
It works well in the case of mpirun -np 4 lmp -sf omp -pk omp 8 -in in.lammps, so I am wondering if there is a specific limit on the MPI processor grid size. Also, the MD simulation sometimes ends with the error below:
Unit style : metal
Current step : 0
Time step : 0.0005
Per MPI rank memory allocation (min/avg/max) = 11.64 | 11.64 | 11.64 Mbytes
Step Temp TotEng PotEng Press Volume S/CPU CPULeft
0 1000 -502.17886 -517.56082 4733.1234 3471.2258 0 0
10 1037.6838 -502.17892 -518.14053 4911.4855 3471.2258 0.62056358 96670.196
20 1149.0517 -502.18269 -519.85735 5438.6038 3471.2258 3.3797967 57200.356
30 1366.1239 -502.20265 -523.21631 6466.0332 3471.2258 3.3869016 44029.363
40 1706.0198 -502.27363 -528.51555 8074.8025 3471.2258 3.3691646 37465.69
50 2092.37 -502.46846 -534.6532 9903.4456 3471.2258 0.80146 44927.751
60 2388.6437 -502.88855 -539.63056 11305.746 3471.2258 0.49348786 57677.206
70 2591.1369 -503.60771 -543.46446 12264.171 3471.2258 0.49194526 66832.571
80 2867.5918 -504.70262 -548.81179 13572.666 3471.2258 0.54046821 72327.097
90 3162.0488 -506.21135 -554.84985 14966.367 3471.2258 0.47370461 78332.381
100 3463.3768 -508.07882 -561.35234 16392.59 3471.2258 0.43357856 84302.633
110 3783.0973 -510.20537 -568.39681 17905.867 3471.2258 0.4624285 88399.774
120 4040.5194 -512.46371 -574.61481 19124.277 3471.2258 0.4751138 91522.343
130 3916.8145 -468.80556 -529.05384 18538.766 3471.2258 0.56733556 92585.622
140 4160.4834 -471.28922 -535.2856 19692.081 3471.2258 0.48887372 94704.054
150 5138.8348 -472.80773 -551.85307 24322.739 3471.2258 0.47169693 96834.505
160 5735.4544 -477.15941 -565.38192 27146.614 3471.2258 0.47489075 98642.676
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 39872 RUNNING AT n020
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
Intel(R) MPI Library troubleshooting guide:
https://software.intel.com/node/561764
===================================================================================
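One thing worth checking (my assumption, not something confirmed by the log): with -np 8 and -pk omp 4, the job asks for 8 × 4 = 32 hardware threads, whereas -np 4 with -pk omp 8 asks for the same 32 but with a different rank/thread split. If the node has fewer cores than ranks × threads, the run is oversubscribed, which can cause hangs or crashes. A minimal sketch of that arithmetic check:

```python
import os

def check_oversubscription(mpi_ranks: int, omp_threads: int) -> bool:
    """Return True if ranks * threads exceeds the cores visible on this node.

    Hypothetical helper, not part of LAMMPS: it only mirrors the
    ranks-times-threads rule of thumb for hybrid MPI/OpenMP runs.
    """
    total = mpi_ranks * omp_threads
    cores = os.cpu_count() or 1
    print(f"requested {total} threads ({mpi_ranks} ranks x {omp_threads} threads), "
          f"{cores} cores visible")
    return total > cores

# The two invocations from the question request the same total thread count:
assert 8 * 4 == 4 * 8 == 32
```

If both invocations request 32 threads but only one fails, the cause is more likely the MPI domain decomposition (8 ranks splits the box differently than 4) than raw oversubscription, but the check above rules out the simpler explanation first.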
My next question is about training. I tried to use a training set containing multiple cell sizes (for example, some configurations with 120 atoms and some with 60 atoms). The training then ended with the errors below.
instantiate NpzDataset
optional_args : key_mapping
optional_args : npz_fixed_field_keys
optional_args : root
optional_args : extra_fixed_fields <- dataset_extra_fixed_fields
optional_args : file_name <- dataset_file_name
...NpzDataset_param = dict(
... optional_args = {'key_mapping': {'z': 'atomic_numbers', 'E': 'total_energy', 'F': 'forces', 'R': 'pos'}, 'include_keys': [], 'npz_fixed_field_keys': ['atomic_numbers'], 'file_name': './train_set.npz', 'url': None, 'force_fixed_keys': [], 'extra_fixed_fields': {'r_max': 4.0}, 'include_frames': None, 'root': 'results/GeSe2'},
... positional_args = {'type_mapper': <nequip.data.transforms.TypeMapper object at 0x2b9f505d7490>})
Traceback (most recent call last):
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 232, in instantiate
instance = builder(**positional_args, **final_optional_args)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 681, in __init__
super().__init__(
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 123, in __init__
super().__init__(root=root, transform=type_mapper)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 90, in __init__
self._process()
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/torch_geometric/dataset.py", line 175, in _process
self.process()
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 269, in process
data_list = [
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/dataset.py", line 270, in <listcomp>
constructor(**{**{f: v[i] for f, v in fields.items()}, **fixed_fields})
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 326, in from_points
return cls(edge_index=edge_index, pos=torch.as_tensor(pos), **kwargs)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 221, in __init__
_process_dict(kwargs)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/AtomicData.py", line 163, in _process_dict
raise ValueError(
ValueError: atomic_numbers is a node field but has the wrong dimension torch.Size([72, 1])
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gshs12051/anaconda3/envs/pytorch/bin/nequip-train", line 8, in <module>
sys.exit(main())
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 74, in main
trainer = fresh_start(config)
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/scripts/train.py", line 177, in fresh_start
dataset = dataset_from_config(config, prefix="dataset")
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/data/_build.py", line 78, in dataset_from_config
instance, _ = instantiate(
File "/home/gshs12051/anaconda3/envs/pytorch/lib/python3.8/site-packages/nequip/utils/auto_init.py", line 234, in instantiate
raise RuntimeError(
RuntimeError: Failed to build object with prefix `dataset` using builder `NpzDataset`
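For what it is worth, the ValueError points at the shape of atomic_numbers: the message says a node field has dimension torch.Size([72, 1]), i.e. it was stored as a column vector, while a fixed node field is expected to be one-dimensional with shape (n_atoms,). A minimal sketch of the shape fix when building the .npz (variable names here are mine, not from the traceback):

```python
import numpy as np

# Hypothetical illustration of the shape mismatch reported in the traceback:
# a fixed node field such as atomic_numbers should be 1-D, shape (n_atoms,).
# A column vector of shape (n_atoms, 1) triggers
# "atomic_numbers is a node field but has the wrong dimension".
n_atoms = 72
atomic_numbers = np.full((n_atoms, 1), 32)   # bad: shape (72, 1)
atomic_numbers = atomic_numbers.reshape(-1)  # good: shape (72,)
print(atomic_numbers.shape)
```

Note also that the arrays inside a single .npz are rectangular, so positions for 60-atom and 120-atom frames cannot share one array in the first place; if I read the nequip documentation correctly, an ASE-readable format such as extxyz is the usual route for datasets with a variable number of atoms per frame.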
Hi @gshs12051 ,
Great, glad it resolved your issue!
Could you please open a new issue on pair_allegro (this repo) for the MPI question, and a separate issue on the nequip repo for the training issue? This helps keep information searchable and organized for future users.
Thanks!
I trained an Allegro model on a GeSe system and deployed it as a LAMMPS pair potential following the documented steps. I then tried to run an MD simulation using the LAMMPS input below.
LAMMPS works well in the case of "run 1", but with more than one step (e.g. "run 2" or more) LAMMPS terminates with the errors below.
In the case of "run 1", LAMMPS ended successfully with the output below.