@vsumaria were you able to solve the problem? Does the descriptor work fine with serial computation or on other machines?
I had to reduce the number of cores over which I was running the job and reduce the training set size significantly to be able to run without memory issues.
Memory can become an issue when the feature array is large and the allowed memory is insufficient. For bispectrum coefficients, the amount of training data, the twojmax parameter, and whether the linear or quadratic form is used all affect the feature dimension. However, 7500 structures should be fine.
Have you tried running the calculation with n_jobs = 1 and still failed with 7500 structures?
Have you tried a machine with more available memory, and does that make the issue go away?
Above all, this issue should resolve once sufficient memory is available, and 7500 structures should not be a challenge. If you still cannot use the full 7500-structure set for your training, you may provide us with your training set as well as the training script. We will look into it and see if we can reproduce the error.
I agree. I had to move on from this particular project, so I could not try running the calculation with n_jobs = 1, but I think it would have worked too.
Is there a way to estimate the memory before generating the descriptors (which with n_jobs = 1 would take quite some time), so that a training run can be planned accordingly?
From my knowledge, the largest memory consumption comes from the array of bispectrum coefficients of the structure list. For one structure with 60 atoms, we have (1 (energy) + 3 × 60 (forces) + 6 (stress)) = 187 rows of descriptors. The size of each descriptor depends on twojmax and on whether the SNAP is linear or quadratic. Here, if we assume twojmax = 6 and a linear (non-quadratic) SNAP, there are 187 × 31 = 5797 feature values for one structure. 7500 structures (assuming all have 60 atoms) gives 7500 × 5797 = 43,477,500 feature values. A float64 numpy array of this size consumes around 350 MB of memory. So I really don't think 7500 structures can cause a large memory consumption.
It could be that your bispectrum coefficients have a very high complexity, that the memory available on your machine is very limited, or that other processes are consuming part of it. Again, if you still cannot use the full 7500-structure set for your training, you may provide us with your training set as well as the training script. We will look into it and see if we can reproduce the error. 😁
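If it helps, here is a minimal back-of-the-envelope sketch of that estimate. The helper names are made up for illustration, and the column count assumes one coefficient block plus a bias term per element type, which may not match maml's exact layout, so treat the result as order-of-magnitude only.

def n_bispectrum_components(twojmax):
    # Closed-form count of bispectrum components per atom; valid for even twojmax
    # (gives 30 for twojmax=6, 55 for 8, 91 for 10).
    return (twojmax + 2) * (twojmax + 3) * (twojmax + 4) // 24

def estimate_feature_memory_gb(n_structures, atoms_per_structure, twojmax=6,
                               quadratic=False, n_types=1, include_stress=True):
    # Rough size (in GB) of the float64 descriptor array used for fitting.
    k = n_bispectrum_components(twojmax)
    if quadratic:
        k += k * (k + 1) // 2              # quadratic SNAP adds all pairwise products
    cols = n_types * (k + 1)               # assumed: one bias term per element type
    rows_per_structure = 1 + 3 * atoms_per_structure + (6 if include_stress else 0)
    return n_structures * rows_per_structure * cols * 8 / 1e9

# 7500 structures of 60 atoms, linear SNAP, twojmax=6, one element type: ~0.35 GB
print(estimate_feature_memory_gb(7500, 60))

Actual usage can be somewhat higher because of intermediate copies during the fit, but the descriptor array itself is the dominant term.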
Here is the code I am running right now, for ~1500 training structures with 108 atoms each.
from pymatgen.io.ase import AseAtomsAdaptor
import numpy as np
from maml.utils import pool_from, convert_docs
from maml.base import SKLModel
from maml.describers import BispectrumCoefficients
from sklearn.linear_model import LinearRegression
from maml.apps.pes import SNAPotential
from ase.io import read

# Collect energies, forces, and pymatgen structures from the ASE trajectory
train_energies = []
train_forces = []
train_structures = []
ase_adap = AseAtomsAdaptor()
images = read('train2.traj', ':')
for atoms in images:
    train_energies.append(atoms.get_potential_energy())
    train_forces.append(atoms.get_forces())
    atoms.set_pbc([1, 1, 1])
    train_structures.append(ase_adap.get_structure(atoms))

# Build the training pool and per-row weights (energies weighted far more than forces)
train_pool = pool_from(train_structures, train_energies, train_forces)
_, df = convert_docs(train_pool, include_stress=False)
weights = np.ones(len(df['dtype']), )
weights[df['dtype'] == 'force'] = 1
weights[df['dtype'] == 'energy'] = 100000

# Quadratic SNAP descriptor with twojmax=10 for a four-element system
element_profile = {'Cu': {'r': 5, 'w': 1}, 'Zr': {'r': 5, 'w': 1},
                   'Al': {'r': 5, 'w': 1}, 'Nb': {'r': 5, 'w': 1}}
describer = BispectrumCoefficients(rcutfac=1, twojmax=10, element_profile=element_profile,
                                   quadratic=True, pot_fit=True, include_stress=False,
                                   n_jobs=8, verbose=True)

# Fit a linear model on the bispectrum features and write out the SNAP parameter files
ml_model = LinearRegression()
skl_model = SKLModel(describer=describer, model=ml_model)
snap = SNAPotential(model=skl_model)
snap.train(train_structures, train_energies, train_forces, include_stress=False, sample_weight=weights)
snap.write_param()
I am mostly running into memory issues: after the descriptors have been calculated, the job just gets killed.
I see. From my understanding, quadratic=True and twojmax=10 significantly increase the model complexity. There will be more than 1000 feature values per descriptor row (the exact number also depends on the number of atomic types in your system), much larger than the 31 in the case I discussed above. This explains the memory error.
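For a rough sense of scale, here is the same back-of-the-envelope counting applied to the settings in your script (again assuming one coefficient block plus a bias per element type, so the figures are order-of-magnitude only):

# twojmax=10, quadratic=True, 4 element types, ~1500 structures of 108 atoms
k = (10 + 2) * (10 + 3) * (10 + 4) // 24   # 91 bispectrum components for twojmax=10
k_quad = k + k * (k + 1) // 2              # 4277 once the quadratic terms are added
cols = 4 * (k_quad + 1)                    # ~17,000 feature columns for four element types (assumed layout)
rows = 1500 * (1 + 3 * 108)                # energy + force rows, no stress
print(rows * cols * 8 / 1e9, "GB")         # tens of GB of float64 features

That is far more than the ~350 MB of the linear twojmax=6 case, which is consistent with the job being killed after the descriptors are built.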
I won't be able to make any straightforward suggestion in this case, but from my knowledge, SNAP potentials can already be accurate for many systems with quadratic=False and twojmax=6 or 8. You may consider decreasing your model complexity, for example as sketched below.
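As an illustration only (reusing the cutoffs and element profile from your script; you would still need to validate the accuracy for your system), a lower-complexity describer could look like this:

from maml.describers import BispectrumCoefficients

element_profile = {'Cu': {'r': 5, 'w': 1}, 'Zr': {'r': 5, 'w': 1},
                   'Al': {'r': 5, 'w': 1}, 'Nb': {'r': 5, 'w': 1}}

# Linear SNAP with twojmax=6 keeps the feature matrix orders of magnitude smaller
# than quadratic SNAP with twojmax=10, at the cost of some model flexibility.
describer = BispectrumCoefficients(rcutfac=1, twojmax=6, element_profile=element_profile,
                                   quadratic=False, pot_fit=True, include_stress=False,
                                   n_jobs=8, verbose=True)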
By the way, you may close the issue if you feel the problem is already resolved. Good luck!
I am trying to use the bispectrum-coefficients-based SNAP potential for my training with ~7500 structures, but I am running into a memory issue:
"Some of your processes may have been killed by the cgroup out-of-memory handler"
I am using parallel descriptor construction with the n_jobs tag.
Any advice on what I might be doing wrong?