openkim / kliff

KIM-based Learning-Integrated Fitting Framework for interatomic potentials.
https://kliff.readthedocs.io
GNU Lesser General Public License v2.1

Issue with .create() method when recreating SiC NN example #71

Open michaelmacisaac opened 2 years ago

michaelmacisaac commented 2 years ago

Good afternoon,

I am attempting to develop a Si potential and a C potential using some SiC VASP data I have collected. The data comes from ab initio molecular dynamics runs in which a SiC structure was held at 12 different temperatures for 500 time steps each, resulting in 6000 structures. I initially suspected that my .xyz files might not be in a usable form for the kliff functions, but my variable explorer shows that 6000 configs were created, so I assume the files are formatted correctly and my issue does not stem from differences in the data used. I am using a similar approach to the NN SiC example provided in the 'examples' folder. When I run my script, the code seems to stop at the line '_ = calc.create(configs, reuse=False)'. On first running the script, the following is output to the console:

runfile('/home/michaelmacisaac/blue_subhash/michaelmacisaac/SiC/SiC-3C/minimodel/scripts/kimkliffscripts/kimkliffmodel.py', wdir='/home/michaelmacisaac/blue_subhash/michaelmacisaac/SiC/SiC-3C/minimodel/scripts/kimkliffscripts')

2022-09-07 12:19:52.640 | INFO     | kliff.dataset.dataset:_read:397 - 6000 configurations read from /blue/subhash/michaelmacisaac/SiC/SiC-3C/minimodel/scripts/kimkliffscripts/xyz_files/temperature

2022-09-07 12:19:52.646 | INFO     | kliff.calculators.calculator_torch:_get_device:417 - Training on cpu

2022-09-07 12:19:52.649 | INFO     | kliff.descriptors.descriptor:generate_fingerprints:104 - Start computing mean and stdev of fingerprints.

After waiting a while, I stop my code, and the following is output to the console (I have started and stopped my code multiple times, and the output is the same):

Traceback (most recent call last):

  File "/home/michaelmacisaac/blue_subhash/michaelmacisaac/SiC/SiC-3C/minimodel/scripts/kimkliffscripts/kimkliffmodel.py", line 60, in <module>
    _=calc.create(configs,reuse=False,use_welford_method=False,nprocs=1)

  File "/home/michaelmacisaac/.conda/envs/kimkliff/lib/python3.7/site-packages/kliff/calculators/calculator_torch.py", line 123, in create
    nprocs,

  File "/home/michaelmacisaac/.conda/envs/kimkliff/lib/python3.7/site-packages/kliff/descriptors/descriptor.py", line 115, in generate_fingerprints
    configs, fit_forces, fit_stress, nprocs

  File "/home/michaelmacisaac/.conda/envs/kimkliff/lib/python3.7/site-packages/kliff/descriptors/descriptor.py", line 230, in _calc_zeta_dzetadr
    z, dzdr_f, dzdr_s = self.transform(conf, fit_forces, fit_stress)

  File "/home/michaelmacisaac/.conda/envs/kimkliff/lib/python3.7/site-packages/kliff/descriptors/symmetry_function/sym_fn.py", line 159, in transform
    i, coords, species, neighlist, grad

  File "/home/michaelmacisaac/.conda/envs/kimkliff/lib/python3.7/site-packages/numpy/core/_internal.py", line 595, in _dtype_from_pep3118
    def _dtype_from_pep3118(spec):

KeyboardInterrupt

I am unsure how to resolve the issue and would appreciate any assistance.

Thanks

My code for my SiC NN is below:

# Imports
from kliff import nn
from kliff.calculators.calculator_torch import CalculatorTorchSeparateSpecies
from kliff.descriptors.symmetry_function.sym_fn import SymmetryFunction
from kliff.dataset import Dataset
from kliff.dataset.weight import Weight
from kliff.loss import Loss
from kliff.models import NeuralNetwork

# Descriptor for atomic configurations
descriptor = SymmetryFunction(
    cut_name="cos",
    cut_dists={"Si-Si": 5.0, "C-C": 5.0, "Si-C": 5.0},
    hyperparams="set51",
    normalize=True,
)

# Data
dataset_path = "/blue/subhash/michaelmacisaac/SiC/SiC-3C/minimodel/scripts/kimkliffscripts/xyz_files/temperature"
weight = Weight(energy_weight=0.5, forces_weight=0.5)
data = Dataset(dataset_path, weight)
configs = data.get_configs()

# Neural networks: one model per species
N1 = 20  # neurons in the first hidden layer

modelsi = NeuralNetwork(descriptor)
modelsi.add_layers(
    nn.Linear(descriptor.get_size(), N1),
    nn.LeakyReLU(negative_slope=0.25),
    nn.Linear(N1, 1),
)
modelsipath = "/blue/subhash/michaelmacisaac/SiC/SiC-3C/minimodel/scripts/kimkliffscripts/models/8_29_22/kliff_saved_modelsi"
modelsi.set_save_metadata(prefix=modelsipath, start=5, frequency=2)

modelc = NeuralNetwork(descriptor)
modelc.add_layers(
    nn.Linear(descriptor.get_size(), N1),
    nn.LeakyReLU(negative_slope=0.25),
    nn.Linear(N1, 1),
)
modelcpath = "/blue/subhash/michaelmacisaac/SiC/SiC-3C/minimodel/scripts/kimkliffscripts/models/8_29_22/kliff_saved_modelc"
modelc.set_save_metadata(prefix=modelcpath, start=5, frequency=2)

# Calculator
calc = CalculatorTorchSeparateSpecies(models={"Si": modelsi, "C": modelc}, gpu=False)
_ = calc.create(configs, reuse=False, use_welford_method=False, nprocs=1)

# Loss
loss = Loss(calc)
result = loss.minimize(method="Adam", num_epochs=10, batch_size=4, lr=0.001)

modelsi.save("finalmodelsi.pkl")
modelc.save("finalmodelc.pkl")
loss.save_optimizer_state("optimizer_stat.pkl")
mjwen commented 2 years ago

Hi @michaelmacisaac, your code seems alright to me, and I think it is working as expected.

You saw Start computing mean and stdev of fingerprints and it did not proceed. This means it is still computing the mean and standard deviation of the fingerprints. Note that this is computation intensive and may take a while given that you have 6000 configurations. If you are running on your laptop, memory can be a bottleneck. To get around it, you can set use_welford_method=True to use the Welford algorithm, which computes the mean and standard deviation in a streaming fashion. The good news is that this computationally intensive part only needs to be done once, at the beginning.

So, if you wait a bit longer, it should get to the next steps. Also, if you want to check that everything works, you can first try fewer configurations, say 100, for a quick run. Once you are confident it works, fit using the entire 6000 configurations.
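Something like the following, reusing the names from your script (a sketch; the 100-configuration slice is just for illustration):

# Quick sanity check: build fingerprints for a small subset first.
subset = configs[:100]
_ = calc.create(subset, reuse=False, use_welford_method=False, nprocs=1)

# Full dataset: the Welford algorithm computes the mean/stdev in a streaming
# fashion, so the whole fingerprint array need not be held in memory at once.
_ = calc.create(configs, reuse=False, use_welford_method=True, nprocs=1)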

michaelmacisaac commented 2 years ago

@mjwen Thank you for the prompt reply! I will try and create a smaller batch of fingerprints like you suggested! I am wanting to develop SiC potentials (a potential for Si and another for C) for use in LAMMPS simulations, but before I implement the potentials in LAMMPS, I want to test the potential using a hold out test set as well as implement cross-validation in potential(s) training (e.g. use 5-fold cross validation and save the best performing models). I am wondering if you have any suggestions on where I should look in the documentation to find out how to evaluate the developed potentials without implementing them in LAMMPS, i.e. is there a way to use the two saved .pkl files and evaluate their ability to predict the energies of a test set of SiC structures?

mjwen commented 2 years ago

Yes. After a model is trained, you can reload it and then check the prediction error on a test set. The EnergyForcesRMSE analyzer might be what you are looking for.
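For example, something along these lines (a sketch; it assumes you rebuild the architecture exactly as in training before loading the saved weights):

# Rebuild the Si model with the same layers used in training, then load it.
modelsi = NeuralNetwork(descriptor)
modelsi.add_layers(
    nn.Linear(descriptor.get_size(), 20),
    nn.LeakyReLU(negative_slope=0.25),
    nn.Linear(20, 1),
)
modelsi.load("finalmodelsi.pkl")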

michaelmacisaac commented 2 years ago

@mjwen Thanks for that tip! I understand how one would compare model performance for a single-element system, but I am unsure how to evaluate the performance of potentials for multi-element systems, i.e., if I have a system of 2, 3, 4, etc. elements, how does one evaluate their combined ability to predict energies, positions, forces, etc.? At the moment I am working with this SiC example, but in the future I want to work with larger design spaces, like HEAs.

Additionally, I was looking through the "CalculatorTorchSeparateSpecies" documentation, and it appears that only the model corresponding to the last element of the dictionary can make use of the "set_save_metadata" method. Does this mean that only the last model is optimized, or that both models (Si and C) are optimized but only the model corresponding to the last element is iteratively saved, while the model corresponding to the first element in the dictionary is saved at the end of training?

Finally, if it is the case that only the model corresponding to the final element in the dictionary is optimized, would it be advantageous to employ multiple calculators, such that for one calculator the last element in the dictionary is Si and for the other the last element is C? I have also been wondering whether the "WrapperCalculator" may be useful for multi-element systems, because when making predictions, the predictions are the sum of the predictions of the calculators.

I have really enjoyed learning the KimKliff package and am excited to continue using it in my work; insight on these matters would be greatly appreciated!

mjwen commented 2 years ago

Hi @michaelmacisaac

> @mjwen Thanks for that tip! I understand how one would compare model performance for a single-element system, but I am unsure how to evaluate the performance of potentials for multi-element systems, i.e., if I have a system of 2, 3, 4, etc. elements, how does one evaluate their combined ability to predict energies, positions, forces, etc.? At the moment I am working with this SiC example, but in the future I want to work with larger design spaces, like HEAs.

We should be able to test multiple species using an analyzer as in this script, but you would need to change Calculator to CalculatorTorchSeparateSpecies and load the corresponding models.
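Roughly like this (a sketch; the test-set path is hypothetical, and it assumes the two models have been rebuilt and reloaded as above):

from kliff.analyzers import EnergyForcesRMSE

# Point the calculator at a held-out test set instead of the training data.
test_configs = Dataset("path/to/test_set", weight).get_configs()  # hypothetical path
calc = CalculatorTorchSeparateSpecies(models={"Si": modelsi, "C": modelc}, gpu=False)
_ = calc.create(test_configs, reuse=False)

# Report energy and forces RMSE over the test configurations.
analyzer = EnergyForcesRMSE(calc)
analyzer.run(verbose=2)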

> Additionally, I was looking through the "CalculatorTorchSeparateSpecies" documentation, and it appears that only the model corresponding to the last element of the dictionary can make use of the "set_save_metadata" method. Does this mean that only the last model is optimized, or that both models (Si and C) are optimized but only the model corresponding to the last element is iteratively saved, while the model corresponding to the first element in the dictionary is saved at the end of training?

The loss should optimize all model parameters. But you are right: there is a bug in the current CalculatorTorchSeparateSpecies that leads to only the parameters of the last model being optimized. Thanks for identifying this; I'll fix it and let you know.

> Finally, if it is the case that only the model corresponding to the final element in the dictionary is optimized, would it be advantageous to employ multiple calculators, such that for one calculator the last element in the dictionary is Si and for the other the last element is C? I have also been wondering whether the "WrapperCalculator" may be useful for multi-element systems, because when making predictions, the predictions are the sum of the predictions of the calculators.

Exactly. CalculatorTorchSeparateSpecies is along the same lines as WrapperCalculator, but WrapperCalculator is targeted at physics-based models, not machine-learning ones. I will fix CalculatorTorchSeparateSpecies to make it work. That being said, as you know, we are working on a more generic NN model, as discussed in #45. Hopefully it'll be ready soon.

> I have really enjoyed learning the KimKliff package and am excited to continue using it in my work; insight on these matters would be greatly appreciated!

mjwen commented 1 year ago

Hi @michaelmacisaac, I've got this fixed. Now if you use CalculatorTorchSeparateSpecies, all parameters will be trained. The fix has been merged into the master branch.

michaelmacisaac commented 1 year ago

Hi @mjwen, thank you for the tips! I will use the provided script and modify it to include multiple species/models. Thank you for modifying CalculatorTorchSeparateSpecies; I look forward to using it.

michaelmacisaac commented 1 year ago

Hi @mjwen, I am wondering what exactly the loss function outputs. I have been looking at loss.py, and it appears that the reported epoch_loss is not a loss (energy) per atom (e.g., eV/atom) but rather a sum of the average losses per batch. If I am wrong about this, please let me know!

mjwen commented 1 year ago

@michaelmacisaac Sorry, I missed this one.

Typically, the loss will contain both an energy contribution and a force contribution. Please see this. In addition, one can choose to normalize the total energy by the number of atoms in a configuration by setting residual_data['normalize_by_natoms'] = True when creating the Loss object.
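For example (a sketch using the calculator from your script):

# Normalize each configuration's residual by its number of atoms, so the
# reported loss is closer to a per-atom quantity.
residual_data = {"normalize_by_natoms": True}
loss = Loss(calc, residual_data=residual_data)
result = loss.minimize(method="Adam", num_epochs=10, batch_size=4, lr=0.001)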

And you are right, the loss is also normalized by the batch size.