mir-group / flare_pp

A many-body extension of the FLARE code.
MIT License

Help for processing large dataset #44

Open mhsiron opened 1 year ago

mhsiron commented 1 year ago

Hello,

I am using the following code (very similar to the one in the Google Colab notebook) for a dataset of ~3,000 structures with 150-190 atoms each. I have one cluster with 128 GB of memory and one with 384 GB. On both, the code uses all the memory and typically crashes while setting up the training data.

from ase.io import read
import numpy as np
import flare_pp._C_flare as flare_pp
from flare_pp.sparse_gp import SGP_Wrapper
from flare_pp.sparse_gp_calculator import SGP_Calculator

dataset = read('processed_dataset.xyz', index=":", format='extxyz')

n_strucs = len(dataset)
forces = [x.get_forces() for x in dataset]
positions = [x.get_positions() for x in dataset]
# NOTE: the cell, species, and species coding are taken from the first frame,
# so all frames are assumed to share the same cell and atom ordering.
cell = np.array(dataset[0].cell.tolist())
species = dataset[0].get_atomic_numbers()
species_code = {k: n for n, k in enumerate(set(species))}

coded_species = []
for spec in species:
    coded_species.append(species_code[spec])

# Choose training and validation structures.
training_size = int(n_strucs*.8)
validation_size = n_strucs-int(n_strucs*.8)
shuffled_frames = [int(n) for n in range(n_strucs)]
np.random.shuffle(shuffled_frames)

training_pts = shuffled_frames[0:training_size]
validation_pts = shuffled_frames[training_size:training_size + validation_size]

parameters = {}
# Define many-body descriptor.
cutoff = parameters.get("cutoff",7)
n_species = len(set(species))
N = parameters.get("n_max",15)  # Number of radial basis functions
lmax = parameters.get("l_max",3)  # Largest L included in spherical harmonics
radial_basis = parameters.get("radial_basis_function","chebyshev")  # Radial basis set
cutoff_name = parameters.get("cutoff_function","quadratic")  # Cutoff function

radial_hyps = [0, cutoff]
cutoff_hyps = []
descriptor_settings = [n_species, N, lmax]

# Define a B2 object.
B2 = flare_pp.B2(radial_basis, cutoff_name, radial_hyps, cutoff_hyps,
                 descriptor_settings)

# The GP class can take a list of descriptors as input, but here
# we'll use a single descriptor.
descriptors = [B2]

# Define kernel function.
sigma = parameters.get("sigma",2.0)
power = parameters.get("power",2)
dot_product_kernel = flare_pp.NormalizedDotProduct(sigma, power)

# Define a list of kernels.
# There needs to be one kernel for each descriptor.
kernels = [dot_product_kernel]

# Define sparse GP.
noa = len(dataset[0])  # atoms in the first frame (frames here have 150-190 atoms)
sigma_e = parameters.get("sigma_e", 0.005 * noa)  # Energy noise: 0.005 per atom (5 meV/atom if labels are in eV)
sigma_f = parameters.get("sigma_f", 0.005)  # Force noise (5 meV/A if labels are in eV/A)
sigma_s = parameters.get("sigma_s", 0.0007)  # Stress noise
gp_model = flare_pp.SparseGP(kernels, sigma_e, sigma_f, sigma_s)

# Calculate descriptors of the validation and training structures.
print("Computing descriptors of validation points...")
validation_strucs = []
validation_forces = [] 
for n, snapshot in enumerate(validation_pts):
    pos = positions[snapshot]
    frcs = forces[snapshot]

    # Create structure object, which computes and stores descriptors.
    struc = \
        flare_pp.Structure(cell, coded_species, pos, cutoff, descriptors)
    validation_strucs.append(struc)
    validation_forces.append(frcs)
print("Done.")

print("Computing descriptors of training points...")
training_strucs = []
training_forces = [] 

## CODE TYPICALLY CRASHES IN THE LOOP BELOW, usually anywhere from iteration 20 to iteration 160 depending on the cluster used:

for n, snapshot in enumerate(training_pts):
    pos = positions[snapshot]
    frcs = forces[snapshot]
    # Create structure object, which computes and stores descriptors.
    struc = \
        flare_pp.Structure(cell, coded_species, pos, cutoff, descriptors)

    # Assign force labels to the training structure.
    struc.forces = frcs.reshape(-1)

    training_strucs.append(struc)
    training_forces.append(frcs)
print("Done.")

# Train the model.
print("Training the GP...")
batch_size = 50  # monitor the MAE after adding this many frames
n_batches = training_size // batch_size
n_added = np.zeros(n_batches)  # structures added at each checkpoint
mb_maes = np.zeros(n_batches)  # validation force MAE at each checkpoint
for m in range(training_size):
    train_struc = training_strucs[m]
    # Add training structure and sparse environments.
    gp_model.add_training_structure(train_struc)
    gp_model.add_all_environments(train_struc)

    if (m + 1) % batch_size == 0:
        # Update the sparse GP training coefficients.
        gp_model.update_matrices_QR()

        # Predict on the validation set.
        pred_forces = [] #np.zeros((validation_size, noa, 3))
        for n, test_struc in enumerate(validation_strucs):
            gp_model.predict_SOR(test_struc)
            c_noa = test_struc.noa
            pred_vals = test_struc.mean_efs[1:-6].reshape(c_noa, 3)
            pred_forces.append(pred_vals)

        # Calculate and store the MAE.
        batch_no = int((m + 1) / batch_size)
        v_f = np.concatenate([f.reshape(-1) for f in validation_forces])
        p_f = np.concatenate([f.reshape(-1) for f in pred_forces])
        mae = np.mean(np.abs(v_f - p_f))
        n_added[batch_no - 1] = batch_size * batch_no
        mb_maes[batch_no - 1] = mae
        print("Batch %i force MAE: %.2f eV/A" % (batch_no, mae))

# Write LAMMPS potential file.
file_name = "trained_sparse_gaussian_model.txt"
contributor = "model description"  # any free-form string identifying the model/author

# The "kernel index" indicates which kernel to map for multi-descriptor models.
# For single-descriptor models like this one, just set it to 0.
kernel_index = 0

gp_model.write_mapping_coefficients(file_name, contributor, kernel_index)

Is there a more efficient way to load the data as needed so that the system memory is not completely used up?

YuuuXie commented 1 year ago

Hi @mhsiron, here are a few suggestions

  1. To train the model, you don't need to load the whole dataset into Structure objects before training (i.e., before adding them to the GP). Instead, you can create each Structure right before calling gp_model.add_training_structure (see the sketch at the end of this comment).
  2. You are using cutoff=7 and n_max=15, which makes the descriptor dimension large. You might want to check whether you actually need this setting; you can run a quick test with a small dataset.
  3. You have 3,000 frames, but presumably many of them contain similar atomic environments, so you don't really need to add all the frames and all the environments like
    gp_model.add_training_structure(train_struc)
    gp_model.add_all_environments(train_struc)

    I would suggest selecting frames and environments by uncertainty, for which you can use our offline training workflow.

For the offline training, we recently merged the whole flare_pp repo into flare (development branch), and you can check the offline training tutorial.
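
A minimal sketch of suggestion 1, assuming the variable names from the script above (positions, forces, cell, coded_species, cutoff, descriptors, gp_model, training_pts): each Structure is built only when it is added to the GP, so the descriptors of the whole dataset are never held in memory at once.

for m, snapshot in enumerate(training_pts):
    # Build the structure (and its descriptors) just-in-time.
    struc = flare_pp.Structure(
        cell, coded_species, positions[snapshot], cutoff, descriptors
    )
    struc.forces = forces[snapshot].reshape(-1)

    gp_model.add_training_structure(struc)
    gp_model.add_all_environments(struc)

    if (m + 1) % 50 == 0:
        gp_model.update_matrices_QR()
    # Nothing keeps a reference to struc after this iteration, so it can be
    # garbage-collected, unlike the entries of training_strucs in the original script.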

mhsiron commented 1 year ago

Thank you @YuuuXie, I have compiled the development branch of flare and am following the on-the-fly / fake-calculator test. I am getting the following output:

Precomputing KnK for hyps optimization
Done precomputing. Time: 0.00011658668518066406
Hyperparameters:
[2.e+00 1.e-01 5.e-02 1.e-03]
Likelihood gradient:
[           nan 1.98496392e+09 3.32521729e+07 8.78376850e+14]
Likelihood:
-439288503875.8221

Precomputing KnK for hyps optimization
Done precomputing. Time: 0.025263547897338867
Hyperparameters:
[2.e+00 1.e-01 5.e-02 1.e-03]
Likelihood gradient:
[           nan 4.85412292e+07 3.79961899e+12 3.22852447e+13]
Likelihood:
-112246911046.97923

Precomputing KnK for hyps optimization
Done precomputing. Time: 0.05576014518737793
Hyperparameters:
[2.e+00 1.e-01 5.e-02 1.e-03]
Likelihood gradient:
[           nan 7.67247922e+06 1.19415661e+12 7.94622600e+12]
Likelihood:
-38366412460.39977

How should I interpret the magnitude of the likelihood? And why is the first component of my likelihood gradient nan? The hyperparameters also do not appear to be changing.

Here is my config file:

supercell: 
    file: processed_data.xyz                                              # Use previously generated DFT frames as input
    format: extxyz
    index: 0
    replicate: [1, 1, 1]                                                        # Do not replicate periodically
    jitter: 0.0                                                                 # Do not jitter atoms, since our input is DFT frames

# Set up FLARE calculator with (sparse) Gaussian process                        # This section stays the same as previous
flare_calc:
    gp: SGP_Wrapper
    kernels:
        - name: NormalizedDotProduct
          sigma: 2
          power: 2
    descriptors:
        - name: B2
          nmax: 8
          lmax: 3
          cutoff_function: quadratic
          radial_basis: chebyshev
          cutoff_matrix: [[5.0,5.0],[5.0,5.0]]
    energy_noise: 0.1
    forces_noise: 0.05
    stress_noise: 0.001
    species:
        - 22
        - 8
    single_atom_energies:
        - 0
        - 0
    cutoff: 7.0
    variance_type: local
    max_iterations: 20
    use_mapping: True

# Set up DFT calculator, it can be any calculator supported by ASE
dft_calc:
    name: FakeDFT                                                               # We are going to perform "FakeDFT" since our extxyz input already contains the DFT labels
    kwargs: 
        filename: processed_data.xyz                                                   # Point to file containing DFT frames
        format: extxyz
        index: ":"
        io_kwargs: {}
        properties: [forces,energy]
    params: {}

# Set up On-the-fly training and MD
otf: 
    mode: fresh                                                                 # Start from empty SGP
    md_engine: Fake                                                             # Do not perform MD, just read frames sequentially
    md_kwargs: 
        filenames: [processed_data.xyz]
        format: extxyz
        index: ":"
        io_kwargs: {}
    initial_velocity: file
    dt: 0.001                                                                   # This value is arbitrary in this setting
    number_of_steps: 100000                                                     # Set to a value greater than the number of your DFT frames
    output_name: offline                                                        # output name
    init_atoms: [0, 90]                                                         # init atoms from first frame to add to sparse set
    std_tolerance_factor: -0.01
    max_atoms_added: -1
    train_hyps: [0,inf]
    write_model: 4
    update_style: threshold
    update_threshold: 0.001
    force_only: False 

mhsiron commented 1 year ago

I tried Flare branch 1.1.2, which did not show "nan" for the first gradient component; however, it failed with:

Hyperparameters:
[ 1.39119812e+12 -1.50422897e+03  2.07407411e+00  3.31245338e+02]
Likelihood gradient:
[ 1.45688477e+03  3.87175313e+11 -1.91296921e+14 -2.05101010e+01]
Likelihood:
-5192.678368096951

Traceback (most recent call last):
  File "/nfs/site/disks/msironqchem1/cond2/bin/flare-otf", line 8, in <module>
    sys.exit(main())
  File "/nfs/site/disks/msironqchem1/cond2/lib/python3.8/site-packages/flare/scripts/otf_train.py", line 372, in main
    fresh_start_otf(config)
  File "/nfs/site/disks/msironqchem1/cond2/lib/python3.8/site-packages/flare/scripts/otf_train.py", line 339, in fresh_start_otf
    otf.run()
  File "/nfs/site/disks/msironqchem1/cond2/lib/python3.8/site-packages/flare/learners/otf.py", line 381, in run
    self.checkpoint()
  File "/nfs/site/disks/msironqchem1/cond2/lib/python3.8/site-packages/flare/learners/otf.py", line 861, in checkpoint
    json.dump(self.as_dict(), f, cls=NumpyEncoder)
  File "/nfs/site/disks/msironqchem1/cond2/lib/python3.8/site-packages/flare/learners/otf.py", line 782, in as_dict
    self.flare_calc.write_model(self.flare_name)
  File "/nfs/site/disks/msironqchem1/cond2/lib/python3.8/site-packages/flare/bffs/sgp/calculator.py", line 158, in write_model
    json.dump(self.as_dict(), f, cls=NumpyEncoder)
  File "/nfs/site/disks/msironqchem1/cond2/lib/python3.8/site-packages/flare/bffs/sgp/calculator.py", line 139, in as_dict
    out_dict["gp_model"] = self.gp_model.as_dict()
  File "/nfs/site/disks/msironqchem1/cond2/lib/python3.8/site-packages/flare/bffs/sgp/sparse_gp.py", line 179, in as_dict
    custom_range = self.sparse_gp.sparse_indices[0][s]
IndexError: list index out of range

mhsiron commented 1 year ago

If I print "s" and "self.sparse_gp.sparse_indices" I get a sparse_indices array like so: [[[0,90],[1,5,6,7,...]]]

while s >1, so when s==2 I get the out of range error. Should my sparse array be shaped like so: [[9,90,1,5,6, ...]] ?

Is this a problem with the wrapper? or the C library?

I edited line 179 to be custom_range = [item for sublist in self.sparse_gp.sparse_indices[0] for item in sublist][s] and the code appears to be working.
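
For illustration (hypothetical values, matching the shape reported above), the edited line simply flattens the nested index list before indexing by s:

sparse_indices_0 = [[0, 90], [1, 5, 6, 7]]  # sparse_indices[0] as printed above
flat = [item for sublist in sparse_indices_0 for item in sublist]
print(flat)     # [0, 90, 1, 5, 6, 7]
print(flat[2])  # 1 -- indexing with s == 2 no longer raises IndexError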

YuuuXie commented 1 year ago

Hi @mhsiron, as for the nan issue, you might want to check whether your first frame is a perfect crystal lattice or contains identical/highly similar atomic environments; when you add those environments, they make the kernel matrix ill-conditioned.

YuuuXie commented 1 year ago

As for the IndexError, I don't know why it arises; maybe changing the init_atoms will help.

Also, a minor point: set train_hyps: [20, 200], for example, so that the hyperparameters are not trained when there are fewer than 20 or more than 200 frames in the model. With too few frames the training can be unstable; with too many, the hyps usually change very little but the training takes a very long time.

mhsiron commented 1 year ago

Hi @mhsiron, as for the nan issue, you might want to check whether your first frame is a perfect crystal lattice or contains identical/highly similar atomic environments; when you add those environments, they make the kernel matrix ill-conditioned.

Thanks @YuuuXie -- just to double-check: if my extxyz file contains multiple structures with different lattices/numbers of atoms, is that config file still correct? Can the OTF trainer script still be used?

mhsiron commented 1 year ago

As for the IndexError, I don't know why it arises; maybe changing the init_atoms will help.

Also, a minor point: set train_hyps: [20, 200], for example, so that the hyperparameters are not trained when there are fewer than 20 or more than 200 frames in the model. With too few frames the training can be unstable; with too many, the hyps usually change very little but the training takes a very long time.

Great, thanks, will try that!

YuuuXie commented 1 year ago

Hi @mhsiron, as for the nan issue, you might want to check whether your first frame is a perfect crystal lattice or contains identical/highly similar atomic environments; when you add those environments, they make the kernel matrix ill-conditioned.

Thanks @YuuuXie -- just to double-check: if my extxyz file contains multiple structures with different lattices/numbers of atoms, is that config file still correct? Can the OTF trainer script still be used?

Hi @mhsiron, if your frames have several different sizes, that definitely works with the yaml. You can set kwargs: {} for FakeDFT, and you can also list multiple data files to feed to the GP.

# Set up DFT calculator, it can be any calculator supported by ASE
dft_calc:
    name: FakeDFT                                                               # We are going to perform "FakeDFT" since our extxyz input already contains the DFT labels
    kwargs: {}
    params: {}

# Set up On-the-fly training and MD
otf: 
    mode: fresh                                                                 # Start from empty SGP
    md_engine: Fake                                                             # Do not perform MD, just read frames sequentially
    md_kwargs: 
        filenames: [frames_1.xyz, frames_2.xyz, frames_3.xyz]

mhsiron commented 1 year ago

Hello @YuuuXie and Flare Team,

Happy new year! I have the FakeDFT workflow up and running; however, I am a bit concerned about the stability of the error metrics.

Here is my sample config.yaml:

dft_calc:
  kwargs: {}
  name: FakeDFT
  params: {}
flare_calc:
  cutoff: 4
  descriptors:
  - cutoff_function: quadratic
    cutoff_matrix:
    - - 5.0
    lmax: 3
    name: B2
    nmax: 8
    radial_basis: chebyshev
  energy_noise: 0.1
  forces_noise: 0.05
  gp: SGP_Wrapper
  kernels:
  - name: NormalizedDotProduct
    power: 2
    sigma: 2
  max_iterations: 40
  single_atom_energies:
  - 0
  species:
  - 6
  stress_noise: 0.001
  use_mapping: true
  variance_type: local
otf:
  dt: 0.001
  force_only: false
  init_atoms:
  - 0
  - 1
  - 2
  - 3
  - 4
  - 5
  - 6
  - 7
  - 8
  - 9
  initial_velocity: file
  max_atoms_added: -1
  md_engine: Fake
  md_kwargs:
    filenames:
    - processed_dataset.xyz
    format: extxyz
    index: ':'
    io_kwargs: {}
  mode: fresh
  number_of_steps: 100000
  output_name: trained
  std_tolerance_factor: -0.01
  train_hyps:
  - 20
  - inf
  update_style: threshold
  update_threshold: 0.01
  write_model: 4
supercell:
  file: processed_dataset.xyz
  format: extxyz
  index: 0
  jitter: 0
  replicate:
  - 1
  - 1
  - 1

In terms of errors, for energy, for example, I get metrics like these after each iteration:

energy mae: 5.3784 eV
energy mae: 6.4629 eV
energy mae: 6.2668 eV
energy mae: 5.5638 eV
energy mae: 6.1020 eV
energy mae: 6.8599 eV
energy mae: 9.4289 eV
energy mae: 8.9882 eV
energy mae: 9.6739 eV
energy mae: 10.8812 eV
energy mae: 10.2682 eV
energy mae: 8.8836 eV
energy mae: 7.2134 eV
energy mae: 6.6178 eV
energy mae: 6.4719 eV
energy mae: 6.6006 eV
energy mae: 6.8869 eV
energy mae: 7.0557 eV
energy mae: 8.3870 eV
energy mae: 7.9107 eV
energy mae: 9.1806 eV
energy mae: 8.4187 eV
energy mae: 8.2424 eV
energy mae: 9.2823 eV
energy mae: 8.7230 eV
energy mae: 6.4926 eV
energy mae: 7.4831 eV
energy mae: 7.2246 eV
energy mae: 6.7165 eV
energy mae: 8.5676 eV
energy mae: 7.5140 eV
energy mae: 5.9606 eV
energy mae: 5.1245 eV
energy mae: 4.4561 eV
energy mae: 2.4896 eV
energy mae: 2.7207 eV
energy mae: 3.9601 eV
energy mae: 2.5159 eV
energy mae: 4.3503 eV
energy mae: 5.5353 eV
energy mae: 6.5213 eV
energy mae: 5.3084 eV
energy mae: 5.1909 eV
energy mae: 3.0485 eV
energy mae: 3.7798 eV
energy mae: 3.8274 eV
energy mae: 3.7570 eV
energy mae: 4.8310 eV
energy mae: 5.6203 eV
energy mae: 4.2626 eV
energy mae: 4.2015 eV
energy mae: 6.2995 eV
energy mae: 6.2419 eV
energy mae: 6.7976 eV
energy mae: 5.3529 eV
energy mae: 3.8578 eV
energy mae: 4.3435 eV
energy mae: 4.6956 eV
energy mae: 5.4229 eV
energy mae: 3.8518 eV
energy mae: 2.9459 eV
energy mae: 2.5329 eV
energy mae: 1.5465 eV
energy mae: 0.9443 eV
energy mae: 1.6042 eV
energy mae: 3.2754 eV
energy mae: 5.7172 eV
energy mae: 4.0533 eV
energy mae: 4.9140 eV
energy mae: 4.0036 eV
energy mae: 3.5122 eV
energy mae: 3.6981 eV
energy mae: 4.5726 eV
energy mae: 3.4951 eV
energy mae: 3.3984 eV
energy mae: 4.4321 eV
energy mae: 3.7758 eV
energy mae: 4.2584 eV
energy mae: 3.2534 eV
energy mae: 4.8749 eV
energy mae: 6.2909 eV
energy mae: 5.3607 eV
energy mae: 6.2928 eV
energy mae: 4.8835 eV
energy mae: 6.0457 eV
energy mae: 7.3016 eV
energy mae: 7.3053 eV
energy mae: 6.2537 eV
energy mae: 5.4464 eV
energy mae: 5.0429 eV
energy mae: 5.1737 eV
energy mae: 6.7264 eV
energy mae: 6.6276 eV
energy mae: 5.6163 eV
energy mae: 5.4634 eV
energy mae: 5.1897 eV
energy mae: 5.7783 eV
energy mae: 4.5612 eV
energy mae: 4.9538 eV
energy mae: 6.1363 eV
energy mae: 6.2278 eV
energy mae: 6.1512 eV
energy mae: 5.8609 eV
energy mae: 5.2928 eV
energy mae: 6.3415 eV
energy mae: 6.1416 eV
energy mae: 6.4574 eV
energy mae: 7.6178 eV
energy mae: 6.8234 eV
energy mae: 6.8323 eV
energy mae: 8.0934 eV
energy mae: 9.2747 eV
energy mae: 8.7612 eV
energy mae: 7.0277 eV
energy mae: 6.8970 eV
energy mae: 6.3439 eV
energy mae: 7.5710 eV
energy mae: 7.7802 eV
energy mae: 6.9109 eV
energy mae: 7.6083 eV
energy mae: 6.7630 eV
energy mae: 8.0223 eV
energy mae: 8.2862 eV
energy mae: 9.3410 eV
energy mae: 9.6269 eV
energy mae: 9.9936 eV
energy mae: 10.1480 eV
energy mae: 11.1701 eV
energy mae: 9.9167 eV
energy mae: 9.1505 eV
energy mae: 8.7440 eV
energy mae: 7.5971 eV
energy mae: 6.7653 eV
energy mae: 8.0176 eV
energy mae: 6.6477 eV
energy mae: 7.0426 eV
energy mae: 7.9049 eV
energy mae: 8.9931 eV
energy mae: 9.3207 eV
energy mae: 10.6333 eV
energy mae: 11.3318 eV
energy mae: 12.4522 eV
energy mae: 9.8949 eV
energy mae: 10.1299 eV
energy mae: 8.7420 eV
energy mae: 9.3951 eV
energy mae: 11.3263 eV
energy mae: 11.6740 eV
energy mae: 12.3935 eV
energy mae: 11.5213 eV
energy mae: 10.1154 eV
energy mae: 11.3596 eV
energy mae: 9.9120 eV
energy mae: 10.8610 eV
energy mae: 11.9575 eV
energy mae: 12.6236 eV
energy mae: 13.7832 eV
energy mae: 13.0603 eV
energy mae: 12.7233 eV
energy mae: 13.7000 eV
energy mae: 12.9169 eV
energy mae: 12.8120 eV
energy mae: 12.5336 eV
energy mae: 12.9465 eV
energy mae: 11.6466 eV
energy mae: 11.0176 eV
energy mae: 11.2135 eV
energy mae: 8.6667 eV
energy mae: 7.2858 eV
energy mae: 8.9789 eV
energy mae: 7.8325 eV
energy mae: 6.7679 eV
energy mae: 7.1905 eV
energy mae: 7.0116 eV
energy mae: 8.5520 eV
energy mae: 9.3439 eV
energy mae: 7.8790 eV
energy mae: 8.7422 eV
energy mae: 9.8198 eV
energy mae: 10.1962 eV
energy mae: 9.7840 eV
energy mae: 8.6575 eV
energy mae: 7.5451 eV
energy mae: 6.4088 eV
energy mae: 9.5627 eV
energy mae: 9.6822 eV
energy mae: 8.9451 eV
energy mae: 10.3366 eV
energy mae: 9.8194 eV
energy mae: 8.9939 eV
energy mae: 8.9014 eV
energy mae: 8.5694 eV
energy mae: 8.7023 eV
energy mae: 8.3402 eV
energy mae: 9.5136 eV
energy mae: 8.6086 eV
energy mae: 8.4197 eV
energy mae: 5.7802 eV
energy mae: 5.2688 eV
energy mae: 4.7184 eV
energy mae: 6.4373 eV
energy mae: 6.0839 eV
energy mae: 7.7399 eV
energy mae: 7.7399 eV
energy mae: 7.7532 eV
energy mae: 7.2610 eV
energy mae: 8.3390 eV
energy mae: 9.5546 eV
energy mae: 11.2016 eV
energy mae: 11.3947 eV
energy mae: 12.4230 eV
energy mae: 12.7769 eV
energy mae: 13.0084 eV
energy mae: 12.8821 eV
energy mae: 14.3463 eV
energy mae: 14.4470 eV
energy mae: 15.7711 eV
energy mae: 13.4351 eV
energy mae: 12.8603 eV
energy mae: 12.4138 eV
energy mae: 10.9812 eV
energy mae: 9.8611 eV
energy mae: 10.9384 eV
energy mae: 9.8697 eV
energy mae: 9.8535 eV
energy mae: 8.5961 eV
energy mae: 7.8845 eV
energy mae: 7.5840 eV
energy mae: 6.3671 eV
energy mae: 2.9021 eV
energy mae: 4.9227 eV
energy mae: 4.0532 eV
energy mae: 4.4025 eV
energy mae: 4.6155 eV
energy mae: 7.8598 eV
energy mae: 6.9116 eV
energy mae: 6.0066 eV
energy mae: 5.3031 eV
energy mae: 6.4731 eV
energy mae: 6.1282 eV
energy mae: 6.5038 eV
energy mae: 4.6521 eV
energy mae: 4.5353 eV
energy mae: 6.0242 eV
energy mae: 5.0789 eV
energy mae: 5.7213 eV
energy mae: 6.1787 eV
energy mae: 6.4513 eV
energy mae: 7.3848 eV
energy mae: 7.2421 eV
energy mae: 6.6136 eV
energy mae: 5.8530 eV
energy mae: 6.4601 eV
energy mae: 5.7140 eV
energy mae: 5.0170 eV
energy mae: 5.8671 eV
energy mae: 8.7236 eV
energy mae: 8.0099 eV
energy mae: 6.5371 eV
energy mae: 5.6670 eV
energy mae: 6.2031 eV
energy mae: 7.5703 eV
energy mae: 7.2463 eV
energy mae: 6.4997 eV
energy mae: 6.3310 eV
energy mae: 9.9260 eV
energy mae: 9.0904 eV
energy mae: 9.1911 eV
energy mae: 9.3341 eV
energy mae: 8.5923 eV
energy mae: 10.2608 eV
energy mae: 10.2946 eV
energy mae: 11.9593 eV
energy mae: 10.7358 eV
energy mae: 11.1939 eV
energy mae: 10.5738 eV
energy mae: 11.0525 eV
energy mae: 11.2045 eV
energy mae: 12.8875 eV
energy mae: 14.3059 eV
energy mae: 13.6059 eV
energy mae: 14.0490 eV
energy mae: 14.2683 eV
energy mae: 12.5639 eV
energy mae: 13.4562 eV
energy mae: 13.6611 eV
energy mae: 12.6352 eV
energy mae: 12.2801 eV
energy mae: 13.0000 eV
energy mae: 10.5159 eV
energy mae: 8.4466 eV
energy mae: 9.9655 eV
energy mae: 8.6534 eV
energy mae: 9.2614 eV
energy mae: 7.0083 eV
energy mae: 7.8094 eV
energy mae: 4.3186 eV
energy mae: 5.1512 eV
energy mae: 6.4425 eV
energy mae: 5.0850 eV
energy mae: 6.0888 eV
energy mae: 8.9941 eV
energy mae: 8.5203 eV
energy mae: 8.7701 eV
energy mae: 8.7701 eV
energy mae: 9.0056 eV
energy mae: 7.9156 eV
energy mae: 7.7265 eV
energy mae: 7.1488 eV
energy mae: 7.1657 eV
energy mae: 7.2221 eV
energy mae: 5.9179 eV
energy mae: 5.6422 eV
energy mae: 5.9878 eV
energy mae: 5.8336 eV
energy mae: 4.9300 eV
energy mae: 4.1940 eV
energy mae: 4.3445 eV
energy mae: 2.4763 eV
energy mae: 2.6949 eV
energy mae: 4.4704 eV
energy mae: 5.7731 eV
energy mae: 5.8620 eV
energy mae: 7.0675 eV
energy mae: 7.8671 eV
energy mae: 7.5589 eV
energy mae: 5.8291 eV
energy mae: 6.1577 eV
energy mae: 8.9022 eV
energy mae: 11.2028 eV
energy mae: 11.9735 eV
energy mae: 14.2759 eV
energy mae: 14.7392 eV
energy mae: 12.2281 eV
energy mae: 12.1335 eV
energy mae: 11.9616 eV
energy mae: 9.8048 eV
energy mae: 11.1278 eV
energy mae: 9.3350 eV
energy mae: 7.4344 eV
energy mae: 5.7895 eV
energy mae: 6.1796 eV
energy mae: 7.1456 eV
energy mae: 5.6937 eV
energy mae: 4.4446 eV
energy mae: 3.5896 eV
energy mae: 5.3565 eV
energy mae: 5.9765 eV
energy mae: 5.8848 eV
energy mae: 4.9929 eV
energy mae: 3.3400 eV
energy mae: 1.9077 eV
energy mae: 2.1140 eV
energy mae: 1.2519 eV
energy mae: 3.7352 eV
energy mae: 4.4694 eV
energy mae: 4.6825 eV
energy mae: 3.4600 eV
energy mae: 2.5086 eV
energy mae: 3.7450 eV
energy mae: 3.8189 eV
energy mae: 5.6064 eV
energy mae: 4.9411 eV
energy mae: 3.6392 eV
energy mae: 3.2537 eV
energy mae: 3.7463 eV
energy mae: 3.2651 eV
energy mae: 4.4460 eV
energy mae: 5.1457 eV
energy mae: 7.0533 eV
energy mae: 6.4071 eV
energy mae: 5.7949 eV
energy mae: 4.8035 eV
energy mae: 4.6217 eV
energy mae: 4.5854 eV
energy mae: 5.6789 eV
energy mae: 5.4989 eV
energy mae: 5.2477 eV
energy mae: 5.3539 eV
energy mae: 5.3579 eV
energy mae: 5.5029 eV
energy mae: 8.4894 eV
energy mae: 7.6532 eV

In the past, when I loaded all structures, the MAE decreased after each epoch and ended up much lower than the values here. Is my model configuration improper? Am I not loading enough structures or sites? Is there something else I'm doing wrong?

YuuuXie commented 1 year ago

Hi @mhsiron, sorry for the late reply. Your yaml file format seems to be a bit messed up; I wonder what kind of cutoff you were trying to impose?

flare_calc:
  cutoff: 4
  descriptors:
  - cutoff_function: quadratic
    cutoff_matrix:
    - - 5.0

I don't quite understand why there is cutoff: 4 and then cutoff_matrix: - - 5.0; it looks very confusing.

mhsiron commented 1 year ago

Hi @YuuuXie, no worries!

I changed the cutoff_matrix to match the cutoff (cutoff_matrix: [[5.0]] and cutoff: 5.0). I still get large fluctuations in the energy MAEs, which do not appear to converge.

I am using PyYAML to write my configuration file; - - 5.0 is equivalent to [[5.0]]. Just to double-check, I have also tried it with the brackets, and I get the same fluctuations.
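
For reference, a small check with plain PyYAML (nothing flare-specific) showing that the block-style and flow-style spellings load to the same nested list:

import yaml

block = yaml.safe_load("cutoff_matrix:\n- - 5.0")  # block style, as PyYAML writes it
flow = yaml.safe_load("cutoff_matrix: [[5.0]]")    # flow style, as in the docs
assert block == flow == {"cutoff_matrix": [[5.0]]}

# safe_dump emits block style by default, which is why the config file shows "- - 5.0".
print(yaml.safe_dump({"cutoff_matrix": [[5.0]]}))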

YuuuXie commented 1 year ago

Hi @mhsiron , I'm not sure what your system is, but previously you were running on this two-species system,

    species:
        - 22
        - 8

and now you are just running on carbon?

  species:
  - 6

Could you give me a bit more information on your system and dataset?

I would suggest checking a few different cutoffs, and also checking the energy noise (Hyp1) in the flare output file to see whether it is very large.

mhsiron commented 1 year ago

Hi @YuuuXie, that's correct. I changed/simplified the system just to play around with the flare parameters. The previous system was 3 different phases of TiO2; the current system is the trajectory of an amorphous carbon MD simulation at ~1000 K run with VASP.

mhsiron commented 1 year ago

The force and stress MAEs were also fluctuating. Right now I am trying force-only training. I still do not get convergence, and the hyperparameters are: [3.21365770e+01 1.00000000e-03 5.62081974e-01 1.00000000e-03]

mhsiron commented 1 year ago

In the original code (top of the thread), I was able to get MAEs that converged with each iteration, with force MAEs < 150 meV. However, the original code required too much memory on my systems, so I switched to the FakeDFT method.

mhsiron commented 1 year ago

Here is the dataset I am using: processed_dataset.zip

YuuuXie commented 1 year ago

@mhsiron Do you mean that you are using exactly the same dataset, parameters, and settings in the python script as in the yaml file, but you get very different error ranges?

mhsiron commented 1 year ago

I'm not sure, because in the python script I cannot load the whole dataset; I can only load ~50 structures before running out of memory. On those structures I can get a low MAE that I cannot reproduce with the same dataset and parameters in the yaml file. However, I believe FakeDFT does not load every site or every structure?

YuuuXie commented 1 year ago

You could probably try using the same subset of structures and see if you can reproduce the result. Also, neither FakeDFT nor the python script requires you to load the whole dataset (by "load" I mean converting frames into flare Structure objects, in which all the descriptors are computed; the x, y, z coordinates of the frames are of course read from the file).
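
One simple way to compare on identical data (a hypothetical slice; the filename is the one used earlier in the thread) is to let ASE cut out a subset of frames and write it to a new file that both workflows can point at:

from ase.io import read, write

# Read only the first 50 frames; nothing is converted to flare Structure objects here.
subset = read("processed_dataset.xyz", index=":50", format="extxyz")

# Write them out so the python script and the yaml/FakeDFT workflow see the same frames.
write("processed_dataset_subset.xyz", subset, format="extxyz")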

mhsiron commented 1 year ago

I see. But there is nothing out of the ordinary in the config file that would lead to large energy/force/stress fluctuations that appear to not be converging?

YuuuXie commented 1 year ago

If everything in the python script and the yaml file is exactly the same, I wouldn't expect any difference in the outcome.

mhsiron commented 1 year ago

I will remove structures in the dataset to see if there is a difference between the two. What could cause non-convergence in Flare in general?