materialsvirtuallab / maml

Python for Materials Machine Learning, Materials Descriptors, Machine Learning Force Fields, Deep Learning, etc.
BSD 3-Clause "New" or "Revised" License

element profile/hyperparameter optimization #385

Closed Rana-Phy closed 2 years ago

Rana-Phy commented 2 years ago

Dear Developers,

I am trying to optimize the element profile for a multicomponent system. I am very much a beginner in Python and am doing this manually with nested 'for' loops, and I am afraid it will take 15 years to finish (200x200x200 searches). I see that the authors have previously done this for several multicomponent systems.

Could you suggest a more efficient and faster way to do this?

```python
# Grid search over the per-element cutoff radii (200 x 200 x 200 combinations)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

from maml.describers import BispectrumCoefficients  # import path may vary by maml version

# Ti, Si, C (element weights), tsc_train_structures, tsc_df and weights
# are defined earlier in my script.
rcut_grid = []
for rc_1 in np.arange(4, 6, 0.01):
    for rc_2 in np.arange(4, 6, 0.01):
        for rc_3 in np.arange(4, 6, 0.01):
            element_profile = {
                "Ti": {"r": rc_1, "w": Ti},
                "Si": {"r": rc_2, "w": Si},
                "C": {"r": rc_3, "w": C},
            }
            describer = BispectrumCoefficients(
                rcutfac=0.5,
                twojmax=6,
                element_profile=element_profile,
                quadratic=False,
                pot_fit=True,
                include_stress=False,
                n_jobs=4,
            )
            tsc_features = describer.transform(tsc_train_structures)
            y = (tsc_df["y_orig"] / tsc_df["n"]).to_numpy()
            x = tsc_features

            simple_model = LinearRegression(n_jobs=4)
            simple_model.fit(x, y, sample_weight=weights)

            energy_indices = np.argwhere(np.array(tsc_df["dtype"]) == "energy").ravel()
            forces_indices = np.argwhere(np.array(tsc_df["dtype"]) == "force").ravel()
            simple_predict_y = simple_model.predict(x)

            original_energy = y[energy_indices]
            original_forces = y[forces_indices]
            simple_predict_energy = simple_predict_y[energy_indices]
            simple_predict_forces = simple_predict_y[forces_indices]

            e_e = mean_absolute_error(original_energy, simple_predict_energy) * 10000
            e_f = mean_absolute_error(original_forces, simple_predict_forces)

            rcut_grid.append((rc_1, rc_2, rc_3, e_e, e_f))
```
JiQi535 commented 2 years ago

Hi Rana, I can give two pieces of advice:

  1. Try to parallelize your grid search for the best combination of parameters. Since each combination of parameters is independent of the others, the combinations can run in parallel and you can select the best one afterwards. There are Python packages that help parallelize a grid search, for example the multiprocessing package (see the sketch after this list). If you can divide your search into 24 parallel processes, the search is likely to be accelerated several-fold, or by more than 10 times.
  2. Choose a reasonable size for the search space of the optimal parameters. In your case, 200x200x200 searches seem to include many cases that are not practical or necessary. I won't suggest an exact range for your search, but you can decide the intervals and the total number of searches depending on the resources you have available.
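
To illustrate point 1, here is a minimal sketch of a parallelized grid search with the standard multiprocessing module. The function `fit_and_score` is a hypothetical helper that wraps the body of your serial triple loop (build the element_profile, compute the bispectrum features, fit the LinearRegression, and return the energy and force MAEs); the step size and process count are only examples.

```python
import itertools
import multiprocessing as mp

import numpy as np


def evaluate(rc_tuple):
    """Hypothetical worker: fit and score one (rc_1, rc_2, rc_3) combination."""
    rc_1, rc_2, rc_3 = rc_tuple
    # fit_and_score stands in for the body of the serial loop: build the
    # element_profile, transform the structures, fit LinearRegression and
    # compute the energy/force MAEs.
    e_e, e_f = fit_and_score(rc_1, rc_2, rc_3)  # hypothetical helper
    return rc_1, rc_2, rc_3, e_e, e_f


if __name__ == "__main__":
    grid = itertools.product(
        np.arange(4, 6, 0.1),  # a coarser step keeps the search tractable
        np.arange(4, 6, 0.1),
        np.arange(4, 6, 0.1),
    )
    # One process per available core; keep n_jobs=1 inside the describer and
    # LinearRegression when parallelizing the outer loop, to avoid oversubscription.
    with mp.Pool(processes=24) as pool:
        rcut_grid = pool.map(evaluate, grid)
    best = min(rcut_grid, key=lambda row: row[3])  # e.g. lowest energy MAE
    print(best)
```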


Rana-Phy commented 2 years ago

Thanks for your suggestion. It is fast now! Is there any technical reason behind 'divide your search into 24 parallel processes'?

JiQi535 commented 2 years ago

> Thanks for your suggestion. It is fast now! Is there any technical reason behind 'divide your search into 24 parallel processes'?

Happy to know that it helps! I used "24" as an example, as there are 24 cores on each node of the computer cluster our group has access to. This value should be adjusted on different machines to achieve the best efficiency.
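
As a rough default, the process count can be matched to the number of logical cores reported by the standard library:

```python
import multiprocessing

n_procs = multiprocessing.cpu_count()  # logical cores available on this machine
print(f"Using {n_procs} parallel processes")
```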

Rana-Phy commented 2 years ago

Dear Ji Qi,

Ok, now I am seeing that multiprocessing is at least three times slower than `n_jobs=24` in scikit-learn for the larger dataset. Maybe this is because of our cluster setup or my script.

I am also trying to understand the maml base model classes, and my understanding could be completely wrong. For `skl_model = SKLModel(describer=describer, model=LinearRegression())`, isn't SKLModel the model, with the describer holding the hyperparameters (rcut, weights, jmax) and the `model=LinearRegression()` parameters being learned during training/fitting?

May I put the element_profile into a hyperparameter optimization package like Optuna, Hyperopt, or anything else you suggest, instead of using a for loop? For some reason I have not been able to make it work. I will be waiting to hear from you.

Best regards, Rana

JiQi535 commented 2 years ago

Rana-Phy

The describer here is the local environment describer of the SNAP potential, which describes the material structures in a mathematical form. The LinearRegression is the model used in the ML training process to connect the local environment descriptors (input) to the target properties, which are energies, forces and stresses.
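
To make the split concrete, here is a minimal sketch of how the two pieces fit together (the import paths are assumed for recent maml versions and may need adjusting): the describer arguments are fixed hyperparameters chosen before training, while the LinearRegression coefficients are the parameters learned during fitting.

```python
from sklearn.linear_model import LinearRegression

from maml.base import SKLModel            # adjust import paths to your maml version
from maml.describers import BispectrumCoefficients

# Fixed hyperparameters: chosen before training, and what the grid search varies.
describer = BispectrumCoefficients(
    rcutfac=0.5,
    twojmax=6,
    element_profile={
        "Ti": {"r": 5.0, "w": 1.0},  # illustrative cutoff radii and weights
        "Si": {"r": 5.0, "w": 1.0},
        "C": {"r": 4.5, "w": 1.0},
    },
    quadratic=False,
    pot_fit=True,
)

# Learnable parameters: the LinearRegression coefficients, set during fitting.
skl_model = SKLModel(describer=describer, model=LinearRegression())
```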

For parameter tuning, I'm not aware of any existing automatic algorithms for SNAP training. Please let me know if there is one; I would be interested in it. In previous works from our group, we used differential evolution as implemented in scipy for parameter tuning of a SNAP for Mo (http://dx.doi.org/10.1103/PhysRevMaterials.1.043603), and we also used a stepwise grid search for SNAPs for alloy systems (http://dx.doi.org/10.1038/s41524-020-0339-0). Those may be good references for you.
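
As a rough illustration of the differential evolution approach (not the exact setup from the papers above), something along these lines could be used, where `fit_and_score` is again a hypothetical helper that fits the linear model for a given set of cutoffs and returns the energy and force MAEs:

```python
from scipy.optimize import differential_evolution


def objective(rcuts):
    """Scalar objective combining energy and force errors for one cutoff set."""
    rc_ti, rc_si, rc_c = rcuts
    e_mae, f_mae = fit_and_score(rc_ti, rc_si, rc_c)  # hypothetical helper
    return e_mae + 10.0 * f_mae  # example weighting of the two errors


result = differential_evolution(
    objective,
    bounds=[(4.0, 6.0), (4.0, 6.0), (4.0, 6.0)],  # same range as the grid search
    maxiter=30,
    popsize=10,
    seed=42,
)
print(result.x, result.fun)  # best cutoffs and the corresponding objective value
```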