rouyang2017 / SISSO

A data-driven method combining symbolic regression and compressed sensing for accurate & interpretable models.

About cross validation procedure in SISSO #35

Closed jungsdao closed 2 years ago

jungsdao commented 3 years ago

Hello, thank you for developing this fascinating tool for descriptor identification. I have some questions about performing cross validation for regression.

  1. In the paper J. Phys.: Mater. 2, 024002 (2019), standardization of the input features is mentioned. It is also stated that, to avoid contamination between the training and test sets, standardization should be performed within the training set only. My question is: do I need to standardize the input features of the materials when writing the "train.dat" file? Or does the SISSO Fortran code include a standardization step during regression?

If the standardization should be performed outside of the current SISSO code, should I then standardize the training data and the test data separately to avoid the contamination mentioned above? I have seen several posts related to this scaling issue, but I am still confused. #3, #10

  2. Using ST-SISSO, I tried leave-10%-out cross validation on my dataset by randomly selecting the prediction set 30 times, as suggested in the paper (a minimal sketch of how I set up these splits is given after my questions below). This results in 30 models, with some descriptors occurring very frequently; for instance, one 2D descriptor repeatedly gives a small RMSE. Among the 30 models, how can I decide on the optimum descriptor and the corresponding coefficients and intercept? Below is an example of the leave-10%-out CV repeated 30 times.

| CV | RMSE | MaxAE_T | MaxAE_P | Intercept | Coefficients | Descriptor |
|---|---|---|---|---|---|---|
| CV1 | 14.07 | 58.67 | 108.80 | 81.19 | -0.83, 34.61 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))` |
| CV2 | 13.04 | 39.09 | 60.40 | 81.67 | -0.89, 1.19 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Dipole/Type)/log(LUMO))` |
| CV3 | 12.28 | 41.09 | 70.24 | 76.58 | 0.86, 7.56 | `((Dipole/LUMO)/cos(E_pro))`, `((Eb_HF-Eb_Ni)+(Type*Eb_PF5))` |
| CV4 | 13.38 | 39.27 | 58.53 | 81.78 | -0.89, 1.19 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Dipole/Type)/log(LUMO))` |
| CV5 | 12.31 | 38.70 | 79.80 | 79.02 | -0.98, -10.34 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_Ni-Eb_PF5)*abs(Eb_Li-Eb_PF5))` |
| CV6 | 13.07 | 43.51 | 51.02 | 83.42 | 4.74, 0.71 | `(cos(LUMO)/cos(E_pro))`, `((Dipole/LUMO)/log(LUMO))` |
| CV7 | 12.47 | 41.58 | 68.29 | 76.50 | 0.85, 8.00 | `((Dipole/LUMO)/cos(E_pro))`, `((Eb_HF-Eb_Ni)+(Type*Eb_PF5))` |
| CV8 | 13.02 | 58.61 | 43.21 | 76.00 | -0.87, -9.38 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_Ni-Eb_PF5)-(Type*Eb_HF))` |
| CV9 | 13.20 | 40.14 | 57.41 | 82.90 | -0.90, -0.28 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_Ni*Dipole)/log(LUMO))` |
| CV10 | 13.07 | 42.08 | 42.21 | 83.64 | 0.80, 1.13 | `((Dipole/LUMO)/cos(E_pro))`, `((Dipole/Type)/log(LUMO))` |
| CV11 | 13.97 | 59.30 | 31.19 | 77.68 | -0.82, 6.74 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_Ni/Eb_Li)*abs(Eb_Li-Eb_PF5))` |
| CV12 | 14.11 | 57.08 | 297.52 | 93.33 | -0.82, -2.80 | `((Eb_Ni/LUMO)/cos(E_pro))`, `(log(Type)/abs(Eb_Li-Eb_PF5))` |
| CV13 | 10.81 | 36.53 | 104.99 | 85.06 | -7.16, -0.31 | `cos((HOMO/Eb_Li))`, `((Eb_PF5/Eb_HF)/cos(LUMO))` |
| CV14 | 14.01 | 58.78 | 175.01 | 80.32 | -0.81, 36.92 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))` |
| CV15 | 13.05 | 48.58 | 98.92 | 85.57 | 1.18, -2.26 | `((Type/LUMO)/cos(E_pro))`, `(sin(HOMO)/cos(E_pro))` |
| CV16 | 13.28 | 58.42 | 107.78 | 123.92 | -0.83, -50.59 | `((Eb_Ni/LUMO)/cos(E_pro))`, `exp(-abs(Eb_Li-Eb_PF5))` |
| CV17 | 13.93 | 56.94 | 20.18 | 81.26 | -0.79, 36.41 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))` |
| CV18 | 11.93 | 38.80 | 105.97 | 80.57 | -2.46, 34.30 | `((LUMO*Eb_HF)/cos(E_pro))`, `(abs(Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))` |
| CV19 | 13.71 | 41.30 | 17.27 | 82.92 | 0.78, -0.26 | `((Dipole/LUMO)/cos(E_pro))`, `((Eb_Ni*Dipole)/log(LUMO))` |
| CV20 | 13.29 | 57.89 | 44.36 | 82.85 | 0.72, -20.56 | `((Dipole/LUMO)/cos(E_pro))`, `(abs(Eb_HF-Eb_PF5)-abs(Eb_Li-Eb_PF5))` |
| CV21 | 12.21 | 41.33 | 213.96 | 77.13 | 0.87, 7.37 | `((Dipole/LUMO)/cos(E_pro))`, `((Eb_HF-Eb_Ni)+(Type*Eb_PF5))` |
| CV22 | 13.87 | 58.06 | 113.07 | 74.70 | -0.84, -9.94 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_Ni-Eb_PF5)-(Type*Eb_HF))` |
| CV23 | 12.30 | 44.19 | 337.82 | 80.24 | -11.52, 0.45 | `((Eb_Li/LUMO))^6`, `(exp(-Eb_Ni)/sin(E_pro))` |
| CV24 | 13.54 | 50.41 | 106.22 | 82.99 | 1.37, -5.64 | `((volume/LUMO)/cos(E_pro))`, `((Eb_Li-Eb_PF5)/(Eb_HF/Eb_Li))` |
| CV25 | 12.16 | 37.31 | 69.68 | 80.06 | -0.98, 40.87 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))` |
| CV26 | 13.92 | 42.68 | 128.83 | 87.57 | -0.13, 1.48 | `exp((Dipole/LUMO))`, `(log(Type)/cos(E_pro))` |
| CV27 | 13.59 | 41.22 | 291.34 | 82.68 | 0.76, -0.27 | `((Dipole/LUMO)/cos(E_pro))`, `((Eb_Ni*Dipole)/log(LUMO))` |
| CV28 | 14.33 | 57.50 | 24.78 | 81.69 | -0.81, 35.23 | `((Eb_Ni/LUMO)/cos(E_pro))`, `((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))` |
| CV29 | 13.59 | 40.87 | 51.63 | 82.26 | 0.75, 1.08 | `((Dipole/LUMO)/cos(E_pro))`, `((Dipole/Type)/log(LUMO))` |
| CV30 | 12.57 | 49.80 | 109.57 | 86.00 | 1.96, -2.39 | `((volume/LUMO)/cos(E_pro))`, `(sin(HOMO)/cos(E_pro))` |

  3. I plotted a boxplot, as presented in the paper, for my research project. Unlike the paper, where the overall error decreases as dimension and rung increase (i.e. with increasing complexity), my result shows the opposite: the overall error increases with dimension and rung. How should I interpret this result? Does it mean that my selected features are inappropriate for describing the target property? (I am trying to predict experimental values using DFT-calculated features, which is admittedly a demanding task.) (Attached figure: SISSO_boxplot)

  4. How should I compose the "train.dat" file for MT-SISSO when some features are missing? I think I am having trouble with the attached "train.txt" file: for some materials, several features are not available and cannot be given in the input file, so I left them blank... but is that the correct way to build the input file? If not, how should I prepare it for MT-SISSO?

My questions may be quite naive, but I would appreciate your help! (Attached: train.txt)
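For reference, the way I generate the 30 random leave-10%-out splits (question 2) and count how often each descriptor recurs is roughly the following. This is only a minimal Python sketch: the sample count and the descriptor list are placeholders, and collecting the descriptors from the individual SISSO runs is done separately.

```python
import random
from collections import Counter

# Minimal sketch: 30 random leave-10%-out splits over the sample indices.
# n_samples and the descriptor list below are placeholders, not SISSO output.
n_samples = 100           # total number of materials in train.dat
n_splits = 30
test_size = max(1, n_samples // 10)

random.seed(0)
splits = []
for _ in range(n_splits):
    test_idx = sorted(random.sample(range(n_samples), test_size))
    train_idx = [i for i in range(n_samples) if i not in test_idx]
    splits.append((train_idx, test_idx))
    # each (train_idx, test_idx) pair becomes the train.dat / prediction set
    # of one SISSO run

# After the 30 runs, tally how often each descriptor component recurs
# (the 2D descriptors are assumed to have been collected from each run)
cv_descriptors = [
    ("((Eb_Ni/LUMO)/cos(E_pro))", "((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))"),
    # ... one tuple per CV run ...
]
counts = Counter(component for model in cv_descriptors for component in model)
for component, freq in counts.most_common():
    print(f"{freq:3d}x  {component}")
```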

rouyang2017 commented 3 years ago

Hi,

  1. I suggest not standardizing the input primary features in the train.dat file; otherwise the physical meaning of the mathematical combinations of the primary features may be lost. The SISSO code performs standardization ONLY inside the SIS routine, for the evaluation of feature importance, but never for feature construction or for the final model.

  2. Usually we use the whole-data descriptor (and the corresponding coefficients) rather than the ones from CV. We do CV to check the stability of the whole-data descriptor (i.e. the sensitivity of the descriptor form to the samples), e.g. how many times the whole-data descriptor is re-identified across the CV runs. Of course, you can still report the average CV error regardless of the varying forms of the CV descriptors.

  3. Generally, an increase in the prediction error signals overfitting, but understanding why requires a closer look at your primary features and your data.

  4. "Blank" is not allowed in the code. You could simply remove those features or those materials of missing data, or provide estimated (e.g. interpolation ? ) values to those missing data.

jungsdao commented 3 years ago

Thank you Dr. Ouyang for your reply.

Your answer clarifies a lot for me. I have one more question, about multi-task learning: can I perform cross validation for MT-SISSO? Since the number of samples in each task of MT-SISSO differs, it is hard to apply the same cross-validation scheme in this case, and the provided utilities seem to support cross validation only for ST-SISSO. If there is a recommended protocol for cross validation (e.g. leave-10%-out) with MT-SISSO, that would be appreciated.

rouyang2017 commented 3 years ago

Yes, CV for MT-SISSO is more complicated than for ST-SISSO, because in the MT case the samples and the coefficients can differ between tasks. Users can design CV schemes suitable for their specific applications and purposes (e.g. J. Phys.: Mater. 2, 024002 (2019) describes two solutions). It is also possible to test on unseen data, even on new tasks beyond the training data.
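One possibility (only a sketch of an assumed per-task scheme, not a utility shipped with SISSO) is to draw the leave-10%-out split independently within each task, so that every task keeps most of its own samples for training and the task structure of the training file is preserved:

```python
import random

# Minimal sketch: leave ~10% out independently within each task.
# "tasks" maps a task name to the indices of its samples (placeholders).
tasks = {
    "task1": list(range(0, 40)),
    "task2": list(range(40, 95)),   # tasks may have different sample counts
}

random.seed(0)
train_idx, test_idx = [], []
for name, indices in tasks.items():
    k = max(1, len(indices) // 10)             # ~10% of this task's samples
    held_out = set(random.sample(indices, k))
    test_idx += sorted(held_out)
    train_idx += [i for i in indices if i not in held_out]

# train_idx / test_idx can then be used to write the MT-SISSO training file
# (samples grouped by task) and the corresponding prediction set
print(len(train_idx), "training samples,", len(test_idx), "held-out samples")
```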