Hi,
I suggest not standardizing the input primary features in the train.dat file; otherwise the physical meaning of the mathematical combinations of the primary features may be lost. The SISSO code performs standardization ONLY inside the SIS routine, for evaluating feature importance, and never for feature construction or for the final model.
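(Not part of SISSO itself, just a toy Python illustration of this point, with made-up variable names: combining standardized features is not the same as combining the raw, physically meaningful ones.)

```python
# Minimal sketch: standardizing primary features before combining them
# changes the meaning of the combination (names are hypothetical).
import numpy as np

rng = np.random.default_rng(0)
A = rng.uniform(1.0, 5.0, 20)   # e.g. a binding energy (hypothetical)
B = rng.uniform(0.1, 2.0, 20)   # e.g. a LUMO level (hypothetical)

ratio_raw = A / B                              # physically meaningful ratio
A_std = (A - A.mean()) / A.std()
B_std = (B - B.mean()) / B.std()
ratio_std = A_std / B_std                      # dimensionless, can change sign or blow up

print(np.corrcoef(ratio_raw, ratio_std)[0, 1])  # generally far from 1
```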
Usually we use the whole-data descriptor (and the corresponding coefficients) instead of the ones from CV. We do CV to check the stability of the whole-data descriptor (i.e., the sensitivity of the descriptor form to the choice of samples), e.g. how many times the whole-data descriptor is identified in the CV. Of course you can still report the average CV error regardless of the different forms of the CV descriptors.
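(A minimal sketch of this bookkeeping, assuming the per-fold results have been collected into a list like the CV table later in this thread; the whole-data descriptor shown here is only a placeholder.)

```python
# Count how often the whole-data descriptor reappears in CV and report the mean CV RMSE.
from collections import Counter
import statistics

cv_results = [
    (14.07, ("((Eb_Ni/LUMO)/cos(E_pro))", "((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))")),
    (13.04, ("((Eb_Ni/LUMO)/cos(E_pro))", "((Dipole/Type)/log(LUMO))")),
    # ... remaining folds
]

whole_data_descriptor = ("((Eb_Ni/LUMO)/cos(E_pro))", "((Dipole/Type)/log(LUMO))")  # assumed

hits = sum(1 for _, d in cv_results if d == whole_data_descriptor)
most_common = Counter(d for _, d in cv_results).most_common(1)[0]

print(f"whole-data descriptor recovered in {hits}/{len(cv_results)} folds")
print(f"most frequent CV descriptor: {most_common}")
print(f"mean CV RMSE: {statistics.mean(r for r, _ in cv_results):.2f}")
```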
Generally, an increasing prediction error signals overfitting, but understanding why requires a closer look at your primary features and your data.
"Blank" is not allowed in the code. You could simply remove those features or those materials of missing data, or provide estimated (e.g. interpolation ? ) values to those missing data.
Thank you, Dr. Ouyang, for your reply.
Your answer clarifies a lot for me. I have one more question, about multi-task learning. Can I perform cross-validation for MT-SISSO? Since the number of samples included in each task of MT-SISSO differs, it is hard to apply the same cross-validation scheme in this case. It seems the provided utilities support cross-validation only for ST-SISSO. If there is a recommended protocol for cross-validation of MT-SISSO (e.g. leave-10%-out), it would be appreciated.
Yes, CV for MT-SISSO is more complicated than for ST-SISSO, as in the MT case the samples and coefficients for the different tasks can differ. Users can design CV schemes suitable for their specific applications and purposes (e.g. J. Phys.: Mater. 2, 024002 (2019) describes two solutions). It is also possible to test on unseen data (even on new tasks beyond the training data).
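(One possible way to set up a leave-10%-out scheme per task; this is only a sketch of one option, not a SISSO utility, and the task sizes below are placeholders.)

```python
# Split each task's samples separately so every fold preserves the task structure.
import numpy as np

rng = np.random.default_rng(0)
tasks = {"task1": np.arange(40), "task2": np.arange(25)}  # sample indices per task (hypothetical)

def leave_fraction_out(indices, fraction=0.1):
    """Return (train, test) index arrays with ~`fraction` of samples held out."""
    shuffled = rng.permutation(indices)
    n_test = max(1, int(round(fraction * len(indices))))
    return shuffled[n_test:], shuffled[:n_test]

splits = {name: leave_fraction_out(idx) for name, idx in tasks.items()}
for name, (train, test) in splits.items():
    print(name, "train:", len(train), "test:", len(test))
```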
Hello, thank you for developing these fascinating tools for descriptor identification. I have some questions about performing cross-validation for regression.
If standardization should be performed outside the current SISSO code, should I standardize the training data and test data separately to avoid the above-mentioned contamination of the dataset? I have seen several posts related to this scaling issue, but I am still confused... #3, #10
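(For reference, when data is scaled externally at all - and note the reply above advises against standardizing the primary features - the usual leakage-free recipe is to fit the scaler on the training split only and then apply the same transform to the test split. A minimal sklearn sketch with placeholder arrays:)

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.rand(80, 5)  # hypothetical training feature matrix
X_test = np.random.rand(20, 5)   # hypothetical test feature matrix

scaler = StandardScaler().fit(X_train)   # statistics come from training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # never fit on, or together with, the test data
```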
CV RMSE MaxAE_T MaxAE_P interc coeffs. Descriptor
CV1 14.07 58.67 108.80 81.19 -0.83, 34.61 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))]
CV2 13.04 39.09 60.40 81.67 -0.89, 1.19 [((Eb_Ni/LUMO)/cos(E_pro))], [((Dipole/Type)/log(LUMO))]
CV3 12.28 41.09 70.24 76.58 0.86, 7.56 [((Dipole/LUMO)/cos(E_pro))], [((Eb_HF-Eb_Ni)+(Type*Eb_PF5))]
CV4 13.38 39.27 58.53 81.78 -0.89, 1.19 [((Eb_Ni/LUMO)/cos(E_pro))], [((Dipole/Type)/log(LUMO))]
CV5 12.31 38.70 79.80 79.02 -0.98, -10.34 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_Ni-Eb_PF5)*abs(Eb_Li-Eb_PF5))]
CV6 13.07 43.51 51.02 83.42 4.74, 0.71 [(cos(LUMO)/cos(E_pro))], [((Dipole/LUMO)/log(LUMO))]
CV7 12.47 41.58 68.29 76.50 0.85, 8.00 [((Dipole/LUMO)/cos(E_pro))], [((Eb_HF-Eb_Ni)+(Type*Eb_PF5))]
CV8 13.02 58.61 43.21 76.00 -0.87, -9.38 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_Ni-Eb_PF5)-(Type*Eb_HF))]
CV9 13.20 40.14 57.41 82.90 -0.90, -0.28 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_Ni*Dipole)/log(LUMO))]
CV10 13.07 42.08 42.21 83.64 0.80, 1.13 [((Dipole/LUMO)/cos(E_pro))], [((Dipole/Type)/log(LUMO))]
CV11 13.97 59.30 31.19 77.68 -0.82, 6.74 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_Ni/Eb_Li)*abs(Eb_Li-Eb_PF5))]
CV12 14.11 57.08 297.52 93.33 -0.82, -2.80 [((Eb_Ni/LUMO)/cos(E_pro))], [(log(Type)/abs(Eb_Li-Eb_PF5))]
CV13 10.81 36.53 104.99 85.06 -7.16, -0.31 [cos((HOMO/Eb_Li))], [((Eb_PF5/Eb_HF)/cos(LUMO))]
CV14 14.01 58.78 175.01 80.32 -0.81, 36.92 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))]
CV15 13.05 48.58 98.92 85.57 1.18, -2.26 [((Type/LUMO)/cos(E_pro))], [(sin(HOMO)/cos(E_pro))]
CV16 13.28 58.42 107.78 123.92 -0.83, -50.59 [((Eb_Ni/LUMO)/cos(E_pro))], [exp(-abs(Eb_Li-Eb_PF5))]
CV17 13.93 56.94 20.18 81.26 -0.79, 36.41 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))]
CV18 11.93 38.80 105.97 80.57 -2.46, 34.30 [((LUMO*Eb_HF)/cos(E_pro))], [(abs(Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))]
CV19 13.71 41.30 17.27 82.92 0.78, -0.26 [((Dipole/LUMO)/cos(E_pro))], [((Eb_Ni*Dipole)/log(LUMO))]
CV20 13.29 57.89 44.36 82.85 0.72, -20.56 [((Dipole/LUMO)/cos(E_pro))], [(abs(Eb_HF-Eb_PF5)-abs(Eb_Li-Eb_PF5))]
CV21 12.21 41.33 213.96 77.13 0.87, 7.37 [((Dipole/LUMO)/cos(E_pro))], [((Eb_HF-Eb_Ni)+(Type*Eb_PF5))]
CV22 13.87 58.06 113.07 74.70 -0.84, -9.94 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_Ni-Eb_PF5)-(Type*Eb_HF))]
CV23 12.30 44.19 337.82 80.24 -11.52, 0.45 [((Eb_Li/LUMO))^6], [(exp(-Eb_Ni)/sin(E_pro))]
CV24 13.54 50.41 106.22 82.99 1.37, -5.64 [((volume/LUMO)/cos(E_pro))], [((Eb_Li-Eb_PF5)/(Eb_HF/Eb_Li))]
CV25 12.16 37.31 69.68 80.06 -0.98, 40.87 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))]
CV26 13.92 42.68 128.83 87.57 -0.13, 1.48 [exp((Dipole/LUMO))], [(log(Type)/cos(E_pro))]
CV27 13.59 41.22 291.34 82.68 0.76, -0.27 [((Dipole/LUMO)/cos(E_pro))], [((Eb_Ni*Dipole)/log(LUMO))]
CV28 14.33 57.50 24.78 81.69 -0.81, 35.23 [((Eb_Ni/LUMO)/cos(E_pro))], [((Eb_HF-Eb_Li)*abs(Eb_Li-Eb_PF5))]
CV29 13.59 40.87 51.63 82.26 0.75, 1.08 [((Dipole/LUMO)/cos(E_pro))], [((Dipole/Type)/log(LUMO))]
CV30 12.57 49.80 109.57 86.00 1.96, -2.39 [((volume/LUMO)/cos(E_pro))], [(sin(HOMO)/cos(E_pro))]
I plotted a boxplot, as presented in the paper, for my research project. Unlike the results in the paper, where the overall error decreases as the dimension and rung increase (increasing complexity), my results indicate the opposite: the overall error increases with dimension and rung. How should I interpret this result? Does it mean that my selected features are inappropriate for describing the target property? (I'm trying to predict experimental values using DFT-calculated features, which is actually a pretty demanding task.)
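(For reference, a minimal matplotlib sketch of the kind of boxplot described here; the model labels and numbers are placeholders, not real results.)

```python
import matplotlib.pyplot as plt

# CV RMSE grouped by model complexity (hypothetical values)
cv_rmse_by_model = {
    "rung2, 1D": [14.1, 13.5, 13.9],
    "rung2, 2D": [13.1, 12.3, 13.0],
    "rung3, 2D": [13.8, 14.2, 13.6],
}

data = list(cv_rmse_by_model.values())
plt.boxplot(data)
plt.xticks(range(1, len(data) + 1), cv_rmse_by_model.keys())
plt.ylabel("CV RMSE")
plt.show()
```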
How should I compose the "train.dat" file for MT-SISSO if I have some missing features? I think I'm having trouble with the attached "train.txt" file. For some materials, several features are not available and cannot be given in the input file, so I left them blank... but is that the correct way of making the input file? If not, how can I make it for MT-SISSO?
My questions may be quite naive, but I would appreciate your help! train.txt