rouyang2017 / SISSO

A data-driven method combining symbolic regression and compressed sensing for accurate & interpretable models.
Apache License 2.0

Selection of SIS subspace for respective dimensions #42

Closed by jungsdao 2 years ago

jungsdao commented 3 years ago

Hello, Dr. Ouyang

I'd like to ask about setting a separate SIS subspace size for each dimension. As far as I know, although the SIS value itself is not a hyperparameter, it influences the probability of finding a better descriptor. Several previous papers set a different value for each dimension. I'd also like to set a different SIS value for each dimension, decreasing as the dimension increases. However, it does not work as I expected when I run the code.

I set the SIS value for each dimension with subs_sis=6105, 2000, 1000, 100, 50 in the SISSO.in file, but according to SISSO.out these values are not applied correctly, and the overall calculation takes unreasonably long. Please find the input and output files attached.

One more question, about the baseline: in your previous paper (Journal of Physics: Materials 2.2 (2019): 024002), you argued that the RMSD of a SISSO model should be smaller than the standard deviation of the reference (target) values (aka the baseline). If the prediction RMSD from SISSO is not lower than the baseline value, does that mean the model is overfitting or not learning properly from the data?

Any comment would be appreciated and if you need further information, please tell me. Many thanks in advance.

Best regards,
Hyunwook

Attachment: SISSO_SIS.zip

rouyang2017 commented 3 years ago

Hi Hyunwook,

From SISSO.out I see the following information:

    Total number of features in the space phi00: 12
    Total number of features in the space phi01: 102
    Total number of features in the space phi02: 6105
    Size of the SIS-selected subspace from phi02: 3305

This means that your total feature space has only 3305 features! Thus, when you set subs_sis=6105, 2000, 1000, 100, 50, the whole feature space is already selected at the 1st dimension. There are no more features left for higher dimensions (note that the total feature space is the same for all dimensions).
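To make the arithmetic concrete, here is a minimal Python sketch of how the requested sizes compare with the available features. It is not SISSO code: the helper name allocate_sis is hypothetical, and it assumes that features selected for one dimension are no longer available to later dimensions, which is how I read the explanation above.

```python
def allocate_sis(total_features, requested):
    """Return how many features each dimension could actually select.

    Assumption (not taken from the SISSO source): every dimension draws
    from the same pool, and features already selected are used up.
    """
    remaining = total_features
    granted = []
    for want in requested:
        take = min(want, remaining)  # cannot select more than what is left
        granted.append(take)
        remaining -= take
    return granted


requested = [6105, 2000, 1000, 100, 50]   # subs_sis values from SISSO.in
total = 3305                              # size reported in SISSO.out

print(sum(requested))                  # 9255, far more than 3305
print(allocate_sis(total, requested))  # [3305, 0, 0, 0, 0]
```

Under the same assumption, a schedule whose sum stays within the 3305 available features, for example the purely illustrative 2000, 1000, 200, 50, 50, would leave something for every dimension.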

The training RMSE (root mean squared error) should be well below the SD (population standard deviation) if the fit is good, so I believe a prediction RMSE that is even greater than the SD must indicate overfitting.
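The reason the SD works as a baseline is that a trivial model which always predicts the mean of the targets has an RMSE exactly equal to their population standard deviation. A small Python sketch of that check follows; the data and the helper names rmse and beats_baseline are made up for illustration, and it assumes the RMSE and the SD are computed on the same set of target values.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def beats_baseline(y_true, y_pred):
    """True if the model's RMSE is below the population SD of the targets."""
    return rmse(y_true, y_pred) < statistics.pstdev(y_true)


# Illustrative values only, not data from this issue.
y = [1.0, 2.0, 3.0, 4.0, 5.0]

# Always predicting the mean reproduces the baseline exactly.
mean_model = [statistics.fmean(y)] * len(y)
assert math.isclose(rmse(y, mean_model), statistics.pstdev(y))

good_model = [1.1, 1.9, 3.2, 3.9, 5.1]
print(beats_baseline(y, good_model))   # True
print(beats_baseline(y, mean_model))   # False: equal to the baseline, not below it
```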

Best Regards