rouyang2017 / SISSO

A data-driven method combining symbolic regression and compressed sensing for accurate & interpretable models.
Apache License 2.0

On the construction of high-dimensional feature space #67

Open souno1218 opened 3 months ago

souno1218 commented 3 months ago

When constructing a high-dimensional feature space from primary features, the SISSO Fortran code appears to work recursively, applying the operator set to arbitrary elements of the feature space to expand it.

We believe the code may omit features of a certain form. If that is the case, we would be grateful for any feedback you could provide regarding the omitted features. We hope you will forgive us if this question stems from a careless mistake on our part.

Specifically, we have been unable to find features such as A/(B+(C+D)) and A*(B+(B+C)), which we believe should be in the feature space when it is expanded from the data set below (the numerical values were generated randomly). We would also be grateful for any insight into other features that may be omitted during the expansion.

train.dat
========================================
name, A, B, C, D, E,
data_0 0.3565 0.5772 0.2283 0.9890 0.1637
data_1 0.5920 0.8401 0.4583 0.0922 0.8506
data_2 ….
========================================

SISSO.in
========================================
nsf= 5
ops='(+)(-)(*)(/)'
fcomplexity=3
funit=(1:5)
nf_sis=200000
========================================

rouyang2017 commented 3 months ago

Hi, this is a detail that we did not note in the paper PRM 2, 083802 (2019).

To see the expressions A/(B+(C+D)) and A*(B+(B+C)), please increase fcomplexity to 4 (fcomplexity=4).

Following the feature generation scheme in PRM 2, 083802 (2019), we have the feature space Phi of different rungs. Phi_0 contains all the primary features (fcomplexity=0); Phi_1 contains all the features with fcomplexity<=1; Phi_2 contains all the features with fcomplexity<=2 and MOST features with fcomplexity=3; Phi_3 contains features with higher fcomplexity ...

Unfortunately, the expression A/(B+(C+D)) appears in Phi_3 but not in Phi_2, even though its fcomplexity is 3 (3 operators). The reason is that it involves 3 recursive calls (Phi_3), i.e.: 1) C+D 2) B+(C+D) 3) A/(B+(C+D))
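The difference between operator count and rung can be illustrated with a minimal Python sketch. This is not the SISSO Fortran code: it assumes fcomplexity equals the number of binary operators in an expression and that the rung equals the nesting depth of the expression tree (one level per recursive expansion). The tuple representation and the function names `operator_count` and `rung` are hypothetical, chosen for illustration only.

```python
# Illustrative sketch only, not SISSO's implementation.
# An expression is either a primary-feature name (str) or a tuple
# (operator, left, right) for a binary operation.

def operator_count(expr):
    """Number of operators in the expression (assumed to match fcomplexity)."""
    if isinstance(expr, str):
        return 0
    _, left, right = expr
    return 1 + operator_count(left) + operator_count(right)

def rung(expr):
    """Nesting depth of the expression (assumed to match the Phi_n index)."""
    if isinstance(expr, str):
        return 0
    _, left, right = expr
    return 1 + max(rung(left), rung(right))

examples = {
    "A/(B+(C+D))": ("/", "A", ("+", "B", ("+", "C", "D"))),
    "(A+B)*(C+D)": ("*", ("+", "A", "B"), ("+", "C", "D")),
    "A+(B/(C+(D+E)))": ("+", "A", ("/", "B", ("+", "C", ("+", "D", "E")))),
}

for name, expr in examples.items():
    print(f"{name}: operators={operator_count(expr)}, rung={rung(expr)}")
```

Under these assumptions, A/(B+(C+D)) and (A+B)*(C+D) both have 3 operators, but the first needs three nested expansions while the second needs only two, which is consistent with only the second being reachable in Phi_2.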

These should answer your question.

rouyang2017 commented 3 months ago

Here I used just 4 samples and a small nf_sis, so I do not see A*(B+(B+C)). It can be found by increasing nf_sis.

souno1218 commented 3 months ago

I'm very grateful for your quick reply. I think I may now have a better grasp on what Phi_N signifies in the output file. If I've understood correctly, Phi_3 represents the upper limit, and even with fcomplexity = 7, Phi_4 was not calculated. Could I just check whether I've understood correctly that in this case, a calculation like A+(B/(C+(D+E))) is difficult in principle? I'm not proficient in English, so I'm using DeepL. I apologize if I've been impolite or if I've misunderstood.

rouyang2017 commented 3 months ago

Yes. Phi_4 and higher rungs require too much memory to be feasible in the current code. Future versions will make Phi_4 possible.

That's a good question. A+(B/(C+(D+E))) seems to be in Phi_4, which requires computing a high-rung feature space. We are planning to work on this in the near future.
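A rough sense of why each additional rung is so costly can be had from a naive upper bound, sketched below in Python. It assumes that every ordered pair of features in the previous rung is combined with each binary operator and that nothing is pruned. The actual SISSO code does prune (for example by units, duplicates, and fcomplexity), so real feature counts are smaller; the function name `naive_upper_bound` is hypothetical.

```python
# Illustrative upper bound only; SISSO's real feature counts are smaller
# because the code removes duplicates and enforces unit and complexity rules.

def naive_upper_bound(n_primary, n_binary_ops, max_rung):
    """Return a list of unpruned size bounds for Phi_0 .. Phi_max_rung."""
    sizes = [n_primary]
    for _ in range(max_rung):
        prev = sizes[-1]
        # Keep all previous features and add op(x, y) for every ordered pair.
        sizes.append(prev + n_binary_ops * prev * prev)
    return sizes

# Five primary features and four operators, as in the SISSO.in above.
for n, size in enumerate(naive_upper_bound(5, 4, 4)):
    print(f"Phi_{n}: at most {size:.3e} features")
```

Because the bound roughly squares at every rung, it grows doubly exponentially, which fits the statement that Phi_4 is out of reach for the current code.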

souno1218 commented 3 months ago

I'm grateful for your help in clarifying this matter. I'm pleased to see the ongoing progress!