rouyang2017 / SISSO

A data-driven method combining symbolic regression and compressed sensing for accurate & interpretable models.
Apache License 2.0

Feature dimensions #41

Closed hgandhi2411 closed 2 years ago

hgandhi2411 commented 3 years ago

Dr. Ouyang,

Can you please explain how the feature dimensions work? In the input script, suppose I have dimclass=(1:2)(3:4)(5:5). Does this assign dimensions as (mass, length, time), or does it just group features and assume they have the same dimensions?

My data looks as follows and I have also added their units here. How would I make sure that equations given by SISSO are dimensionally consistent? I'm currently using dimclass=(1:2)(3:3) and the equations generated don't seem right.

Materials  Del_P      Pipe_D (m)  Inlet_V (m/s)  angle (deg)
sample1    24.937540  0.005       0.020          1.0
sample2    23.688087  0.005       0.019          2.0
sample3    22.438908  0.015       0.018          1.0
sample4    21.190007  0.025       0.017          4.0
sample5    19.941388  0.007       0.016          5.0
...        ...        ...         ...            ...

rouyang2017 commented 3 years ago

If you set dimclass=(1:2)(3:4)(5:5), it means that in your train.dat file the 1st and 2nd features have the same unit and can be linearly combined. Likewise, (3:4) means the 3rd and 4th features share a unit, and (5:5) means the 5th feature has yet another unit. This grouping of features excludes unreasonable linear combinations, such as mass + length, and the dimensional analysis is applied throughout the construction of the feature space. Note that in the final output model y=sum(cx), we assume the coefficients c carry units so that every term cx has the same unit as the target y; thus features x with any units can appear in the linear model.

In your example, setting dimclass=(1:2)(3:3) means Pipe_D and Inlet_V have the same unit (which is not correct) and angle has another unit. You should therefore set dimclass=(1:1)(2:2)(3:3). If you want Inlet_V to be treated as dimensionless, use dimclass=(1:1)(3:3); that is, simply leave that feature out of every round bracket.
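
To make the grouping rule above concrete, here is a minimal Python sketch of how a dimclass string could be read. It is not taken from the SISSO source: the function names parse_dimclass and can_add are hypothetical, and treating two ungrouped (dimensionless) features as addable is an assumption.

```python
import re

def parse_dimclass(spec, n_features):
    """Return one group id per feature (1-based feature order).

    Features inside the same (start:end) bracket share a group id.
    Features that appear in no bracket get id 0, meaning dimensionless.
    Hypothetical helper, not SISSO's actual parser.
    """
    groups = [0] * n_features
    brackets = re.findall(r"\((\d+):(\d+)\)", spec)
    for gid, (start, end) in enumerate(brackets, start=1):
        for i in range(int(start), int(end) + 1):
            groups[i - 1] = gid
    return groups

def can_add(groups, i, j):
    """Two features (1-based indices) may be added only if they share a group."""
    return groups[i - 1] == groups[j - 1]

# Grouping from the reply above: features 1-2, 3-4, and 5 form three unit classes.
example = parse_dimclass("(1:2)(3:4)(5:5)", 5)

# Pipe_D, Inlet_V, angle with Inlet_V left out of every bracket (dimensionless).
pipe = parse_dimclass("(1:1)(3:3)", 3)
```

With this reading, features 1 and 2 of the first example can be added, while features 2 and 3 cannot.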

hgandhi2411 commented 3 years ago

Dr. Ouyang, is it possible to express derived units using the dimclass variable? For example, if in my train.dat I have feature columns for mass, length and density, can density's units be expressed as mass/(length)^3? What's the best way of going about this?

rouyang2017 commented 3 years ago

That is not implemented in the current code, but you can change the code. Assuming you have three features in the train.dat file, arranged as feature1(unit: mass) feature2(unit: length) feature3(unit: mass/(length)^3), insert the following lines in the file SISSO.f90, right after the line " call read_para_b ":

pfdim(:,1)=(/1.0, 0.0/)   ! unit-vector for the 1st feature (assuming it is mass)
pfdim(:,2)=(/0.0, 1.0/)   ! unit-vector for the 2nd feature (assuming it is length)
pfdim(:,3)=(/1.0, -3.0/)  ! unit-vector for the 3rd feature (mass/(length)^3)

Recompile the code and it should work. Please check the output in SISSO.out to confirm this.
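
The unit vectors in the snippet above are exponent vectors over the base units (mass, length). As a rough illustration of why density comes out as (1.0, -3.0), the Python sketch below applies the usual rules of dimensional analysis: multiplying quantities adds exponents, dividing subtracts them, and raising to a power scales them. The helper names are hypothetical and this is not SISSO code.

```python
# Exponent vectors over the base units (mass, length), as in the pfdim lines.
MASS = (1.0, 0.0)
LENGTH = (0.0, 1.0)

def multiply(u, v):
    """Unit of a product: exponents add."""
    return tuple(a + b for a, b in zip(u, v))

def divide(u, v):
    """Unit of a quotient: exponents subtract."""
    return tuple(a - b for a, b in zip(u, v))

def power(u, p):
    """Unit of a quantity raised to the power p: exponents scale by p."""
    return tuple(a * p for a in u)

def same_unit(u, v):
    """Two quantities can be added or subtracted only if their vectors match."""
    return u == v

# density = mass / length^3
density = divide(MASS, power(LENGTH, 3))

# density * length^3 has the unit of mass again, so it could be added to a mass.
recovered_mass = multiply(density, power(LENGTH, 3))
```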

hgandhi2411 commented 3 years ago

Prof. Ouyang, your suggestion worked well for my project. However, it is hard-coded, so I was wondering what the easiest way would be to make this a user input in Fortran, reading in the pfdim matrix directly so that users can group features as they wish?

rouyang2017 commented 3 years ago

Thanks. Will make this happen.

pmiam commented 1 year ago

For anyone else trying to use the feature_units file to designate the derived units of predictor variables as described in this thread, take note that it is necessary to have at least as many opening parentheses "(" in the funit string as you have basis units in your file.

For example:

The head of a feature_units file containing 3 features: one with length units, one unitless, and one with density units

1 0 0 0 0 0
0 0 0 0 0 0 
0 1 -1 0 0 0

Then, in SISSO.in, write funit:

funit=(L)(m)(V)(E)(mol)(T)
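
As a quick sanity check before running SISSO, the rule described in this comment can be tested with a few lines of Python. This sketch only restates that rule (count the "(" characters in funit and compare with the number of columns in feature_units). The function name check_feature_units is hypothetical, and how SISSO itself parses funit is not shown in this thread.

```python
def check_feature_units(funit, rows):
    """Return True if funit has at least one "(" per basis-unit column.

    rows holds the numeric rows of the feature_units file, one per feature.
    Hypothetical helper that mirrors the rule stated in the comment above.
    """
    n_basis = len(rows[0])
    if any(len(r) != n_basis for r in rows):
        raise ValueError("every feature row needs the same number of columns")
    return funit.count("(") >= n_basis

# The feature_units rows and funit string from the example above.
rows = [
    [1, 0, 0, 0, 0, 0],   # length
    [0, 0, 0, 0, 0, 0],   # unitless
    [0, 1, -1, 0, 0, 0],  # density
]
ok = check_feature_units("(L)(m)(V)(E)(mol)(T)", rows)
too_short = check_feature_units("(L)(m)", rows)
```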