rouyang2017 / SISSO

A data-driven method combining symbolic regression and compressed sensing for accurate & interpretable models.
Apache License 2.0

Issues on running the code with my data #6

Closed pritwish closed 5 years ago

pritwish commented 5 years ago

Hi!

I am trying to run this code on my data. I have a few questions about the input file (SISSO.in), and it would be great if you could clarify them.

a. nsample: Does it mean the number of samples I have, i.e. the number of rows in a general dataset? If so, for cases with >=2 properties (nprop), what do the numbers in the file description (given as 'classification: e.g.(4,3,5),(7,9)') mean?

b. nsf: Is it, broadly speaking, the number of features/columns I have?

c. rung: I have assumed that rung means the complexity of the feature space. How is this different from the subsequent parameter 'maxcomplexity'?

d. opset: Doesn't putting in all the available operators seem the best option?

e. dimclass: Is it necessary that all the dimensions be grouped linearly? For example, if features 1, 4 and 7 have the same dimension, how do I group them together? And do I put dimensionless features under the same dimension?

f. Is there any rule of thumb for selecting the number of subspaces (subs_sis)?

g. Does the input data file need to follow a specific format? Can it have character entries in the first few columns instead of just one? Can I ignore any column? If so, how?

Thanks!

rouyang2017 commented 5 years ago

Thank you for your questions.

a. Right. nsample is the number of training samples (the number of rows when nprop=1), and usually we have only one property/one map (nprop=1). The case of many properties (nprop>1) is what we call multi-task learning, and a paper on multi-task learning with SISSO is coming soon. For nprop>1, "nsample=n1,n2,..." means you have n1 samples for property 1, n2 samples for property 2, and so on. Correspondingly, in the train.dat file, rows 1 to n1 (calling the title row row 0) are the data for property 1, rows n1+1 to n1+n2 are the data for property 2, and so on. For a categorical property (classification), each bracket indicates one classification, or one map (each map can still have many groups), and nprop>1 means you are doing multi-task learning to find a common descriptor for many maps. More details can be found in our coming multi-task SISSO paper. Later, I will upload templates for multi-task learning (nprop>1).
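As a minimal sketch of the row layout described above, the following Python snippet computes which data rows of train.dat belong to each property when nprop>1. It assumes each property's block directly follows the previous one, so the second block ends at row n1+n2. The function name `property_row_ranges` and the sample counts are hypothetical and not part of SISSO.

```python
from itertools import accumulate


def property_row_ranges(nsample):
    """Return 1-based inclusive (first, last) data-row ranges, one per property.

    Row 0 is the title row, so the data rows start at row 1.
    """
    ends = list(accumulate(nsample))               # cumulative sample counts
    starts = [1] + [end + 1 for end in ends[:-1]]  # each block follows the previous one
    return list(zip(starts, ends))


# Hypothetical sample counts for three properties (nsample=4,3,5)
ranges = property_row_ranges([4, 3, 5])
print(ranges)  # [(1, 4), (5, 7), (8, 12)]
```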

b. Right. nsf is the number of primary features (columns) that you put in the train.dat file. We called it 'number of scalar features' (a scalar feature is one number per sample) because we were also considering 'vector features' (a feature that is one vector per sample), which are not yet implemented.

c. rung is the number of times the operator set is applied recursively to the feature space. In general, the higher the rung, the more complex the space. For example, with rung=3 (Phi3), the number of mathematical operators in each feature ranges from 0 (primary feature) to 7. In Phi2, the number of operators in each feature ranges from 0 to 5. The parameter 'maxcomplexity' provides extra control over feature complexity: it is the maximal number of operators allowed in one feature. For example, if you have Phi3 and set maxcomplexity=4, then only those features in Phi3 with <=4 operators will be selected for model building. This is quite useful when you have a large set of highly competing candidates and want to pick the simplest model: you reduce maxcomplexity while staying within a certain accuracy.
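The effect of maxcomplexity can be illustrated with a small Python sketch. It represents each feature as a nested tuple, counts its operators, and keeps only the features at or below the limit. The tuple representation and the names `count_operators` and `filter_by_complexity` are assumptions made for illustration; they are not how SISSO stores features internally.

```python
def count_operators(feature):
    """Count the operators in a feature given as a nested tuple.

    A primary feature is a plain string; a derived feature is
    (operator, operand, ...) with one or two operands.
    """
    if isinstance(feature, str):
        return 0
    _op, *operands = feature
    return 1 + sum(count_operators(x) for x in operands)


def filter_by_complexity(features, maxcomplexity):
    """Keep only features whose operator count is <= maxcomplexity."""
    return [f for f in features if count_operators(f) <= maxcomplexity]


# Hypothetical candidates built from primary features A, B, C, D
candidates = [
    "A",                                   # 0 operators
    ("+", "A", "B"),                       # 1 operator
    ("/", ("+", "A", "B"), ("exp", "C")),  # 3 operators
    ("*", ("/", ("+", "A", "B"), ("exp", "C")), ("sqrt", ("-", "C", "D"))),  # 6 operators
]
kept = filter_by_complexity(candidates, maxcomplexity=4)
print(len(kept))  # 3
```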

d. opset. The user needs to decide which operators to use.

e. In the current implementation, features of the same dimension need to be grouped contiguously. For example, if features 1, 4 and 7 have the same dimension, please arrange them next to each other in the train.dat file. Features whose IDs are not listed in dimclass in SISSO.in are dimensionless by default. For example, if you have 10 primary features in total and set dimclass=(1:3)(6:7)(8:8) in the SISSO.in file, then in the train.dat file feature columns 1-3 share one dimension, columns 4-5 are dimensionless, columns 6-7 share another dimension, column 8 has yet another dimension, and columns 9-10 are dimensionless.
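To make the dimclass convention concrete, here is a small Python sketch that expands a dimclass string into a per-column group assignment, treating every column not covered by a range as dimensionless. The helper `parse_dimclass` is hypothetical and only mirrors the rule described above; it is not SISSO's own parser.

```python
import re


def parse_dimclass(dimclass, nsf):
    """Map each 1-based feature column to a dimension group index.

    Columns not covered by any (start:end) range map to None,
    which stands for dimensionless.
    """
    groups = {col: None for col in range(1, nsf + 1)}
    ranges = re.findall(r"\((\d+):(\d+)\)", dimclass)
    for group_id, (start, end) in enumerate(ranges, start=1):
        for col in range(int(start), int(end) + 1):
            groups[col] = group_id
    return groups


groups = parse_dimclass("(1:3)(6:7)(8:8)", nsf=10)
dimensionless = [col for col, g in groups.items() if g is None]
print(dimensionless)  # [4, 5, 9, 10]
```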

f. In general, you can set the size (subs_sis) to a value at which some convergence is observed. In practice, we usually can't do that because of resource limitations. For example, suppose Phi3 with 1 billion features is used and you observe convergence at, let's say, subs_sis = 1% of Phi3 = 10^7. That 1% is already a huge reduction of the total space, yet 10^7 is still too large for most sparsifying operators (SO) such as L0 and LASSO to manage. I would use the largest possible size to raise the success rate of identifying the best model, though we expect SISSO to have a high probability of obtaining the best (or close to the best) model with subs_sis at a small fraction of the total space.
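A back-of-the-envelope count shows why a subspace of 10^7 features is hard for L0. The sketch below assumes L0 is solved by exhaustively fitting every combination of d features from the subspace, which is how exhaustive L0 works in general; it is not a statement about SISSO's internals, and `l0_model_count` is a hypothetical helper.

```python
from math import comb


def l0_model_count(subspace_size, dimension):
    """Number of feature combinations an exhaustive L0 search must fit."""
    return comb(subspace_size, dimension)


# Two-feature models for a few subspace sizes
for size in (10**3, 10**5, 10**7):
    count = l0_model_count(size, 2)
    print(f"subs_sis={size:>8}: {count:.3e} two-feature models")
```

The count grows roughly as subs_sis^d / d!, so each extra descriptor dimension multiplies the cost by about the subspace size.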

g. The train.dat file has to follow the format of the template file. The first line and the first column are character entries, and all the rest (property and feature data) have to be numbers. Columns are separated by spaces. You can use any string in the first line and first column, but they can't be missing because the code needs to read them. The code reads the columns one by one until the number of features specified in SISSO.in has been read. For example, if nsf=10 and you have 20 features in total in the train.dat file, then the first 10 features will be read and the rest will be ignored. Of course, the format can be changed if necessary.
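The reading rule above can be sketched in Python as follows. This is an illustrative reader, not SISSO's own code: it assumes a single numeric property column right after the name column (nprop=1, regression), and the function name `read_train_dat` and the sample data are made up.

```python
import io


def read_train_dat(handle, nsf):
    """Read whitespace-separated rows: name, property, then nsf features.

    Assumes one numeric property column (nprop=1, regression).
    Feature columns beyond nsf are ignored.
    """
    header = handle.readline().split()
    feature_names = header[2:2 + nsf]
    names, targets, features = [], [], []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        names.append(fields[0])
        targets.append(float(fields[1]))
        features.append([float(x) for x in fields[2:2 + nsf]])
    return feature_names, names, targets, features


# Made-up data with three feature columns; only the first two are read
sample = io.StringIO(
    "materials property f1 f2 f3\n"
    "NaCl 1.5 0.1 0.2 0.3\n"
    "KBr 2.5 0.4 0.5 0.6\n"
)
feature_names, names, targets, features = read_train_dat(sample, nsf=2)
print(feature_names)  # ['f1', 'f2']
print(features)       # [[0.1, 0.2], [0.4, 0.5]]
```

To skip a column that is not at the end, one would reorder the columns so that the unwanted ones come after the first nsf features.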

Hope these answers help!

pritwish commented 5 years ago

Thanks a lot for those answers!