rouyang2017 / SISSO

A data-driven method combining symbolic regression and compressed sensing for accurate & interpretable models.
Apache License 2.0
251 stars, 84 forks

SISSO models different for shuffled training data #31

Closed PGaneshIyer closed 4 years ago

PGaneshIyer commented 4 years ago

Hello,

I am getting different results for a regression model when I use the same training data (and input parameters) but shuffle the lines in the train.dat file. This looks like unintended behavior in the code, but it could also be a mistake on my part.

I am attaching the two shuffled training-data files (train_shuffle1.dat and train_shuffle2.dat) and my SISSO.in file. Advice would be much appreciated. I am using SISSO v3.0.

Thanks. Shuffled.zip

PGaneshIyer commented 4 years ago

Update: The problem is not just the shuffling of the lines (rows, i.e. samples) in train.dat, but rather the relabeling of the samples AFTER the shuffling. Certain ways of relabeling them give different models than others. (I get the same results in v3.0 and v3.0.2.)

rouyang2017 commented 4 years ago

Hi, I have tried your data on my computer. Both train_shuffle1.dat and train_shuffle2.dat yield exactly the same SISSO.out (only the wall-clock times differ) using v3.0.2, whether I run it in serial (1 core) or in parallel (8 cores).


PGaneshIyer commented 4 years ago

I uploaded the same file for both shuffled datasets earlier by mistake. Please see the attached zip file with the correct files to reproduce my output. Shuffled.zip

The train.dat files for shuffled1 and shuffled2 contain the same data, shuffled and re-labeled, but they give different SISSO.out files.

The train.dat files for shuffled1 and shuffled1_norelabel contain the same data, shuffled but NOT re-labeled, and they give the same SISSO.out file.

I believe re-labeling the data should not matter. Why does it seem to matter? (I hope you can now reproduce my output.)

Thanks.

rouyang2017 commented 4 years ago

Problem identified!

In train_shuffle1.dat all E values are negative (-26.xxx), but in train_shuffle2.dat one E value (the sample labeled 17) is positive: 26.50094444. So the two train.dat files actually contain different data.

After adding a negative sign to that positive E, I got the same models for both (though with tiny differences in SISSO.out).
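
A check like the one described above can be automated. The following is a minimal sketch, not part of SISSO: it assumes train.dat is whitespace-separated, has one header line, and keeps the sample label in the first column. The function names `load_rows` and `diff_datasets` and the toy values are hypothetical. It compares two files as multisets of numeric rows, ignoring row order and labels, and reports the rows found in only one of them.

```python
from collections import Counter


def load_rows(text, has_header=True, label_col=0):
    """Parse whitespace-separated rows into tuples of floats, dropping the label column.

    Assumption: one header line and a label in column 0 (adjust if your file differs).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if has_header:
        lines = lines[1:]
    rows = []
    for ln in lines:
        fields = ln.split()
        # Round to absorb formatting noise such as trailing zeros.
        rows.append(tuple(round(float(v), 8)
                          for i, v in enumerate(fields) if i != label_col))
    return rows


def diff_datasets(text_a, text_b, **kwargs):
    """Return (rows only in A, rows only in B), ignoring order and labels."""
    a = Counter(load_rows(text_a, **kwargs))
    b = Counter(load_rows(text_b, **kwargs))
    return sorted((a - b).elements()), sorted((b - a).elements())


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the attached files.
    file_a = "label E f1\n1 -26.5 0.1\n2 -26.7 0.2\n"
    file_b = "label E f1\n1 -26.7 0.2\n2 26.5 0.1\n"  # one sign flipped
    only_a, only_b = diff_datasets(file_a, file_b)
    print("only in A:", only_a)
    print("only in B:", only_b)
```

Two empty lists mean the files hold the same samples up to ordering and relabeling; a flipped sign shows up as one unmatched row on each side.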

PGaneshIyer commented 4 years ago

So it was an error in the dataset... that is good news, although I am at a loss as to how it happened, since I was using the bash 'shuf' program. Thank you for digging into this. Take care.
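
For reference, one way to shuffle the rows so that accidental changes are caught immediately is sketched below. It is only an illustration, not a diagnosis of what went wrong here: it assumes a single header line that must stay in place, and the function name `shuffle_rows` and the `seed` parameter are hypothetical. The rows are permuted as unmodified text lines, and the sorted rows are verified to be identical before and after.

```python
import random


def shuffle_rows(text, seed=0, has_header=True):
    """Shuffle data rows reproducibly, keeping the header line (if any) first.

    Assumption: rows are independent lines and only their order may change.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header, rows = (lines[:1], lines[1:]) if has_header else ([], lines)
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded, so the result is repeatable
    # Sanity check: the same lines must be present before and after.
    if sorted(shuffled) != sorted(rows):
        raise ValueError("rows changed during shuffling")
    return "\n".join(header + shuffled) + "\n"


if __name__ == "__main__":
    # Toy data for illustration only.
    src = "label E f1\n1 -26.5 0.1\n2 -26.7 0.2\n3 -26.9 0.3\n"
    print(shuffle_rows(src, seed=1), end="")
```

Any relabeling can then be done as a separate step that rewrites only the label column, which keeps the numeric values out of reach of manual edits.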