Closed: PGaneshIyer closed this issue 4 years ago
Update: The problem is not just the shuffling of the lines (rows or samples) in train.dat, but rather the relabeling of the samples AFTER the shuffling. Certain ways of relabeling them give different models than others (same results in v3.0 and v3.0.2).
Hi, I have tried your data on my computer. Both train_shuffle1.dat and train_shuffle2.dat yield exactly the same SISSO.out (only the wall-clock time differs) using v3.0.2, whether I run it in serial (1 core) or in parallel (8 cores).
I uploaded the same file for the two shuffled datasets earlier by mistake. Please see the attached zip file with the correct files to reproduce my output. Shuffled.zip
train.dat for shuffled1 and shuffled2 contain the same data, shuffled and re-labeled, but they give different SISSO.out files.
train.dat for shuffled1 and shuffled1_norelabel contain the same data, shuffled but NOT re-labeled, and they give the same SISSO.out file.
I believe re-labeling of the data should not matter. Why does it seem to matter? (I hope you can now reproduce my output.)
Thanks.
Problem identified!
In train_shuffle1.dat all E values are negative (-26.xxx), but in train_shuffle2.dat one E value (the sample labeled 17) is positive: 26.50094444. Thus, you have two different train.dat files.
After adding a negative sign to that positive E, I got the same models for both (though with tiny differences in SISSO.out).
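For anyone hitting something similar, a check like the one above can be automated. The following is only a minimal sketch, not part of SISSO: it assumes train.dat is whitespace-delimited with one header line, a sample label in the first column, and numeric values in the remaining columns. The helper names `row_multiset` and `diff_rows` and the sample values are made up for illustration. It compares two files as multisets of rows, ignoring row order and labels, and reports the rows that appear in only one of them.

```python
from collections import Counter


def row_multiset(text, skip_header=True, skip_label=True, ndigits=8):
    """Count the data rows of a train.dat-like text, ignoring row order.

    Assumed layout (not confirmed by the SISSO docs here): one header line,
    then whitespace-separated rows whose first field is the sample label.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if skip_header:
        lines = lines[1:]
    rows = Counter()
    for ln in lines:
        fields = ln.split()
        if skip_label:
            fields = fields[1:]  # drop the label so relabeling is ignored
        rows[tuple(round(float(x), ndigits) for x in fields)] += 1
    return rows


def diff_rows(text_a, text_b, **kwargs):
    """Return (rows only in A, rows only in B) as sorted lists of tuples."""
    a = row_multiset(text_a, **kwargs)
    b = row_multiset(text_b, **kwargs)
    return sorted((a - b).elements()), sorted((b - a).elements())


# Made-up example data: same rows, shuffled and relabeled, one sign flipped.
a = """label E f1
1 -26.1 0.5
2 -26.5 0.7
3 -26.9 0.9
"""
b = """label E f1
1 -26.9 0.9
2 26.5 0.7
3 -26.1 0.5
"""

only_a, only_b = diff_rows(a, b)
print(only_a)  # [(-26.5, 0.7)]
print(only_b)  # [(26.5, 0.7)]
```

If both lists come back empty, the two files hold the same samples and any difference in SISSO.out has to come from somewhere else. To run it on real files, read them with `open(path).read()` and pass the two strings to `diff_rows`.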
So it was an error in the dataset... that is good news, although I am at a loss as to how that happened, as I was using the bash 'shuf' program. Thank you for digging into this. Take care.
Hello,
I am getting different results for a regression model using the same training data (and input parameters) but with shuffled lines in the train.dat file. This seems like unintended behavior of the code, but it could also be because I made a mistake.
I am attaching the two shuffled training data files (train_shuffle1.dat and train_shuffle2.dat) and my SISSO.in file here. Advice would be much appreciated. I am using SISSO v3.0.
Thanks. Shuffled.zip