scikit-learn-contrib / py-earth

A Python implementation of Jerome Friedman's Multivariate Adaptive Regression Splines
http://contrib.scikit-learn.org/py-earth/
BSD 3-Clause "New" or "Revised" License
455 stars, 121 forks

x5 missing value None when changing input file #134

Closed: florakarniav closed this issue 7 years ago

florakarniav commented 7 years ago

Hello,

I have a very strange issue that has been bothering me all day. I am trying to train an Earth model on the values in two files, X.csv and Y.csv, which I parse into arrays as follows: X = genfromtxt('X.csv', delimiter=','). Then I run model.fit(X, y) and y_hat = model.predict(X) before plotting the predicted values against the real ones (as in the most trivial example in the README).
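A minimal sketch of the loading step described above, with two sanity checks added before fitting. The helper name load_xy and the checks are assumptions for illustration and are not part of the original code; the in-memory strings stand in for X.csv and Y.csv.

```python
import io

import numpy as np


def load_xy(x_source, y_source):
    """Load predictors and responses from CSV sources and sanity-check them.

    load_xy is a hypothetical helper; genfromtxt accepts a file name or a
    file-like object.
    """
    X = np.genfromtxt(x_source, delimiter=',')
    y = np.genfromtxt(y_source, delimiter=',')
    # Rows of X and y must stay aligned, so the row counts have to match.
    if X.shape[0] != y.shape[0]:
        raise ValueError("X has %d rows but y has %d" % (X.shape[0], y.shape[0]))
    # genfromtxt fills fields it cannot parse with NaN.
    if np.isnan(X).any() or np.isnan(y).any():
        raise ValueError("input contains missing or unparsable values")
    return X, y


# In-memory stand-ins for X.csv and Y.csv (made-up numbers).
X_demo, y_demo = load_xy(io.StringIO("1,2\n3,4\n5,6\n"),
                         io.StringIO("10\n20\n30\n"))
```

With real files, the call would be load_xy('X.csv', 'Y.csv').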

The problem is that after changing my input files for another use case, I see this output:

Earth Model

Basis Function  Pruned  Coefficient 0  Coefficient 1  Coefficient 2
(Intercept)     No      -2587.46       -3739.86       59.7551
x0              No      28.3284        14.8212        0.0205887
x5              Yes     None           None           None
x5              Yes     None           None           None
x5              Yes     None           None           None
x5              Yes     None           None           None
x5              Yes     None           None           None
x0              Yes     None           None           None
x5              Yes     None           None           None

and instead of predictions that track the real values, I get a linear plot. The code still works with my original .csv files. I run my code as follows:

1) cython --embed -o hello.c hello.py
2) gcc -Os -I /usr/include/python2.7/ -o hello hello.c -lpython2.7 -lpthread -lm -lutil -ldl
3) ./hello

I would be really grateful if someone could help.

jcrudy commented 7 years ago

@florakarniav It looks like most of the terms are getting pruned. You might try a lower penalty setting (the default is 3). You can also disable pruning altogether using enable_pruning=False. However, pruning is there for a reason - it prevents overfitting. Whether you use pruning or not, you should always use a cross-validation strategy of some kind to assess whether your py-earth model can generalize beyond the training data set.

That being said, it is possible what you're seeing is related to some kind of bug in py-earth, but it's hard to say without seeing your data. Would you be able to share the data, or some other data where you see the same effect? Alternatively, can you post any plots or other summaries that would give me an idea of how these models are fitting? Can you describe the differences between the dataset that's working and the one you're having problems with? It might also be helpful to see the output of Earth.trace.
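The cross-validation advice above can be sketched roughly as follows. This is a plain NumPy k-fold loop, not code from py-earth; the names kfold_mse and LeastSquaresStandIn are assumptions for illustration, and the stand-in estimator is only there so the sketch runs without py-earth installed.

```python
import numpy as np


class LeastSquaresStandIn:
    """Stand-in estimator with fit/predict (hypothetical, not py-earth)."""

    def fit(self, X, y):
        A = np.column_stack([np.ones(len(X)), X])
        self.coef_ = np.linalg.lstsq(A, y, rcond=None)[0]
        return self

    def predict(self, X):
        return np.column_stack([np.ones(len(X)), X]) @ self.coef_


def kfold_mse(make_model, X, y, n_splits=5, seed=0):
    """Return the mean squared error on each held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_splits)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, then score on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = make_model()
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        scores.append(float(np.mean((y[test_idx] - pred) ** 2)))
    return scores


# Made-up data with the same row count as the problem dataset (60 rows).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 6))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - X_demo[:, 5]
demo_scores = kfold_mse(LeastSquaresStandIn, X_demo, y_demo)
```

With py-earth installed, the factory could instead be something like lambda: Earth(penalty=1.0) or lambda: Earth(enable_pruning=False), after from pyearth import Earth. The value 1.0 is only an illustrative choice; comparing held-out errors across settings shows whether a lower penalty actually generalizes better.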

florakarniav commented 7 years ago

With enable_pruning = False it still creates a linear model with this output:

(Intercept)  No  -1.78827e-05  -2.58472e-05  4.12984e-07
x0           No  14.1642       7.41062       0.0102944
x5           No  -0.0878168    -0.126928     0.00202805
x5           No  -0.0878168    -0.126928     0.00202805
x5           No  -0.0878168    -0.126928     0.00202805
x5           No  -0.0878168    -0.126928     0.00202805
x5           No  -0.0878168    -0.126928     0.00202805
x0           No  14.1642       7.41062       0.0102944
x5           No  -0.0878168    -0.126928     0.00202805

About the data, the strange thing is that both files have exactly the same form, but the second, non-working file has fewer rows than the first. I even tried copying the first, working file and deleting some of its lines (so that it has the same number of lines as the non-working file) and it still doesn't work! Of course, I keep the lines in X.csv and Y.csv aligned.

Here is a link to my output plot. The red x marks are the averages of the real values. http://www.filedropper.com/figure1_2

jcrudy commented 7 years ago

Oh, I see. I hadn't noticed that most of your basis functions are identical (linear x5 terms). That's definitely a bug. How small is your data set when this happens? You can try allow_linear=False.

florakarniav commented 7 years ago

No luck with allow_linear=False. The training dataset has 60 entries when this happens but I doubt this is a size problem. It seems to be working with nothing else (larger or smaller) than this particular dataset. Do you have anything else to suggest so I can maybe save the day?

jcrudy commented 7 years ago

Well, you can oversample the training set. Basically, just repeat some or all entries. For example, randomly draw rows with replacement until you have enough rows to get around this bug.
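A minimal sketch of that oversampling workaround, assuming X and y are NumPy arrays with one row per training entry. The helper name oversample_rows and the target of 200 rows are assumptions for illustration; the thread does not say how many rows are enough.

```python
import numpy as np


def oversample_rows(X, y, n_rows, seed=0):
    """Draw n_rows rows with replacement, keeping X and y aligned."""
    rng = np.random.default_rng(seed)
    # One shared index array, so each X row keeps its own y row.
    idx = rng.integers(0, len(X), size=n_rows)
    return X[idx], y[idx]


# Made-up data shaped like the thread's case: 60 rows, 6 predictors, 3 outputs.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(60, 6))
y_demo = 2.0 * X_demo[:, :3]
X_big, y_big = oversample_rows(X_demo, y_demo, n_rows=200)
```

Because the repeated rows carry no new information, any cross-validation should split the original rows first and oversample only the training part; otherwise copies of the same row land on both sides of the split.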

I would like to figure out this bug and fix it in py-earth. Are you able to share the data set at all? You can email it to me privately if you can't share it publicly. My email address is on my GitHub profile. Are you able to share your code?

jcrudy commented 7 years ago

You might also try adding a small amount of Gaussian noise to the training data (using numpy.random.normal). You can combine adding noise with sampling additional rows.
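A rough sketch of the noise suggestion, assuming X is a two-dimensional NumPy array. The helper name add_gaussian_noise and the choice to scale the noise to 1% of each column's standard deviation are assumptions for illustration; the thread only says "a small amount". The sketch uses NumPy's Generator.normal, which draws from the same distribution as numpy.random.normal.

```python
import numpy as np


def add_gaussian_noise(X, rel_scale=0.01, seed=0):
    """Return a copy of X with zero-mean Gaussian noise added per column.

    rel_scale is a hypothetical knob: the noise standard deviation is
    rel_scale times each column's own standard deviation.
    """
    rng = np.random.default_rng(seed)
    scale = rel_scale * X.std(axis=0)  # one scale per column
    return X + rng.normal(loc=0.0, scale=scale, size=X.shape)


# Made-up data: 60 rows, 6 predictors.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(60, 6))
X_noisy = add_gaussian_noise(X_demo)
```

Scaling per column keeps the perturbation small relative to each predictor's range; a constant column gets a scale of zero and is left unchanged.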

florakarniav commented 7 years ago

Ok, thank you very much for your immediate response and help. I will not be able to share my data currently, but I will tell you all about its form and I will also attach my code by email (although it is nothing fancy, just based on the simple example).

jcrudy commented 7 years ago

Thanks, @florakarniav. Once I get your email I'll try to reproduce the problem. Please also include system information if you can: operating system, Python version, etc.

jcrudy commented 7 years ago

@florakarniav Using your report and the code you sent me, I found a bug in py-earth (issue #135) that was allowing for the duplicate linear terms you were seeing. I've fixed the bug and pushed the change to the master branch. Thanks for finding this bug! I'm closing this issue for now. If you get a chance, please try the latest version from the master branch and confirm whether or not it solves your issue. Feel free to comment here or reopen the issue as needed.

florakarniav commented 7 years ago

Yep, the bug is definitely fixed! That was fast, thank you!