sethjuarez / numl

Machine Learning for .NET
http://numl.net
MIT License

LinearRegression Accuracy Always Zero #38

Closed: ghost closed this issue 8 years ago

ghost commented 8 years ago

While taking a more thorough look into the Linear Regression implementation, I'm seeing that Accuracy tends to report as 0%. Here is the code that is currently being used in Learner.cs (dev branch):

            // testing            
            object[] test = GetTestExamples(testingSlice, examples);
            double accuracy = 0;

            for (int j = 0; j < test.Length; j++)
            {
                // items under test
                object o = test[j];

                // get truth
                var truth = Ject.Get(o, descriptor.Label.Name);

                // if truth is a string, sanitize
                if (descriptor.Label.Type == typeof(string))
                    truth = StringHelpers.Sanitize(truth.ToString());

                // make prediction
                var features = descriptor.Convert(o, false).ToVector();

                var p = model.Predict(features);
                var pred = descriptor.Label.Convert(p);

                // assess accuracy
                if (truth.Equals(pred))
                    accuracy += 1;
            }

            // get percentage correct
            accuracy /= test.Length;

Then this is consumed later in Learner.Best:

            var q = from m in models
                    where m.Accuracy == (models.Select(s => s.Accuracy).Max())
                    select m;

            return q.FirstOrDefault();

So basically, it iterates through the testing slice, makes a prediction for each example, and then assesses the prediction against the truth. But currently, it has only one implementation of that assessment: truth.Equals(pred). The result is then consumed in Learner.Best(), which returns the model with the highest (max) value of Accuracy.

This approach means that unless the predicted and true doubles are exactly equal (not likely except for possibly trivial data), LinearRegression will produce 0% Accuracy.
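To make the problem concrete, here is a minimal sketch of two alternatives to exact equality: counting a prediction as correct when it falls within a tolerance, and reporting root mean squared error instead of a hit rate. The RegressionScore class and its method names are hypothetical and are not part of numl; this is only an illustration of the idea, not a proposed implementation.

```csharp
using System;

// Hypothetical helper, NOT part of numl: two ways to score regression output.
public static class RegressionScore
{
    // Fraction of predictions within an absolute tolerance of the truth.
    public static double AccuracyWithin(double[] truth, double[] pred, double tolerance)
    {
        if (truth.Length != pred.Length)
            throw new ArgumentException("truth and pred must have the same length");
        if (truth.Length == 0)
            return 0;

        int hits = 0;
        for (int i = 0; i < truth.Length; i++)
        {
            if (Math.Abs(truth[i] - pred[i]) <= tolerance)
                hits++;
        }

        return (double)hits / truth.Length;
    }

    // Root mean squared error: lower is better, 0 means a perfect fit.
    public static double Rmse(double[] truth, double[] pred)
    {
        if (truth.Length != pred.Length)
            throw new ArgumentException("truth and pred must have the same length");
        if (truth.Length == 0)
            return 0;

        double sumOfSquares = 0;
        for (int i = 0; i < truth.Length; i++)
        {
            double diff = truth[i] - pred[i];
            sumOfSquares += diff * diff;
        }

        return Math.Sqrt(sumOfSquares / truth.Length);
    }
}
```

For truths { 1.0, 2.0, 3.0 } and predictions { 1.05, 2.0, 3.5 }, an exact-equality check counts only the middle value as correct, while AccuracyWithin with a tolerance of 0.1 counts the first two. Note that an error metric such as RMSE reverses the ordering: the best model is the one with the lowest value, which matters for Learner.Best().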

I wanted to abstract this out, but I wanted to get thoughts on how to approach this, as there are a lot of possible routes forward.

We could...

I personally waver between the TestOption approach and the Learner changes. Each has its pros and cons.

With the TestOption approach, we can easily avoid breaking changes. But we would then have to change the Learner.Best() method to branch on what the options instance is, and we end up with a switch statement, or worse, an if-then-else chain.
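A rough sketch of what that branching could look like, assuming a TestOption enum with two values. TestOption's members, ScoredModel, and BestSelector are all hypothetical names made up for illustration; none of them exist in numl as written here.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical: tells Best() how to interpret a model's score.
public enum TestOption
{
    ExactMatch,            // score is a hit rate, higher is better
    RootMeanSquaredError   // score is an error, lower is better
}

// Hypothetical stand-in for a trained model together with its test score.
public sealed class ScoredModel
{
    public string Name { get; }
    public double Score { get; }

    public ScoredModel(string name, double score)
    {
        Name = name;
        Score = score;
    }
}

public static class BestSelector
{
    // Returns the best model for the given option, or null if there are no models.
    public static ScoredModel Best(IEnumerable<ScoredModel> models, TestOption option)
    {
        switch (option)
        {
            case TestOption.ExactMatch:
                return models.OrderByDescending(m => m.Score).FirstOrDefault();
            case TestOption.RootMeanSquaredError:
                return models.OrderBy(m => m.Score).FirstOrDefault();
            default:
                throw new ArgumentOutOfRangeException(nameof(option));
        }
    }
}
```

Every new metric adds another case to the switch, which is the maintenance cost described above.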

With the Learner singleton changes, we could more cleanly address the various capabilities of the Learner class. But this would probably entail breaking changes. I could actually write an ILearnerThing interface that has a default implementation that uses the current static class as-is, and this would avoid breaking changes. However, going forward, we would have a fragmented approach to using the library. Also, this would possibly (probably?) incorporate using DI of some sort which brings along with it more design decisions, i.e. complexity.
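As a sketch of the interface idea, assuming the existing behavior can be wrapped as-is: the default implementation delegates to a static class, and an alternative implementation picks the lowest error instead. Only the ILearnerThing name comes from the paragraph above; StaticLearner is a simplified stand-in for the real static Learner class, and the other type names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Simplified stand-in for the current static Learner class (assumption:
// the real one selects on maximum Accuracy, as in the query shown earlier).
public static class StaticLearner
{
    public static double Best(IEnumerable<double> accuracies)
    {
        return accuracies.DefaultIfEmpty(0).Max();
    }
}

// Interface from the discussion; the member shown here is hypothetical.
public interface ILearnerThing
{
    double Best(IEnumerable<double> scores);
}

// Default implementation: delegates to the static class, so existing
// behavior is preserved and callers of the static API are unaffected.
public sealed class DefaultLearnerThing : ILearnerThing
{
    public double Best(IEnumerable<double> scores)
    {
        return StaticLearner.Best(scores);
    }
}

// Alternative for error metrics such as RMSE, where lower is better.
public sealed class LowestErrorLearnerThing : ILearnerThing
{
    public double Best(IEnumerable<double> scores)
    {
        return scores.DefaultIfEmpty(0).Min();
    }
}
```

Callers that keep using the static class see no change, while new code can accept an ILearnerThing and choose the implementation that matches its metric. How that implementation gets supplied (constructor argument, DI container, and so on) is the extra design decision mentioned above.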

So, those are my thoughts. The goal is simply to get some accuracy with LinearRegression and do it in such a way that if we get a good statistician personage (or maybe one of you already is), it gives them easy access to a more robust assessment of accuracy without getting too YAGNI.

bdschrisk commented 8 years ago

Changes coming soon with new Scoring functionality which addresses this issue.

bdschrisk commented 8 years ago

Fixed in new version.