scidash / neuronunit

A package for data-driven validation of neuron and ion channel models using SciUnit
http://neuronunit.scidash.org

Trying to figure out what goodness of fit is #120

Closed: russelljjarvis closed this issue 6 years ago

russelljjarvis commented 7 years ago

Is a prediction that is within one order of magnitude (a factor of 10) of the observation considered a good enough fit?

Generally all the predictions fall inside the interval [0.10*observation, 10*observation], but they do not seem to be close fits within that interval.

Below are some examples of sciunit scores that I am unsure about. The tables could be confirming either that sciunit score.sort_key is a good indicator of agreement or that it is a bad one. It all depends on whether predictions and observations that are merely within the same order of magnitude count as good enough.

For example, in the Input Resistance Test the observation and prediction differ by 7.494721 × 10^7 ohm. This seems like a large difference to me, but perhaps being within the interval `[0.10*observation, 10*observation]` is actually good? (A small sketch of this interval check follows the table below.)

observation (ohm)   1.206721e+08
prediction (ohm)    4.572486e+07
difference          7.494721e+07
ratio               3.789183e-01
score.sort_key      3.343435e-01
=== Model None achieved score Z = -0.97 on test 'InputResistanceTest'. ===
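
For concreteness, here is a minimal sketch of that interval check (the helper name is mine, not part of neuronunit), using the InputResistanceTest numbers from the table above:

    def within_one_order_of_magnitude(observation, prediction):
        # True if prediction lies in [0.1*observation, 10*observation].
        ratio = prediction / observation
        return 0.1 <= ratio <= 10.0

    obs, pred = 1.206721e+08, 4.572486e+07    # InputResistanceTest values, in ohm
    print(pred / obs)                         # ~0.379, i.e. the 'ratio' row above
    print(within_one_order_of_magnitude(obs, pred))   # True, despite a ~7.5e7 ohm gap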

If so, that would also explain the time constant test, where a difference of 12 ms in time constants is good because the prediction falls inside [0.10*observation, 10*observation]:

observation (ms)   15.734242
prediction (ms)     3.160408
difference         12.573835
ratio               0.200862
score.sort_key      0.085486
=== Model None achieved score Z = -1.72 on test 'TimeConstantTest'. ===

And likewise for the 'CapacitanceTest', where a prediction that falls inside [0.10*observation, 10*observation] is also good:

observation (F)   1.505842e-10
prediction (F)    6.911793e-11
difference        8.146624e-11
ratio             4.589986e-01
score.sort_key    5.597462e-01

Resting potential tests, however, generally resulted in very good fits/agreement, with predictions inside [0.75*observation, 1.25*observation]:

observation (mV)  -68.248143
prediction (mV)   -71.021191
difference          2.773047
ratio               1.040632
score.sort_key      0.671194

observation (mV)  -68.248143
prediction (mV)   -63.408085
difference          4.840058
ratio               0.929081
score.sort_key      0.458732

The source code for generating these results is here:

https://github.com/russelljjarvis/neuronunit/edit/dev/unit_test/data_driven_software_tests/test_tests.py

    def try_hard_coded0(self):
        # First hard-coded parameter set (values stored as strings).
        params0 = {'C': '0.000107322241995',
                   'a': '0.177922330376',
                   'b': '-5e-09',
                   'c': '-59.5280130394',
                   'd': '0.153178745992',
                   'k': '0.000879131572692',
                   'v0': '-73.3255584633',
                   'vpeak': '34.5214177196',
                   'vr': '-71.0211905343',
                   'vt': '-46.6016774842'}
        # rheobase = {'value': array(131.34765625) * pA}
        return params0

    def try_hard_coded1(self):
        # Second hard-coded parameter set (values stored as strings).
        params1 = {'C': '0.000106983591242',
                   'a': '0.480856799107',
                   'b': '-5e-09',
                   'c': '-57.4022276619',
                   'd': '0.0818117582621',
                   'k': '0.00114004749537',
                   'v0': '-58.4899756601',
                   'vpeak': '36.6769758895',
                   'vr': '-63.4080852004',
                   'vt': '-44.1074682812'}
        # rheobase = {'value': array(106.4453125) * pA}
        return params1
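
As a side note, the parameter values in these dicts are stored as strings; a minimal sketch (the helper is illustrative, not a neuronunit function) of converting such a dict to floats before handing it to a model:

    def to_numeric(params):
        # Convert a string-valued parameter dict like params0/params1 to floats.
        return {key: float(value) for key, value in params.items()}

    # e.g. to_numeric({'vr': '-71.0211905343'}) -> {'vr': -71.0211905343}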

Taking the log of the absolute differences helps rescale errors that are within the right power of 10 (i.e. inside the interval [0.10*observation, 10*observation]) but are still bad fits.

In [2]: import numpy as np

In [3]: np.log( 7.494721e+07)
Out[3]: 18.132294557003476

In [4]: np.log(8.146624e-11)
Out[4]: -23.230832414628807

In [5]: np.log( 2.773047)
Out[5]: 1.0199467156425481

In [6]: np.log( 4.840058)
Out[6]: 1.5769267041278134
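
To make the comparison concrete, here is a short sketch (my own illustration, not neuronunit code) contrasting the log of the absolute differences above with the log of the prediction/observation ratio; the latter is unit-free, so it does not favour the CapacitanceTest merely because values in farads are numerically tiny:

    import numpy as np

    # (observation, prediction) pairs taken from the tables above
    cases = {
        'InputResistanceTest (ohm)':     (1.206721e+08, 4.572486e+07),
        'CapacitanceTest (F)':           (1.505842e-10, 6.911793e-11),
        'RestingPotentialTest, model 0': (-68.248143, -71.021191),
        'RestingPotentialTest, model 1': (-68.248143, -63.408085),
    }

    for name, (obs, pred) in cases.items():
        log_abs_diff = np.log(abs(obs - pred))   # depends on the units of the quantity
        log_ratio = abs(np.log(pred / obs))      # unit-free; 0 means a perfect fit
        print('%-32s %8.3f %8.3f' % (name, log_abs_diff, log_ratio))
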
rgerkin commented 7 years ago

@russelljjarvis Remember that we are dealing with observations that have a mean and a standard error, and sometimes this standard error is very large. So the ratio you get might be 0.5, but the prediction might still be well within the confidence interval, because the literature reports values that vary by more than a factor of 2. So in terms of the resulting model "being realistic", it should not be a concern as long as abs(Z) < 2.
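
For concreteness, and assuming the reported score is a plain Z-score, Z = (prediction - mean) / std, the InputResistanceTest numbers above imply a very large standard deviation on the observation, which is why a ratio of roughly 0.38 still gives |Z| < 2 (a back-of-the-envelope sketch, not neuronunit code):

    # InputResistanceTest numbers from the table above
    obs_mean   = 1.206721e+08   # ohm
    prediction = 4.572486e+07   # ohm
    reported_z = -0.97

    # Assuming Z = (prediction - obs_mean) / obs_std, the implied spread is:
    obs_std = (prediction - obs_mean) / reported_z
    print(obs_std)                                    # ~7.7e+07 ohm, more than half the mean

    # With that spread, the prediction is well within two standard deviations:
    print(abs(prediction - obs_mean) / obs_std < 2)   # True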

However, given that there are not so many tests to optimize on, I would think it should be possible to drive the model towards Z = 0 (and a ratio of 1) on many of them at once. So one thing to check is whether you can get something close to Z = 0 (and a ratio of 1) when you run suites containing only one test at a time. Then also check the exhaustive search and make sure that the optimum you are getting when running many tests in a suite is reasonable. In other words, make sure that you cannot manually change the parameter values in any way and improve upon the fit, i.e. drive one or more Z-scores towards 0 while not pushing any others away from 0.
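
A rough sketch of that single-test check; `tests`, `optimize`, and `model_from` are stand-ins for whatever the surrounding optimization code provides (they are not neuronunit APIs), while `judge()` and `sort_key` are the sciunit features already discussed in this thread:

    # Run the optimizer against one test at a time and see whether the resulting
    # model can drive that single test close to its best possible score.
    for test in tests:
        best_params = optimize([test])      # hypothetical: optimize against a one-test suite
        model = model_from(best_params)     # hypothetical: build a model from a parameter dict
        score = test.judge(model)           # sciunit: judge the model against this test
        print(test.name, score.sort_key)    # compare with the multi-test optimum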

russelljjarvis commented 7 years ago

I can see that it could be that optimizing over 10 criteria at once involves too many conflicting priorities.

rgerkin commented 6 years ago

@russelljjarvis Any other thoughts on this matter? Or shall I close this?

rgerkin commented 6 years ago

@russelljjarvis Any questions or comments?