rlabbe / filterpy

Python Kalman filtering and optimal estimation library. Implements Kalman filter, particle filter, Extended Kalman filter, Unscented Kalman filter, g-h (alpha-beta), least squares, H Infinity, smoothers, and more. Has companion book 'Kalman and Bayesian Filters in Python'.
MIT License

Unit testing needs a lot of work #24

Open rlabbe opened 8 years ago

rlabbe commented 8 years ago

I mostly use graphs to 'eyeball' how things are working. There are well-known performance bounds, such as Cramer-Rao, for these filters. I need to think through what I want regarding unit testing and turn it into a PR. FilterPy also probably needs supporting functions - if I can turn a Cramer-Rao bound computation into a stand-alone function, it should be part of the general library, for example.
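For the linear-Gaussian case that function could be quite small. Here is a minimal sketch of what such a helper might look like (the name linear_crlb and its signature are hypothetical, not part of FilterPy), using the standard recursion for the posterior Cramer-Rao information matrix:

```python
import numpy as np
from numpy.linalg import inv

def linear_crlb(F, H, Q, R, P0, steps):
    """Hypothetical helper: recursive Cramer-Rao lower bound for a
    linear-Gaussian state-space model. Returns one covariance lower
    bound per time step; a consistent filter's error covariance
    should not fall below these."""
    J = inv(P0)                 # information matrix at step 0
    bounds = [inv(J)]
    for _ in range(steps):
        # posterior CRLB recursion for the linear-Gaussian case
        J = inv(Q + F @ inv(J) @ F.T) + H.T @ inv(R) @ H
        bounds.append(inv(J))
    return bounds
```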

There are better bounds than Cramer-Rao, but they can be a bear to compute. Section 2.7 of Challa covers this to some extent.

It occurs to me that I don't test things like the immutability of self.F. What if I force an exception - is everything still in a standard state? Etc. All parameter inputs need to be tested. And so on.
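As a rough illustration of what that kind of test could look like with pytest (a sketch only; the specific guarantees it asserts are exactly what still needs to be decided):

```python
import numpy as np
from filterpy.kalman import KalmanFilter

def test_predict_update_do_not_mutate_model_matrices():
    # sketch: predict()/update() should leave the user-supplied
    # model matrices F, H, Q, R untouched
    kf = KalmanFilter(dim_x=2, dim_z=1)
    kf.F = np.array([[1., 1.],
                     [0., 1.]])
    kf.H = np.array([[1., 0.]])
    F_before = kf.F.copy()
    H_before = kf.H.copy()
    Q_before = kf.Q.copy()
    R_before = kf.R.copy()

    kf.predict()
    kf.update(np.array([[1.]]))

    assert np.array_equal(kf.F, F_before)
    assert np.array_equal(kf.H, H_before)
    assert np.array_equal(kf.Q, Q_before)
    assert np.array_equal(kf.R, R_before)
```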

Anyway, it is time to think through a solid testing plan, set some reasonable standards for what tests must be done for any functionality, and then execute on at least some of it.

This is important - this was designed largely as a pedagogical tool, but Ph.D. students are using it in their research and some companies are using it for their work - there are real-world consequences to bugs.

ggsato commented 6 years ago

Hi, thanks for this very educational, but very practical, tool.

I have been testing it for a couple of months as a key tracking component of our new computer vision vehicle counting system, and I have decided to use it for our product.

So, I would like to do something to improve filterpy. And perhaps, this is a good starting point.

Anyway, it is time to think through a solid testing plan, set some reasonable standards for what tests must be done for any functionality, and then execute on at least some of it.

Did you get any testing plan?

rlabbe commented 6 years ago

Well, I use py.test for testing. Every module has a test folder, with quite a few tests written already.

However, as mentioned a lot of tests are visual in nature (run a filter, plot against measurements), which of course is useless for automated testing.

I still stand by what I wrote above - we need to use statistical measures to detect if the filters are performing correctly or not. And I strongly suspect that whatever measure is used should be part of the core library, since if it is useful to the library, it will be equally useful to users of the library to test the performance of their filters.

Then there are simpler kinds of tests - does the code work if passed a list instead of an array (if it should)? If it works with a list, does it work with a tuple? Does it properly raise an exception if passed an array of the wrong shape? Are the outputs and object variables shaped correctly? Stuff like that.
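Those kinds of tests translate fairly directly into pytest. A sketch, assuming a scalar measurement (whether every one of these input types should be accepted is exactly the design decision such a test would pin down):

```python
import numpy as np
import pytest
from filterpy.kalman import KalmanFilter

def make_filter():
    kf = KalmanFilter(dim_x=2, dim_z=1)
    kf.F = np.array([[1., 1.],
                     [0., 1.]])
    kf.H = np.array([[1., 0.]])
    return kf

@pytest.mark.parametrize("z", [[1.0], (1.0,), np.array([1.0]), np.array([[1.0]])])
def test_update_accepts_equivalent_measurement_types(z):
    # list, tuple, 1-D array and 2-D array all describe the same scalar
    # measurement, so the update should succeed and the output shapes agree
    kf = make_filter()
    kf.predict()
    kf.update(z)
    assert kf.x.shape == (2, 1)
    assert kf.P.shape == (2, 2)
```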

As an aside, I'm curious about your use. As a pedagogical tool, I chose to write everything in Python. For performance, cython would be a better choice, but then the code becomes more or less unreadable for most people. I use FilterPy in my professional work, but only for analysis and prototyping - everything that runs in production is pure C++.

In any case, I welcome a contribution. If you feel you can tackle the statistical component, by all means give it a go. However, there is a still-unwritten chapter in my book on performance measurement, so know that I'd be pretty heavy-handed in my opinions on code design and quality, since that work would become part of the external API of FilterPy. For tests I just ask that the code be reasonably PEP8 compliant, with NumPy style docstrings and coding conventions.

ggsato commented 6 years ago

Well, I use py.test for testing. Every module has a test folder, with quite a few tests written already.

Oops, I missed them. I have just finished running them successfully, both via py.test and some independently to see their plots. And now I think I understand those simple function tests and the visual ones.

I still stand by what I wrote above - we need to use statistical measures to detect if the filters are performing correctly or not.

I agree. filterpy deals with data, so depending on the type of data it is given, the tool behaves differently.

I imagine something like this: a series of test data points is generated from a known probability distribution, and noise from another one. The result would look like a joint probability distribution of the two? I am not a math guy, so I'm not sure. But if it does, I guess we can bound how much error the filter estimates could contain.

In the real world, for example on a snowy day, the video becomes noisy. But how much snow can a given system handle reliably? And when should it say, "Don't trust me anymore!"? I expect such a test method could answer those questions as well.

As an aside, I'm curious about your use. ... everything that runs in production is pure C++.

In short, the performance is good enough.

The major sensor input for tracking an object is a detection produced by deep learning. Our system has been running on servers for years, but today we are moving parts of the system to so-called edge devices. Those devices are based on NVIDIA's Jetson TX2, which comes with a 6-core CPU and a powerful GPU.

The performance goal of each edge device is to analyze video in real time, at roughly 25 fps ~ 30 fps. But even today's latest GPU takes around 100 ms to detect objects in a VGA-size BGR image, so a detection can only be run about every 5 frames. The filter fills the gaps in between with its estimates.

Of course, the faster the better, but under this design the filter does its job fast enough to keep pace.
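To make that concrete, here is a rough sketch of the predict-every-frame / update-every-Nth-frame pattern (the constant-velocity model and the detect() stand-in are assumptions for illustration, not our actual code):

```python
import numpy as np
from filterpy.kalman import KalmanFilter

def track(frames, detect, detect_every=5, dt=1.0/25):
    """Sketch: run predict() at the full frame rate and update() only on
    the frames where the (slow) deep-learning detector actually runs.
    `detect(frame)` is a stand-in that returns an (x, y) pixel position."""
    kf = KalmanFilter(dim_x=4, dim_z=2)           # state: [x, vx, y, vy]
    kf.F = np.array([[1, dt, 0,  0],
                     [0,  1, 0,  0],
                     [0,  0, 1, dt],
                     [0,  0, 0,  1]], dtype=float)
    kf.H = np.array([[1, 0, 0, 0],
                     [0, 0, 1, 0]], dtype=float)

    estimates = []
    for i, frame in enumerate(frames):
        kf.predict()                              # every frame (cheap)
        if i % detect_every == 0:                 # detector only every Nth frame
            kf.update(np.asarray(detect(frame)))  # fill the gaps with estimates
        estimates.append(kf.x.copy())
    return estimates
```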

If you feel you can tackle the statistical component, by all means give it a go.

OK. I am going to share our test ideas and results.

For tests I just ask that the code be reasonably PEP8 compliant, with NumPy style docstrings and coding conventions.

I see. I'm going to learn your conventions first, and then start to contribute.

ggsato commented 6 years ago

I imagine something like this: a series of test data points is generated from a known probability distribution, and noise from another one. The result would look like a joint probability distribution of the two?

I had a discussion with a math guy, and will start by focusing on Normal distribution use cases.

Suppose the test data is generated from a normal distribution T and the noise data from another normal distribution N. Then T+N is still a normal distribution, and it stays normal through the Kalman filter's continuous predict+update (prior -> posterior -> prior -> ...) cycle.
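That suggests a fairly direct automated test: simulate a track with Gaussian process noise, corrupt it with Gaussian measurement noise, run the filter, and check that the estimation error beats the raw measurement noise. A sketch, assuming a 1-D constant-velocity model (the noise levels and the pass criterion are illustrative):

```python
import numpy as np
from filterpy.kalman import KalmanFilter
from filterpy.common import Q_discrete_white_noise

def test_gaussian_track_estimation_error():
    rng = np.random.default_rng(0)
    dt, n = 1.0, 200
    q_std, r_std = 0.02, 1.0

    # simulate constant-velocity truth with small Gaussian process noise
    truth = np.zeros((n, 2))                      # columns: position, velocity
    truth[0] = [0.0, 1.0]
    for k in range(1, n):
        truth[k, 1] = truth[k-1, 1] + rng.normal(0.0, q_std)
        truth[k, 0] = truth[k-1, 0] + truth[k-1, 1] * dt
    zs = truth[:, 0] + rng.normal(0.0, r_std, n)  # noisy position measurements

    kf = KalmanFilter(dim_x=2, dim_z=1)
    kf.x = np.array([[0.], [1.]])
    kf.F = np.array([[1., dt],
                     [0., 1.]])
    kf.H = np.array([[1., 0.]])
    kf.R *= r_std**2
    kf.Q = Q_discrete_white_noise(dim=2, dt=dt, var=q_std**2)

    errors = []
    for k, z in enumerate(zs):
        kf.predict()
        kf.update(z)
        errors.append(kf.x[0, 0] - truth[k, 0])

    # the filtered position error should be well below the measurement noise
    assert np.std(errors) < r_std
```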

rlabbe commented 6 years ago

This handout from Paul Zarchan parallels my thinking on the topic:

http://iaac.technion.ac.il/workshops/2010/KFhandouts/LectKF24.pdf

In other words, we can compute the optimal performance of a filter, and then test a real world filter against that.

Also, Crassidis discusses, at a very high level, using the chi-square distribution to test the performance of a filter.
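One common way to operationalize that is the normalized estimation error squared (NEES): when the filter is consistent, the NEES is chi-square distributed with dim_x degrees of freedom, so the average over independent runs should fall inside a chi-square confidence interval. A sketch of the helpers such a test could use (the names and the default threshold are assumptions, not an existing FilterPy API):

```python
import numpy as np
from scipy.stats import chi2

def nees(x_true, x_est, P):
    """Normalized estimation error squared for one time step."""
    e = (np.asarray(x_true) - np.asarray(x_est)).reshape(-1, 1)
    return float(e.T @ np.linalg.inv(P) @ e)

def nees_is_consistent(nees_values, dim_x, alpha=0.05):
    """Chi-square consistency check: the average NEES over N independent
    samples should lie inside the two-sided chi-square interval
    [chi2.ppf(alpha/2, N*dim_x)/N, chi2.ppf(1-alpha/2, N*dim_x)/N]."""
    n = len(nees_values)
    lo = chi2.ppf(alpha / 2.0, n * dim_x) / n
    hi = chi2.ppf(1.0 - alpha / 2.0, n * dim_x) / n
    return lo <= np.mean(nees_values) <= hi
```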

ggsato commented 6 years ago

Thanks for clarifying this.

I wasn't able to completely follow the paper, but I understand that the CRLB is equal to the best (optimal) result with no process noise and an infinite initial P.

By the way, following your references, I started reading Huber's Robust Statistics, Second Edition, and learned that there are three important features of robustness.

Then the CRLB is about efficiency. And I guess an approach from the other direction, one with no measurement error, would also exist.

And I realized that I am interested in Stability.

But anyway, can we take these three as the features we should discuss here, I mean, as statistical tests?

vhartman commented 4 years ago

Did either of you ever start on something like what's mentioned above? I wasn't able to find anything so far.