ltratt / multitime

Time command execution over multiple executions
http://tratt.net/laurie/src/multitime/
MIT License

Testing the calculations in format.c automatically #9

Open snim2 opened 9 years ago

snim2 commented 9 years ago

Matthew Howell, one of my undergraduate students, has reported what might be a bug in multitime: when running a very short-running program at a reasonable number of iterations (e.g. n=30), the resulting confidence intervals were very large, larger than the mean of the wall-clock time.

This might not be a bug; the CI may well be correct, but without being able to see the times of the individual runs it is difficult to tell. On the other hand, there may be an error in the CI calculations, or my code may not be using the t-value and z-value LUTs correctly.

It is difficult to see how to track this potential bug down in a way that lets us reuse the work for regression testing in the future, without incurring the huge overhead of adding many unit tests now, but here is one (slightly eccentric) idea:

  1. Split the calculations in format_other(Conf) into smaller units that can be tested separately: separate functions for calculating means, CIs, etc. (see the sketch below).
  2. Create a separate repository called python-multitime which uses cffi to expose the functions created in Step 1 to Python.
  3. In this repository (?) create a number of assertions against the functions from Step 1 that can be tested using a Python quickcheck library. The advantage of doing this is that we can compare the standard deviations, confidence intervals, etc. from multitime against those generated by scipy or similar, as a ground truth.
  4. In this repository create a .travis.yml file which uses python-multitime from Step 2 and the quickcheck assertions to test the functions from Step 1 automatically.

This is a bit messy, and it would probably be neater to do everything in C...
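
To make Step 1 concrete, here is a rough sketch of the kind of functions I would split out (names and signatures are illustrative, not what format.c currently contains):

    #include <math.h>

    /* Illustrative only: the real functions would read timings out of
       multitime's internal structures rather than a plain array. */
    double mean(const double *vals, int n) {
        double sum = 0;
        for (int i = 0; i < n; i++)
            sum += vals[i];
        return sum / n;
    }

    double stddev(const double *vals, int n, double mn) {
        double sum_sq = 0;
        for (int i = 0; i < n; i++)
            sum_sq += (vals[i] - mn) * (vals[i] - mn);
        return sqrt(sum_sq / n);
    }

    /* Half-width of a confidence interval, given a z or t value
       taken from the LUTs. */
    double conf_interval(double sd, int n, double z) {
        return z * sd / sqrt(n);
    }

Each of these would then be trivial to call both from format_other and, more importantly, from a test harness.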

ltratt commented 9 years ago

Yes, I agree, multitime needs unit tests, particularly in the stats part. I would prefer to keep everything in C, as otherwise the setup is likely to be unapproachable for anyone else who wants to work on multitime.

[This message also makes me remember that at some point I wanted "-v" to print out the timings of the last command run, so that you could see how long one run took.]

snim2 commented 9 years ago

In that case, do you know of a good quickcheck implementation in C?

ltratt commented 9 years ago

I'm afraid I have no idea :/

snim2 commented 9 years ago

Hmm. I'm not sure how best to move forward here. I'm reluctant to start work on Issue #5 before being certain that the work with arithmetic means is correct. I'm happy to use a C unit testing library (e.g. http://check.sourceforge.net/), but there will be quite a number of unit tests to generate, and with the confidence intervals it's difficult to see how to create the test cases without duplicating the code in format.c.
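
For what it's worth, a minimal test with check might look something like this (mean() is defined inline only to keep the sketch self-contained; in practice it would be linked in from the stats code once it is split out of format.c):

    #include <check.h>
    #include <math.h>

    /* Inline here only so the sketch compiles on its own; in reality
       this would come from the refactored stats code. */
    static double mean(const double *vals, int n) {
        double sum = 0;
        for (int i = 0; i < n; i++)
            sum += vals[i];
        return sum / n;
    }

    START_TEST(test_mean_simple)
    {
        double vals[] = {1.0, 2.0, 3.0};
        ck_assert(fabs(mean(vals, 3) - 2.0) < 1e-9);
    }
    END_TEST

    int main(void)
    {
        Suite *s = suite_create("stats");
        TCase *tc = tcase_create("core");
        tcase_add_test(tc, test_mean_simple);
        suite_add_tcase(s, tc);

        SRunner *sr = srunner_create(s);
        srunner_run_all(sr, CK_NORMAL);
        int failed = srunner_ntests_failed(sr);
        srunner_free(sr);
        return failed == 0 ? 0 : 1;
    }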

I'll give it some more thought, but please shout if you have some good ideas!

ltratt commented 9 years ago

I agree -- this is a decision that needs to be done right. I think looking at what other projects do is probably a good start.

snim2 commented 9 years ago

So there is a basic .travis.yml file in the travis-ci branch. There is only one dummy unit test, so it will be trivial to switch to a different unit test framework, etc. I was expecting to see a build appear here, but I think the repository is not set up to work with Travis. There should be something in the repo Settings that can turn on the Travis webhooks; then it should be possible for you (but probably not me) to go to the Travis page and restart a build. Then we'll know whether the Travis config I wrote works.

ltratt commented 9 years ago

I think I've enabled the travis side of things, but it seems you'll have to push to the branch to trigger a build?

snim2 commented 9 years ago

In GH, if you open the Settings tab for the repository, click on Service Hooks, and scroll down to click on Travis, you should see a Test Hook button. This will cause the latest build to be re-run. However, it may be that the .travis.yml file has to be in the master branch, or we may have to create a PR for the travis-ci branch...

ltratt commented 9 years ago

Aha! OK, I initiated that. It failed because there isn't a "make test", but at least the Travis integration seems to be working OK.

snim2 commented 9 years ago

Aha! I wrote a very basic Makefile without realising that it would be caught by .gitignore.

Incidentally, I've never used autoconf / autotools, so I'm not sure how to go from a basic Makefile to something that fits in with your current build system.

I'll investigate why the version of check that Travis uses doesn't include checkmk.

snim2 commented 9 years ago

OK, Travis is all sorted, so work on the tricky stuff can start.

ltratt commented 9 years ago

Excellent!

snim2 commented 9 years ago

I've done some basic refactoring, which has been pushed to the travis-ci branch. This may not be the best way to split everything up, so you probably want to check it over. The more mathematical code, which can now be easily tested, is in statistics.c, so format.c really does deal with formatting and little else (though note that I have only touched format_other(Conf)). The resulting code is a little shorter, if you don't count the license headers.

At this point I think testing can begin...

ltratt commented 9 years ago

At least at first glance, this looks roughly reasonable. A few comments.

We should generate the Makefile from a Makefile.in so that variables are expanded (at the moment, it won't build on my machine because the vars aren't set). However, "ifdef" in a Makefile is a non-portable GNUism, so we should probably avoid it. The configure script should probably also check for the existence of "checkmk", otherwise you get an inscrutable error.

snim2 commented 9 years ago

OK, I will read up on Makefile.in and try and figure something out.

IIRC, the only variable that isn't defined within the Makefile is TRAVIS. Travis CI is based on an old Ubuntu (12.something), so the version of the check package that can be installed via apt-get in the .travis.yml file is too old to include the checkmk utility. That is the reason for the ifdef, and for a lot of extra fuss in the .travis.yml file to install a newer version of check.

It should be possible to change the call to make in the .travis.yml and remove the ifdef in the Makefile. This is how the tests are being run on Travis:

LD_LIBRARY_PATH=$HOME/checksrc/lib make test

but for some reason I couldn't get that syntax to work when calling make, I think because the CFLAGS line has a space in it. You can probably tell that I don't write C very often.

I'm not sure how best to resolve this; one option would be to remove the test_stats.check file altogether and just include the C file that checkmk generates. I wouldn't normally commit a generated file to a repository, but since that reduces the number of dependencies perhaps this would be cleaner?

ltratt commented 9 years ago

If you use the main Makefile.in for inspiration, you'll probably be more or less OK.

It might be easier to move the "is this on Travis" logic to the configure script?

snim2 commented 9 years ago

On #11 @tanzimhoque reported a bug with this data:

            Mean                 Std.Dev.   Min       Median    Max
    real    289.409+/-988.4848   9.596      240.783   290.874   296.314
    user    287.989+/-904.7777   9.181      240.263   290.186   293.038
    sys     0.637+/-9.5890       0.945      0.028     0.108     3.028

With -n of 100 the CI here should be around 1.88, so these values are clearly erroneous.
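
For reference, assuming a normal approximation with z = 1.96 for a 95% CI, the real row should give roughly:

    CI = z * s / sqrt(n) = 1.96 * 9.596 / sqrt(100) ≈ 1.88

which is nowhere near the 988.4848 being printed.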

I have made some progress on the testing side, but haven't pushed it yet. I've got some basic test cases and will add something based on the data here.

What I haven't yet done is dealt with test data that includes difficult edge cases like NaN, INF and values close to float min/max. @ltratt do you have any opinions on how best to deal with those edge cases in the code?

ltratt commented 9 years ago

I think that one can't generate inputs that are NaN or INF (well, maybe INF, but not NaN), so the code in question could simply assert that these are invalid inputs?
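
Something like this at the top of each stats function would do it (check_inputs is just a sketch, not existing code):

    #include <assert.h>
    #include <math.h>

    /* Sketch: treat NaN/INF as programming errors at the boundary of
       each stats function, rather than as values to be handled. */
    static void check_inputs(const double *vals, int n) {
        assert(n > 0);
        for (int i = 0; i < n; i++)
            assert(isfinite(vals[i])); /* rejects NaN and +/-INF */
    }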

snim2 commented 9 years ago

Sounds good to me.

Mhowell494 commented 9 years ago

I had a look at the code in the format.c file and it seems the reason for the massive confidence intervals I saw is that the standard deviation being used in the CI calculations is incorrect.

When the results are being printed out the following is done to calculate the standard deviation: sqrt(real_stddev / conf->num_runs)

This should be done before the CIs are calculated.

I've noticed this has already been fixed in the travis-ci branch, but it is still broken in master; I can submit an intermediate fix if needed.
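
In other words, the order should be something like this (identifiers are illustrative, not the exact ones in format.c):

    #include <math.h>

    /* sum_sq_devs stands for what master accumulates in real_stddev,
       i.e. the sum of squared deviations from the mean. The standard
       deviation must be derived from it *before* the CI is computed,
       not at print time. */
    static double conf_interval(double sum_sq_devs, int num_runs, double z) {
        double sd = sqrt(sum_sq_devs / num_runs); /* must happen first */
        return z * sd / sqrt(num_runs);
    }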

Matt H

snim2 commented 9 years ago

Many thanks @Ascension1231, that explains why my testing branch didn't find any actual errors :( At least now we know that it is safer to merge when it's finished!

ltratt commented 9 years ago

Yes, well spotted Matt!

zbentley commented 6 years ago

@snim2, in response to https://github.com/ltratt/multitime/issues/9#issuecomment-90220650: I have had reasonable success doing QC-style testing in C with https://github.com/mcandre/qc. I am not a C programmer in general, though, so most of my work has been somewhat naïvely testing others' code as a black box.

There is a bit of an impedance mismatch, but it works more or less well. Coupled with something like AFL (http://lcamtuf.coredump.cx/afl/), it has given me a fair amount of bug-finding "for free" on C programs. YMMV of course.