snim2 opened this issue 9 years ago
Yes, I agree, multitime needs unit tests, particularly in the stats part. I would prefer to keep everything in C, as otherwise the setup is likely to be unapproachable for anyone else who wants to work on multitime.
[This message also makes me remember that at some point I wanted "-v" to print out the timings of the last command run, so that you could see how long one run took.]
In that case, do you know of a good quickcheck implementation in C?
I'm afraid I have no idea :/
Hmm. I'm not sure how best to move forward here. I'm reluctant to start work on Issue #5 before being certain that the work with arithmetic means is correct. I'm happy to use a C unit testing library (e.g. http://check.sourceforge.net/ ?) but there will be quite a number of unit tests to generate. With the confidence intervals it's difficult to see how best to create the test cases without duplicating the code in format.c.
I'll give it some more thought, but please shout if you have some good ideas!
I agree -- this is a decision that needs to be done right. I think looking at what other projects do is probably a good start.
So there is a basic .travis.yml file in the travis-ci branch. There is only one dummy unit test, so it will be trivial to change to a different unit test framework, etc. I was expecting to see a build appear here, but I think the repository is not set up to work with Travis. There should be something in the repo Settings that can turn on the Travis webhooks; then it should be possible for you (but probably not me) to just go to the Travis page and restart a build. Then we'll know if the Travis config I wrote works or not.
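For reference, the rough shape of such a config (the package names and commands here are assumptions for illustration, not necessarily what the branch actually contains):

```yaml
# Minimal sketch of a Travis CI config for a C project.
language: c
install:
  - sudo apt-get update -qq
  - sudo apt-get install -y check   # C unit testing framework
script:
  - make && make test
```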
I think I've enabled the travis side of things, but it seems you'll have to push to the branch to trigger a build?
In GH, if you open the Settings tab for the repository, click on Service Hooks, and scroll down to click on Travis, you should see a Test Hook button. This will cause the latest build to be re-run. However, it may be that the .travis.yml file has to be in the master branch, or we have to create a PR for the travis-ci branch...
Aha! OK, I initiated that. It failed because there isn't a "make test", but at least the Travis integration seems to be working OK.
Aha! I wrote a very basic Makefile without realising that it would be caught by .gitignore.
Incidentally, I've never used autoconf / autotools, so I'm not sure how to go from a basic Makefile to something that fits in with your current build system.
I'll investigate why the version of check that Travis uses doesn't include checkmk.
OK, Travis is all sorted, so work on the tricky stuff can start.
Excellent!
I've done some basic refactoring which has been pushed to the travis-ci branch. This may not be the best way to split everything up, so you probably want to check it over. The more mathematical code that can now be easily tested is in statistics.c, so format.c really does deal with formatting and little else (except that I have only touched format_other(Conf)). The resulting code is a little shorter, if you don't count the license headers.
At this point I think testing can begin...
At least at first glance, this looks roughly reasonable. A few comments.
We should generate the Makefile from a Makefile.in so that variables are expanded (at the moment, it won't build on my machine because vars aren't set). "ifdef" in a makefile is a non-portable gnuism however, so we should probably avoid that. The configure script should also probably check for the existence of "checkmk", otherwise you get an inscrutable error.
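A hedged sketch of the Makefile.in idea, for anyone unfamiliar with it (the variable names and rules here are illustrative assumptions, not multitime's actual build files):

```makefile
# Makefile.in -- ./configure substitutes the @...@ placeholders
# and writes out the concrete Makefile, so variables like CC are
# always expanded and no GNU-only "ifdef" is needed.
CC = @CC@
CFLAGS = @CFLAGS@
CHECKMK = @CHECKMK@

# checkmk turns a .check test description into a C file; configure
# should have verified that the checkmk binary exists before this
# rule can ever run, so failures are not inscrutable.
test_stats.c: test_stats.check
	$(CHECKMK) test_stats.check > test_stats.c
```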
OK, I will read up on Makefile.in and try and figure something out.
IIRC the only variable that isn't defined within the Makefile is TRAVIS. travis-ci is based on an old Ubuntu (12.something), so the version of the check package that can be installed via apt-get in the .travis.yml file is too old to include the checkmk utility. So, this is the reason for having the ifdef and a lot of extra fuss in the .travis.yml file to install a newer version of check.
It should be possible to change the call to make in the .travis.yml and remove the ifdef in the Makefile. This is how the tests are being run on Travis:
LD_LIBRARY_PATH=$HOME/checksrc/lib make test
but for some reason I couldn't get that syntax to work when calling make, I think because the CFLAGS line has a space in it. You can probably tell that I don't write C very often.
I'm not sure how best to resolve this; one option would be to remove the test_stats.check file altogether and just include the C file that checkmk generates. I wouldn't normally commit a generated file to a repository, but since that would reduce the number of dependencies, perhaps this would be cleaner?
If you use the main Makefile.in for inspiration, you'll probably be more or less OK.
It might be easier to move the "is this on Travis" logic to the configure script?
On #11 @tanzimhoque reported a bug with this data:
        Mean                  Std.Dev.  Min      Median   Max
real    289.409+/-988.4848    9.596     240.783  290.874  296.314
user    287.989+/-904.7777    9.181     240.263  290.186  293.038
sys       0.637+/-9.5890      0.945       0.028    0.108    3.028
(sorry about the formatting). With a -n of 100, the CI here should be around 1.88, so this is clearly erroneous.
I have made some progress on the testing side, but haven't pushed it yet. I've got some basic test cases and will add something based on the data here.
What I haven't yet done is dealt with test data that includes difficult edge cases like NaN, INF and values close to float min/max. @ltratt do you have any opinions on how best to deal with those edge cases in the code?
I think that one can't generate inputs that are NaN or INF (well, maybe INF, but not NaN), so the code in question could simply assert that these are invalid inputs?
Sounds good to me.
I had a look at the code in the format.c file, and it seems the reason for the massive confidence intervals I saw was that the standard deviation being used in the CI calculations is incorrect.
When the results are being printed out the following is done to calculate the standard deviation: sqrt(real_stddev / conf->num_runs)
This should be done before the CIs are calculated.
I've noticed this has already been fixed in the travis-ci branch, but it is still broken in master; I can submit an intermediate fix if needed.
Matt H
Many thanks @Ascension1231, that explains why my testing branch didn't find any actual errors :( At least now we know that it is safer to merge when it's finished!
Yes, well spotted Matt!
@snim2, in response to https://github.com/ltratt/multitime/issues/9#issuecomment-90220650: I have had reasonable success doing QC-style testing in C with https://github.com/mcandre/qc. I am not a C programmer in general, though, so most of my work has been somewhat naïvely testing others' code as a black box.
There is a bit of an impedance mismatch, but it works more or less well. Coupled with something like AFL (http://lcamtuf.coredump.cx/afl/) it has given me a fair amount of high-bug-rate-type code testing "for free" on C programs. YMMV of course.
Matthew Howell, one of my undergraduate students, has reported what might be a bug in multitime. The bug report from Matthew was that when running very short programs at reasonable values of n (e.g. n=30), the resulting confidence intervals were very large, larger than the mean of the wall-clock time. This was on a very short-running program.

This might not be a bug; it may be that the CI is correct, but without being able to see the times of the individual runs it is difficult to tell. On the other hand, there may be an error in the CI calculations, or it may be that my code is not using the t-value and z-value LUTs correctly.

It is difficult to see how to track this potential bug down in such a way that we can use the work done now for regression testing in the future, without incurring a huge overhead in adding many unit tests now, but here is one (slightly eccentric) idea:

1. Refactor void format_other(Conf) into smaller units that can be tested separately. This would create separate functions for calculating means, CIs, etc.
2. Create python-multitime, which uses cffi to expose the functions created in Step 1 to Python.
3. Compare the results from multitime to those generated by scipy or similar, as a ground truth.
4. Add a .travis.yml file which uses python-multitime from Step 2 and Python quickcheck to test the functions from Step 1 automatically.

This is a bit messy, and it would probably be neater to do everything in C...