rohancme / wolf

Find problems in test coverage, add and improve it.

Using PITest Results to reduce the number of Randoop tests #19

Closed rohancme closed 8 years ago

rohancme commented 8 years ago

1) Get the overall mutation coverage score. 2) Drop a random test. 3) Measure mutation coverage again; if the score drops, keep the test, otherwise leave it out. 4) Count how many tests are left. (edited)

Ideally, you would see something like: we generated 11,770 random tests, and this keeps 173 of them.
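A minimal sketch of that loop in Python, assuming a hypothetical `run_pitest()` helper that shells out to PITest and returns the suite's mutation score; neither the helper nor the suite representation comes from this repo:

```python
import random

def run_pitest(tests):
    """Hypothetical helper: run PITest against the given collection of
    tests and return the mutation score (0.0-1.0). In practice this
    shells out to a Maven/Gradle PITest run and takes several minutes."""
    raise NotImplementedError

def reduce_suite(tests):
    """Greedily discard tests that don't contribute to the mutation score."""
    baseline = run_pitest(tests)
    kept = list(tests)
    for test in random.sample(kept, len(kept)):  # visit tests in random order
        candidate = [t for t in kept if t != test]
        if run_pitest(candidate) < baseline:
            continue      # score dropped: this test kills a unique mutant, keep it
        kept = candidate  # score unchanged: the test is redundant, discard it
    return kept
```

Each iteration costs a full PITest run, which is why the results below start at class granularity (dropping whole generated test classes) rather than individual tests.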

rohancme commented 8 years ago

@chrisparnin. It looks like Randoop isn't really doing anything useful on the Rome repo after a certain point: there's no increase in either line or mutation coverage after the first set of generated tests. Randoop generates test classes with 300-500 tests in each class, and these are the results I got by working at the class level. I'm sure that if I went down to the individual test level, these numbers could be reduced further.

Both promising (in terms of being able to cut down the number of tests) and disappointing (in terms of Randoop's contribution to line/mutation coverage improvement):

Here's the line + mutation coverage report before adding the Randoop tests:

[screenshot: rome_io_before]

After adding ~500 of the Randoop tests [line coverage up by 5%, mutation coverage up by 6%]:

[screenshot: rome_io_after_reduction]

After adding the remaining tests, bringing the total to ~12,000 [no change whatsoever]:

[screenshot: rome_io_after]

chrisparnin commented 8 years ago

Actually, I think this is an exciting result. We already know Randoop is just taking shots in the dark. Basically, it was able to get a 5% line coverage increase. Cool.

But 12,000 is a ridiculous number of test cases to have to keep around.

What's good is that your idea of using line coverage and mutation coverage actually works, and the simple implementation approach worked. It would be interesting to explore failures/passes as a metric.

But here's where we can gain something useful. Beyond reducing the number of test cases, it would be very interesting to know what types of random test cases have the highest chance of generating more coverage (or errors). That way we know empirically what is worth generating randomly and what is not. For example, the "obj.equals(obj)" test might turn out to have the best defect discovery rate. That would be a useful result for the testing community, who could use it to build better testing tools.
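A sketch of how one might measure that, under assumed inputs (the classification patterns below and the `run_pitest` helper from the earlier sketch are illustrative, not part of this tool): bucket each generated test by the kind of assertion it makes, then rank buckets by how much mutation coverage the suite loses when a bucket is removed.

```python
import re

# Illustrative patterns for bucketing Randoop-generated tests by what they assert.
PATTERNS = {
    "reflexive_equals": re.compile(r"assertTrue\(\s*(\w+)\.equals\(\1\)"),
    "hash_code":        re.compile(r"\.hashCode\(\)"),
    "null_check":       re.compile(r"assertNotNull"),
}

def classify(test_source):
    """Return the first pattern a test's source matches, else 'other'."""
    for name, pattern in PATTERNS.items():
        if pattern.search(test_source):
            return name
    return "other"

def rank_buckets(tests, run_pitest):
    """Rank test buckets by the mutation coverage lost when each is removed.
    `tests` maps test id -> source; `run_pitest` takes a set of test ids
    and returns a mutation score (same hypothetical helper as above)."""
    buckets = {}
    for test_id, source in tests.items():
        buckets.setdefault(classify(source), set()).add(test_id)
    baseline = run_pitest(set(tests))
    contribution = {
        name: baseline - run_pitest(set(tests) - members)
        for name, members in buckets.items()
    }
    return sorted(contribution.items(), key=lambda kv: kv[1], reverse=True)
```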

rohancme commented 8 years ago

Currently working by cutting down the number of test files. The number of tests per file can be adjusted with a command-line option for the tool. The fewer tests per file, the longer the reduction process takes, since I'm essentially running PITest (a 3-5 minute process) to evaluate each of the test files.
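Back-of-the-envelope for that trade-off, using the ~4-minute PITest figure from above (the file counts are assumptions, not measurements):

```python
def reduction_hours(total_tests, tests_per_file, pitest_minutes=4):
    """One PITest run per candidate test file, at roughly 3-5 minutes each."""
    files = total_tests // tests_per_file
    return files * pitest_minutes / 60

print(reduction_hours(12_000, 500))  # ~24 files  -> ~1.6 hours
print(reduction_hours(12_000, 50))   # ~240 files -> ~16 hours
```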