venupec opened this issue 6 years ago (status: Open)
Thanks very much for reporting this. Did you manage to get any further with it? If you still think it's an IO speed problem that can be reasonably solved in hardware then I'm happy to close the ticket. But if you think there's a real problem here then we should investigate further.
I'm aware that merging the DBs can be quite slow and can use a fair bit of memory. Fixing that would probably require a fundamental change to the way the coverage DB is structured. Perhaps by using a real DB. But obviously that's a fair bit of work.
A queryable SQLite backend? Yes please! That would significantly lower the effort needed to mine the data, essentially opening it up.
One piece of advice (at the risk of premature optimization): start without indexes, and create them after the coverage is gathered but before you generate the report. Inserting without indexes is much faster than inserting with them, especially once the data set grows enough for the associated B-trees to grow deep.
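To make the pattern concrete, here is a minimal DBI/SQLite sketch of insert-first, index-later. The single statement table and the cover.db filename are made up for illustration; a real coverage backend would obviously need a richer schema:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use DBI;

# Hypothetical schema -- a real coverage DB would need more tables.
my $dbh = DBI->connect("dbi:SQLite:dbname=cover.db", "", "",
                       { RaiseError => 1, AutoCommit => 0 });

$dbh->do(q{
    CREATE TABLE IF NOT EXISTS statement (
        file TEXT    NOT NULL,
        line INTEGER NOT NULL,
        hits INTEGER NOT NULL
    )
});

# Sample rows standing in for data gathered during the coverage run.
my @rows = (
    [ 'lib/Foo.pm', 10, 3  ],
    [ 'lib/Foo.pm', 11, 0  ],
    [ 'lib/Bar.pm',  7, 12 ],
);

# Bulk-insert with no indexes in place, inside one transaction.
my $sth = $dbh->prepare(
    'INSERT INTO statement (file, line, hits) VALUES (?, ?, ?)');
$sth->execute(@$_) for @rows;
$dbh->commit;

# Build the index once, after loading but before reporting.
$dbh->do('CREATE INDEX idx_statement_file ON statement (file, line)');
$dbh->commit;
$dbh->disconnect;
```

Deferring the index means each insert skips the B-tree maintenance, and the index gets built once over the full data set instead.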
But looking at the code, this would be a very significant rewrite indeed, especially if we are to realize the full potential of such a change beyond the initial rationale (faster reports and merges).
I had some luck speeding up a Devel::Cover run for a large codebase by specifying JSON as the output format instead of Sereal. A parser for the cover output can read JSON much faster than the Sereal database, which matters for repositories with a large number of Perl files to cover. It's also much easier to write your own parser -- I wrote one in Golang and one in Rust for a code analysis tool at work.
I could probably rewrite and improve the Golang script pretty easily. I'll look into doing that for faster parsing of JSON-format runs.
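For anyone who wants to poke at the JSON runs from Perl in the meantime, here is a minimal sketch that decodes a single run file and dumps its top-level layout. The idea that each run is stored as one JSON document you can decode in a single pass is an assumption to verify against your own cover_db:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Assumption: with the JSON DB format, each run under cover_db/
# is a single JSON document that can be decoded in one pass.
my $file = shift @ARGV or die "usage: $0 <run-file.json>\n";

open my $fh, '<:raw', $file or die "open $file: $!\n";
my $json = do { local $/; <$fh> };
close $fh;

my $run = decode_json($json);
die "expected a JSON object at the top level\n" unless ref $run eq 'HASH';

# Dump the top-level keys and their types to see what a run contains.
for my $key (sort keys %$run) {
    my $type = ref $run->{$key} || 'scalar';
    print "$key => $type\n";
}
```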
We have three options at the moment for storing the coverage DB: Sereal, JSON and Data::Dumper. I had assumed Sereal to be the fastest and most efficient format, and so used it by default when available. Are we saying that it's faster for Devel::Cover to use JSON (which module?) than Sereal?
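One way to answer that empirically is to round-trip a structure of roughly the right shape through both serializers and compare decode times. A rough Benchmark sketch, with a made-up data shape standing in for real coverage data:

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use JSON::PP ();
use Sereal::Encoder ();
use Sereal::Decoder ();

# A made-up structure vaguely shaped like per-file coverage data.
my %data = map {
    ("lib/Module$_.pm" => { statement => [ map { int rand 100 } 1 .. 2000 ] })
} 1 .. 50;

my $json_blob   = JSON::PP->new->encode(\%data);
my $sereal_blob = Sereal::Encoder->new->encode(\%data);
my $decoder     = Sereal::Decoder->new;

# Run each decoder for at least 5 CPU seconds and compare rates.
cmpthese(-5, {
    json_pp => sub { JSON::PP->new->decode($json_blob) },
    sereal  => sub { $decoder->decode($sereal_blob) },
});
```

The "which module?" question matters here: JSON::PP is pure Perl, while JSON::XS or Cpanel::JSON::XS are compiled and much faster, so swapping the json_pp entry above for one of those would likely change the result considerably.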
UPDATE:
It is probably an I/O issue reading all the digest files. I plan to acquire more powerful machines and test.
The cover report spends most of its time in Devel::Cover::DB::cover(); the @runs loop accounts for almost 80-90% of it. The cover text report files are 28MB for each test suite, though some are smaller, around 6MB.
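To confirm where the time goes, I plan to time the DB construction and the cover() call separately with a small Time::HiRes harness along these lines (the cover_db path is just a placeholder for our actual DB):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);
use Devel::Cover::DB;

my $db_dir = shift @ARGV || 'cover_db';   # placeholder path

my $t0 = [gettimeofday];
my $db = Devel::Cover::DB->new(db => $db_dir);
printf "constructing the DB object: %.2fs\n", tv_interval($t0);

# cover() is where the report appears to spend 80-90% of its time,
# so timing it in isolation should confirm the @runs loop is the cost.
$t0 = [gettimeofday];
my $cover = $db->cover;
printf "building cover data:       %.2fs\n", tv_interval($t0);
```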
Hello,
We run Devel::Cover on some long-running harness test suites that take a couple of days to complete. We're seeing a performance issue with the cover <db> -report text command. Issues I'm seeing:
Our harness test suite is set up to enable coverage via the DEVEL_COVER_OPTIONS env var. What I tried to improve performance:
I did a quick benchmark on print_statement() and print_subroutine() in Devel::Cover::Report::Text::report(). The results were not that bad, at least from what I've seen so far: it took about 3 minutes in total to generate a report for each of the 13 test suites. I also tried generating a JSON report, but that report doesn't include 'covered' or 'uncovered' module information, so it's of no use to us. I customized the code to include the covered-modules list as well, but performance still hasn't improved.
Has anyone seen this kind of issue before? I'm at a loss for what other optimizations I could apply to the cover command.
I really appreciate your help/insight into this issue. I'm happy to supply additional data supporting the stats above.
Thank you!