nedbat / coveragepy

The code coverage tool for Python
https://coverage.readthedocs.io
Apache License 2.0

Multiprocessing to speed reporting? #1504

Open juandiegopalomino opened 1 year ago

juandiegopalomino commented 1 year ago

Is your feature request related to a problem? Please describe.
I have a huge project to deal with, with hundreds of files: generating the XML report takes over 2 minutes, and the HTML report over 4 minutes.

Describe the solution you'd like
At least for HTML, there's a loop working over what appear to be independent pieces of data (each writing to a separate location). Would it not make sense to use the multiprocessing library and cut the time down based on the number of available CPUs? https://github.com/nedbat/coveragepy/blob/aa62abd5ff33926f44fe4ec9e985ed3d72ea1f9d/coverage/html.py#L230
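
A minimal sketch of the suggestion, outside coverage.py's actual code: `render_page` and the list of source paths are hypothetical stand-ins for the report loop's internals, not the real API. The point is that each page's work is independent, so a process pool can fan it out across cores:

```python
# Hypothetical sketch of parallelizing a per-file report loop.
# render_page() stands in for coverage.py's real per-file work;
# it is not the project's actual API.
import os
from concurrent.futures import ProcessPoolExecutor, as_completed


def render_page(source_path: str, output_dir: str) -> str:
    """Render the report page for one source file into its own output file."""
    out_path = os.path.join(output_dir, os.path.basename(source_path) + ".html")
    with open(out_path, "w") as f:
        f.write(f"<html><body>report for {source_path}</body></html>")
    return out_path


def render_all(source_paths: list[str], output_dir: str,
               workers: int | None = None) -> list[str]:
    """Render every page, one task per file; workers=None uses all CPUs."""
    os.makedirs(output_dir, exist_ok=True)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(render_page, p, output_dir) for p in source_paths]
        # result() re-raises any exception raised inside a worker process.
        return [f.result() for f in as_completed(futures)]
```

One constraint this pattern imposes: everything passed to `pool.submit` must be picklable, which is exactly the obstacle nedbat runs into below.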

Describe alternatives you've considered
I considered writing a tool in Go that reads the SQLite file and generates the HTML itself, but that seems silly compared to fixing the original tool.

Additional context
This is a great project, thank you for the hard work!

nedbat commented 1 year ago

Huh, this is a great idea :) I wonder if we'd need to give the user some control over how many cores to use? The HTML report is many files of output, so it might be easier to parallelize than the xml (or json, lcov, etc) reports, but I haven't looked into it at all.

juandiegopalomino commented 1 year ago

👍 As a default, maybe use all available cores, but it can definitely be made configurable. I think you can parallelize the xml here: https://github.com/nedbat/coveragepy/blob/master/coverage/xmlreport.py#L95
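
As a sketch of that default-but-configurable behavior (the `resolve_workers` helper is hypothetical, not an existing coverage.py option):

```python
import os


def resolve_workers(requested: int | None = None) -> int:
    """Hypothetical helper: None or 0 means 'use all available cores'."""
    available = os.cpu_count() or 1
    if not requested:
        return available
    # Cap at the machine's core count so we never oversubscribe.
    return min(requested, available)
```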

nedbat commented 1 year ago

Some exploration reveals there is a bunch of refactoring that would need to be done. During reporting, we have a graph of objects that is not picklable: HtmlReport -> HtmlDataGeneration -> Analysis -> CoverageData -> SqliteDb.

This can be done, but it will take some thought and care to do right. Some of it will be pre-loading data instead of holding a reference; some will be keeping the database name rather than the database itself, so we can re-open the database in the spawned process, I think.
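
A minimal sketch of the "keep the db name rather than the db" idea, with a made-up `lines` table standing in for coverage.py's real schema: only picklable values (the path string and file ids) cross the process boundary, and each worker opens its own SQLite connection:

```python
# Hypothetical sketch: pass the database *path* to workers, never a
# live sqlite3.Connection, since connections cannot be pickled.
import sqlite3
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def report_one(db_path: str, file_id: int) -> tuple[int, int]:
    """Open the database by name inside the worker and read from it there."""
    with sqlite3.connect(db_path) as con:
        # The 'lines' table is illustrative, not coverage.py's real schema.
        (count,) = con.execute(
            "SELECT COUNT(*) FROM lines WHERE file_id = ?", (file_id,)
        ).fetchone()
    return file_id, count


def report_all(db_path: str, file_ids: list[int],
               workers: int | None = None) -> list[tuple[int, int]]:
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # partial() binds the picklable path; each worker reconnects itself.
        return list(pool.map(partial(report_one, db_path), file_ids))
```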

nedbat commented 1 year ago

BTW, the branch with the exploration is here: https://github.com/nedbat/coveragepy/tree/nedbat/parallel-reports-1504-metacov