icsfy opened this issue 7 months ago
In general, parallel execution is faster if you have enough work that can be done independently, have sufficient resources to allow parallel execution without thrashing, and the overhead of parallelism is less than the time taken by the particular task. It is definitely possible to create examples where one or more of these is violated.
Also, there is a `--profile` option, to ask the tool to tell you where it is spending time. The data may be helpful to figure out what is happening and/or to identify bottlenecks for optimization.
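For example, something along these lines (a sketch - the output path and tracefile name are placeholders; the profiling data is written as JSON, which is where the genhtml.json referred to below comes from):

```sh
# generate the report as usual, but also record per-step timing data
genhtml -o out_dir --branch --prefix prefix --profile -- coverage.info
# the resulting JSON is what the summaries quoted below were generated from
```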
Probably worth using the current TOT, as there have been some performance optimizations since the 2.0 release. (Those optimizations are typically WRT our examples - which may or may not be representative of your examples.)
Here is some summary info generated from genhtml.json (`--profile` output), LCOV version 2.1-beta2.3-g6141622.

Without the `--parallel` option:
{"test1": {"overall": 69.693657875061, "html": {"cnt": 884, "min": 0.00220179557800293, "max": 0.715844869613647, "sum": 37.969685792922974, "avg": 0.042952133249912865}, "load": {"cnt": 699, "min": 0.000123023986816406, "max": 0.0527210235595703, "sum": 2.5745861530303955, "avg": 0.0036832419928904086}, "categorize": {"cnt": 699, "min": 0.000234842300415039, "max": 0.246188163757324, "sum": 8.29749059677124, "avg": 0.011870515875209214}, "dir": {"cnt": 184, "min": 0.00324487686157227, "max": 0.0333459377288818, "sum": 1.4431145191192625, "avg": 0.007843013690865557}, "file": {"cnt": 699, "min": 0.00352787971496582, "max": 1.24869990348816, "sum": 56.45982718467713, "avg": 0.08077228495661964}}, "test2": {"overall": 53.7329921722412, "html": {"cnt": 681, "min": 0.00214409828186035, "max": 0.708955049514771, "sum": 28.689658641815186, "avg": 0.04212872047256268}, "load": {"cnt": 533, "min": 0.00011897087097168, "max": 0.0503799915313721, "sum": 1.939279317855835, "avg": 0.003638422735189184}, "categorize": {"cnt": 533, "min": 0.000257968902587891, "max": 0.220337152481079, "sum": 6.480504274368286, "avg": 0.012158544604818548}, "dir": {"cnt": 147, "min": 0.00304913520812988, "max": 0.028817892074585, "sum": 1.0736997127532961, "avg": 0.007304079678593851}, "file": {"cnt": 533, "min": 0.00348591804504395, "max": 1.21855282783508, "sum": 42.94275641441345, "avg": 0.08056802329158246}}, "test3": {"overall": 28.3487250804901, "html": {"cnt": 379, "min": 0.00214195251464844, "max": 0.499402046203613, "sum": 15.58631682395935, "avg": 0.04112484650121201}, "load": {"cnt": 288, "min": 0.000129938125610352, "max": 0.0341091156005859, "sum": 1.0401902198791504, "avg": 0.0036117715968026053}, "categorize": {"cnt": 288, "min": 0.000262975692749023, "max": 0.122973203659058, "sum": 3.4219408035278325, "avg": 0.011881738901138307}, "dir": {"cnt": 90, "min": 0.00344610214233398, "max": 0.0301930904388428, "sum": 0.6219146251678466, "avg": 0.006910162501864962}, "file": {"cnt": 288, "min": 0.00318288803100586, "max": 0.779124021530151, "sum": 23.067707538604736, "avg": 0.08009620673126644}}, "test4": {"overall": 18.4163129329681, "html": {"cnt": 211, "min": 0.00220489501953125, "max": 0.738096952438354, "sum": 9.605504512786867, "avg": 0.04552371807007994}, "load": {"cnt": 166, "min": 0.000113964080810547, "max": 0.0519740581512451, "sum": 0.649959087371826, "avg": 0.003915416188986903}, "categorize": {"cnt": 166, "min": 0.000352144241333008, "max": 0.218640089035034, "sum": 2.2901556491851807, "avg": 0.013796118368585426}, "dir": {"cnt": 44, "min": 0.00253701210021973, "max": 0.0253028869628906, "sum": 0.33654904365539556, "avg": 0.00764884190125899}, "file": {"cnt": 166, "min": 0.0033719539642334, "max": 1.25581407546997, "sum": 14.650397777557371, "avg": 0.08825540829853838}}, "test5": {"overall": 8.58937692642212, "html": {"cnt": 127, "min": 0.00346803665161133, "max": 0.148995161056519, "sum": 4.690181016921997, "avg": 0.03693055918836218}, "load": {"cnt": 106, "min": 0.00026392936706543, "max": 0.0105841159820557, "sum": 0.32061481475830084, "avg": 0.003024668063757555}, "categorize": {"cnt": 106, "min": 0.000406026840209961, "max": 0.0425150394439697, "sum": 1.0230870246887207, "avg": 0.009651764383855855}, "dir": {"cnt": 20, "min": 0.00366497039794922, "max": 0.0290079116821289, "sum": 0.1894917488098145, "avg": 0.009474587440490725}, "file": {"cnt": 106, "min": 0.00561189651489258, "max": 0.248773097991943, "sum": 6.9603416919708225, "avg": 0.06566360086764927}}, "test6": {"overall": 
7.45630288124084, "html": {"cnt": 98, "min": 0.00310301780700684, "max": 0.227667808532715, "sum": 4.2659528255462655, "avg": 0.043530130872921076}, "load": {"cnt": 76, "min": 0.000192880630493164, "max": 0.0157678127288818, "sum": 0.2864232063293456, "avg": 0.0037687263990703367}, "categorize": {"cnt": 76, "min": 0.000235080718994141, "max": 0.0471210479736328, "sum": 0.8142759799957274, "avg": 0.01071415763152273}, "dir": {"cnt": 21, "min": 0.0032801628112793, "max": 0.0259180068969727, "sum": 0.15669584274292, "avg": 0.007461706797281905}, "file": {"cnt": 76, "min": 0.00458002090454102, "max": 0.345099925994873, "sum": 6.142184734344482, "avg": 0.08081822018874318}}}
With the `--parallel` option:
{"test1": {"overall": 139.099180936813, "html": {"cnt": 884, "min": 0.00317001342773438, "max": 0.79960298538208, "sum": 38.1168737411499, "avg": 0.04311863545378948}, "load": {"cnt": 699, "min": 0.000245094299316406, "max": 0.0875239372253418, "sum": 3.3733892440795894, "avg": 0.00482602180841143}, "categorize": {"cnt": 699, "min": 0.000663042068481445, "max": 0.232724905014038, "sum": 11.837536096572876, "avg": 0.01693495865031885}, "directory": {"cnt": 184, "min": 1.12537908554077, "max": 3.34291887283325, "sum": 444.03566646575933, "avg": 2.413237317748692}, "file": {"cnt": 699, "min": 1.06501793861389, "max": 3.74806904792786, "sum": 1511.8968088626862, "avg": 2.1629425019494795}, "child": {"cnt": 884, "min": 0.0118467807769775, "max": 1.38333296775818, "sum": 87.0812258720398, "avg": 0.09850817406339343}}, "test2": {"overall": 87.1324789524078, "html": {"cnt": 681, "min": 0.00324201583862305, "max": 0.801820993423462, "sum": 31.7707302570343, "avg": 0.04665305470930147}, "load": {"cnt": 533, "min": 0.000234127044677734, "max": 0.0558700561523438, "sum": 2.3206124305725098, "avg": 0.004353869475745797}, "categorize": {"cnt": 533, "min": 0.000843048095703125, "max": 0.249961853027344, "sum": 8.851559162139893, "avg": 0.016607052837035445}, "directory": {"cnt": 147, "min": 0.778333902359009, "max": 3.31218600273132, "sum": 280.12529325485224, "avg": 1.9056142398289269}, "file": {"cnt": 533, "min": 0.792023897171021, "max": 3.2999119758606, "sum": 893.5714342594147, "avg": 1.6764942481414908}, "child": {"cnt": 681, "min": 0.0102710723876953, "max": 1.3947548866272, "sum": 66.63386845588684, "avg": 0.097847090243593}}, "test3": {"overall": 27.4819920063019, "html": {"cnt": 379, "min": 0.0034639835357666, "max": 0.458012104034424, "sum": 16.648162126541138, "avg": 0.043926549146546534}, "load": {"cnt": 288, "min": 0.000232934951782227, "max": 0.0298130512237549, "sum": 1.1898078918457031, "avg": 0.0041312774022420245}, "categorize": {"cnt": 288, "min": 0.000811100006103516, "max": 0.153878927230835, "sum": 4.552825689315795, "avg": 0.01580842253234651}, "directory": {"cnt": 90, "min": 0.458359956741333, "max": 1.46185898780823, "sum": 86.64420795440677, "avg": 0.9627134217156308}, "file": {"cnt": 288, "min": 0.456877946853638, "max": 1.90998005867004, "sum": 264.52932047843933, "avg": 0.918504584994581}, "child": {"cnt": 379, "min": 0.00936508178710938, "max": 0.746864080429077, "sum": 32.218116521835334, "avg": 0.08500822301275814}}, "test4": {"overall": 12.3653080463409, "html": {"cnt": 211, "min": 0.00300908088684082, "max": 0.815544128417969, "sum": 9.816606521606444, "avg": 0.046524201524201156}, "load": {"cnt": 166, "min": 0.000207901000976562, "max": 0.0422279834747314, "sum": 0.664856195449829, "avg": 0.004005157803914632}, "categorize": {"cnt": 166, "min": 0.000793933868408203, "max": 0.219452857971191, "sum": 2.8899414539337154, "avg": 0.017409285867070575}, "directory": {"cnt": 44, "min": 0.32911205291748, "max": 1.22095799446106, "sum": 31.903474807739258, "avg": 0.7250789729031649}, "file": {"cnt": 166, "min": 0.310504913330078, "max": 2.15262699127197, "sum": 111.91055321693419, "avg": 0.6741599591381577}, "child": {"cnt": 211, "min": 0.00918817520141602, "max": 1.33370518684387, "sum": 18.484768152236935, "avg": 0.08760553626652576}}, "test5": {"overall": 5.10910105705261, "html": {"cnt": 127, "min": 0.00406289100646973, "max": 0.173205852508545, "sum": 5.150458574295043, "avg": 0.040554791923583015}, "load": {"cnt": 106, "min": 0.000497817993164062, "max": 
0.0174529552459717, "sum": 0.38253927230834966, "avg": 0.0036088610595127325}, "categorize": {"cnt": 106, "min": 0.000803947448730469, "max": 0.0517008304595947, "sum": 1.4405694007873533, "avg": 0.013590277365918427}, "directory": {"cnt": 20, "min": 0.178601980209351, "max": 0.47684907913208, "sum": 7.307308197021484, "avg": 0.36536540985107424}, "file": {"cnt": 106, "min": 0.158262968063354, "max": 0.692163944244385, "sum": 38.51706671714783, "avg": 0.3633685539353569}, "child": {"cnt": 127, "min": 0.00975990295410156, "max": 0.29651403427124, "sum": 9.54466986656189, "avg": 0.07515488083907}}, "test6": {"overall": 3.60685014724731, "html": {"cnt": 98, "min": 0.00437092781066895, "max": 0.213585138320923, "sum": 4.359879493713379, "avg": 0.044488566262381415}, "load": {"cnt": 76, "min": 0.000296115875244141, "max": 0.0146920680999756, "sum": 0.2980520725250243, "avg": 0.003921737796381899}, "categorize": {"cnt": 76, "min": 0.00069880485534668, "max": 0.0452780723571777, "sum": 1.024108409881592, "avg": 0.013475110656336737}, "directory": {"cnt": 21, "min": 0.151760101318359, "max": 0.423412084579468, "sum": 6.200369358062746, "avg": 0.2952556837172736}, "file": {"cnt": 76, "min": 0.127248048782349, "max": 0.550164937973022, "sum": 23.31259274482727, "avg": 0.3067446413793062}, "child": {"cnt": 98, "min": 0.0101768970489502, "max": 0.323693990707397, "sum": 7.506403207778928, "avg": 0.07659595109978498}}}
It seems that the `file` and `directory` times increase a lot when `--parallel` is enabled.
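For instance, for test1 (pulling the averages back out of the two summaries above with jq; the filenames here are made up):

```sh
# average per-"file" processing time for test1
jq '.test1.file.avg' summary_serial.json     # ~0.081 s without --parallel
jq '.test1.file.avg' summary_parallel.json   # ~2.16 s with --parallel
# the per-"directory" average jumps similarly, from ~0.0078 s to ~2.41 s
```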
As the total number of files increases, the processing time per file seems to increase as well.
I fear that summary alone is not helpful.
If you want me to take a look at the data: please attach the full .json files.
If you feel that there is some proprietary information in that data: then munge pathnames to something meaningless (but make sure you retain the full subdirectory structure).
You can also use the `spreadsheet.py` tool from the scripts directory to turn the profile into an Excel spreadsheet, which may be easier to analyze.
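Something like this might work (I'm guessing at the exact arguments here, so treat the flags as assumptions and check the script's help output):

```sh
# convert the --profile JSON into an Excel workbook for easier inspection
python3 scripts/spreadsheet.py -o stats.xlsx genhtml.json
```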
If 'file' times are going up: that suggests that your filesystem or network cannot keep up with the demand, and is a bottleneck. Locally: that often means that you need to copy data to compute farm-local (fast) storage - rather than trying to retrieve it from some far-away network share.
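For instance (paths made up, just to illustrate the idea):

```sh
# stage the tracefile (and any sources that get annotated into the report) on fast local storage,
# then generate the report entirely from the local copy
rsync -a /nfs/share/project/ /local/scratch/project/
genhtml -o /local/scratch/project/report --branch --prefix /local/scratch/project -- /local/scratch/project/coverage.info
```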
All files are processed locally rather than retrieved from the network. As you suspected, disk storage might be slow, so this time the output directory was mounted as tmpfs, but that didn't help.
.json files: with_parallel.json without_parallel.json
By the way, `spreadsheet.py` doesn't seem to work? The output stats.xlsx doesn't seem to contain anything helpful.
The data suggests that you are running out of resources/thrashing. We see that the time spent executing in the child process is short (0.11 s average, 0.16 std. dev., 95.48 s total), whereas the fork-to-reap time is much longer (2.22 s average, 0.66 std. dev. - a whopping 1963 s total). I suggest setting a max memory threshold - say, 85% of total - and seeing whether throttling helps.
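For example, something like this (a sketch - check `man genhtml` for the exact `--memory` semantics; I believe the value is in GB, so 85% of a 32 GB machine is roughly 27):

```sh
# throttle parallel workers so their combined memory use stays under ~27 GB
genhtml -o out_dir --branch --prefix prefix --parallel --memory 27 -- coverage.info
```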
Note that we do expect the 'child' processing time to be longer than the single-threaded processing time for the same file or directory object. This reflects the time to fork the process, to serialize/deserialize data, etc. When we have enough work to do and the tasks are sufficiently expensive, then the advantage we see from parallel execution makes up for the additional overhead. When we don't have much to do...it doesn't.
It is also interesting that your data seems to show that lcov 2.0/2.1 is slower than 1.14 in single-threaded mode. I wonder if that is still true, if you enable branch coverage. I had seen that 2.0 was about 2-3X faster than lcov/1.15 - but I haven't tried to benchmark in quite a long time - and older versions would not work in our environment in any case.
I have written a script to generate a sample project, so the results may be reproducible: gen_sample_proj.tar.gz
# To generate a sample project with 800 files, 100 functions per file
./gen_sample_proj.py 800 100
# and then run
cd sample_800_100 && ./build.sh
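(The timings in the table below are in the default output format of GNU time; each cell is roughly the result of a run like this, with a hypothetical tracefile name:)

```sh
# time a single genhtml run over the generated project's tracefile
/usr/bin/time genhtml -o out --branch -- sample.info
```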
Results tested on another system with 32 cores and 64 GB memory. I looked at the resource usage, and memory is definitely sufficient:

| LCOV version | 100 files, 100 funcs/file | 200 files, 100 funcs/file | 400 files, 100 funcs/file | 800 files, 100 funcs/file |
|---|---|---|---|---|
| 1.14 | 1.14user 0.03system 0:01.18elapsed 99%CPU | 2.15user 0.12system 0:02.28elapsed 100%CPU | 4.44user 0.22system 0:04.67elapsed 99%CPU | 9.31user 0.55system 0:09.88elapsed 99%CPU |
| 1.15 | 1.11user 0.05system 0:01.17elapsed 99%CPU | 2.14user 0.14system 0:02.28elapsed 99%CPU | 4.28user 0.23system 0:04.51elapsed 99%CPU | 8.89user 0.51system 0:09.42elapsed 99%CPU |
| 1.16 | 1.08user 0.07system 0:01.15elapsed 99%CPU | 2.12user 0.11system 0:02.24elapsed 99%CPU | 4.28user 0.16system 0:04.44elapsed 99%CPU | 8.89user 0.41system 0:09.31elapsed 99%CPU |
| 2.1 single-threaded (one "file" time) | 4.91user 0.17system 0:05.09elapsed 99%CPU (0.0318551063537598) | 9.84user 0.25system 0:10.10elapsed 99%CPU (0.03165602684021) | 19.16user 0.55system 0:19.73elapsed 99%CPU (0.0315179824829102) | 40.70user 1.13system 0:41.85elapsed 99%CPU (0.0333621501922607) |
| 2.1 parallel (one "file" time) | 27.17user 11.83system 0:03.59elapsed 1086%CPU (0.367308855056763) | 70.55user 41.25system 0:07.76elapsed 1440%CPU (0.430258989334106) | 342.03user 164.20system 0:24.43elapsed 2071%CPU (1.46309399604797) | 1786.75user 826.82system 1:47.61elapsed 2428%CPU (2.5661039352417) |
We see that in single-threaded mode, 2.1 is slower than 1.14~1.16. And for 2.1 in parallel mode, each "file" time increases as the total number of files increases. I don't understand why it is related to the total file count - shouldn't it be the same whether processing 100 files or 800 files?
Thanks for the testcase generator...quite helpful. I added a few more timers to the implementation to see if I could understand where the time is going in your example.
Serial execution with lcov/2.* is slower than previous versions when generating very simple reports because of new features: the implementation is looking for additional information, branching around code that isn't used in those reports, checking for errors, etc. None of that is free. Based on your example, I made a few small changes to avoid some unnecessary callbacks...but they had a negligible effect on performance.
Parallel execution is not helping in your example for essentially the same reason/the same problem which motivated 40f4b046: when the child process has relatively little to do, then completed children can pile up, waiting in the queue to be reaped. The result is that their idle/wait time is longer than their process time. In that case, the fix was to batch multiple targets into each task passed to the child - so the child active time would be longer (more to do) and so there would be fewer children to merge back to the parent. Something similar could work here as well - but with somewhat higher complexity because we have a dependency tree rather than a simple list.
However: I don't have a strong motivation to implement that at present, as we almost always use more complicated reports - typically, annotated differential coverage reports - and the consequent additional child workload means that completed tasks don't linger on the queue. Your example behaves similarly when I modify it to use some additional features (e.g., annotations). However: if somebody wants to take on improved parallelization - I'm happy to merge the implementation.
If this means that a particular job should use serial processing because parallel execution does not help: I think that is OK. In particular: in a regression environment, we are essentially generating the same report again and again - so it is worth a bit of effort, to benchmark performance periodically to find a good configuration. (Aside: 40f4b046 provides several hooks which can give better performance than the default - for exactly this reason.)
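For example, something along these lines (assuming `--parallel` accepts an optional worker count, which you can vary):

```sh
# benchmark a few configurations on a representative tracefile and keep whichever is fastest
for opt in "" "--parallel 4" "--parallel 8" "--parallel 16"; do
  /usr/bin/time -f "%e s elapsed [$opt]" genhtml -o bench_out --branch --prefix prefix $opt -- coverage.info
done
```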
Similarly: if this means that some jobs should use an old lcov release: I think that that is OK too. Note that this won't be possible, if you need any features which are not present in the old release. Also note that the .info files generated by recent lcov releases may not be compatible with older releases - so mix-and-match might lead to grief. Two such incompatibilities are function end line numbers and exception branch markers.
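Roughly, the function-record difference looks like this (from memory, so treat the exact syntax as approximate); 2.x BRDA branch records can similarly carry an exception-branch marker that 1.x does not understand:

```
FN:12,main        <- lcov 1.x: function start line and name
FN:12,20,main     <- lcov 2.x: function start line, end line, and name
```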
My last question is about your goal/your application. The title of this issue is "genhtml is too slow compared to old version". What is your performance requirement? It seems that your actual coverage report generation runs in 1-2 minutes. Is that causing a bottleneck? Rather than trying to make genhtml go faster, it might be better to approach your actual problem from a different direction.
In my case, several .info files were processed one by one, and "--parallel" increased the overall time. Generating the coverage report took almost half of the total time. I noticed that older versions seemed to be faster, so I'm wondering if there was a bug. I may try to process the .info files simultaneously (each without parallelism), which should save time overall.
Does "processed one by one" mean "aggregation" - set union? Or a different operation? If "several" is (say) single digit and you aren't doing anything else (filtering, etc): I guess I'm not too surprised. I wouldn't expect parallel to be hugely slower - or hugely faster. If a larger number - and especially if a large number: then that is quite surprising, and I'm curious to see the data. That isn't what we observe - but our use model and project characteristics may be different than yours.
By "processed one by one" I mean:
genhtml -o output1 --branch --prefix prefix --parallel -- 1.info
genhtml -o output2 --branch --prefix prefix --parallel -- 2.info
genhtml -o output3 --branch --prefix prefix --parallel -- 3.info
...
I need to process a total of 6 .info files. The elapsed time is the sum of each .info file's processing time, which gives me about 276.68 seconds.
Processing the .info files simultaneously (each without parallelism):
genhtml -o output1 --branch --prefix prefix -- 1.info &
genhtml -o output2 --branch --prefix prefix -- 2.info &
genhtml -o output3 --branch --prefix prefix -- 3.info &
...
The elapsed time is determined by the longest processing time among all the .info files processed simultaneously, which gives me about 70.51 seconds.
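In script form, that is roughly the following (same flags as above; `wait` blocks until every background genhtml has finished, and `time` then reports the duration of the longest job):

```sh
time (
  for i in 1 2 3 4 5 6; do
    genhtml -o output$i --branch --prefix prefix -- $i.info &
  done
  wait
)
```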
Running on Ubuntu 20.04, with 16 cores and 32 GB memory. Tested using the same tracefile (about 700 files, 300k+ lines, 20k+ functions, 100k+ branches). genhtml command:
with ubuntu20.04 default lcov 1.14:
with lcov 2.0:
with lcov 2.0 (--parallel):
It seems that adding `--parallel` makes things worse: it consumes more CPU and memory but has no advantage. (I thought parallelism should be faster?)