linux-test-project / lcov

genhtml is too slow compared to old version #284

Open icsfy opened 2 months ago

icsfy commented 2 months ago

Running on Ubuntu 20.04, with 16 cores and 32 GB memory. Tested using the same tracefile (about 700 files, 300k+ lines, 20k+ functions, 100k+ branches). genhtml command:

time genhtml --output-directory $OUTPUT --branch-coverage --prefix $PREFIX coverage.info

with ubuntu20.04 default lcov 1.14:

real    0m23.972s
user    0m22.009s
sys     0m1.823s

with lcov 2.0:

real    1m14.681s
user    1m12.299s
sys     0m1.970s

with lcov 2.0(--parallel):

real    2m36.937s
user    21m53.142s
sys     14m31.093s

It seems that adding --parallel makes things worse: it consumes more CPU and memory but brings no advantage. (I thought parallelism should be faster?)

henry2cox commented 2 months ago

In general, parallel execution is faster if you have enough work that can be done independently, have sufficient resources to allow parallel execution without thrashing, and the overhead of parallelism is less than time taken by the particular task. It is definitely possible to create examples where one or more of these is violated.

Also, there is a --profile option - to ask the tool to tell you where it is spending time. The data may be helpful to figure out what is happening and/or to identify bottlenecks for optimization.
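
For example (a hedged sketch - this just adds --profile to the command from your report above; the profile data is written as JSON, cf. the genhtml.json mentioned below):

# same report as before, but also emit profile timing data
time genhtml --profile --output-directory $OUTPUT --branch-coverage --prefix $PREFIX coverage.info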

It is probably worth using current TOT (top of tree), as there have been some performance optimizations since the 2.0 release. (Those optimizations are typically WRT our examples - which may or may not be representative of your examples.)
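
(A hedged sketch of trying top of tree locally; the install prefix is just a placeholder, and the repo's README is the authoritative reference for the build/install steps:)

# hypothetical local install of current top-of-tree lcov
git clone https://github.com/linux-test-project/lcov.git
cd lcov
make install PREFIX=$HOME/lcov-tot
PATH=$HOME/lcov-tot/bin:$PATH genhtml --version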

icsfy commented 2 months ago

Here is some summary info generated from genhtml.json (the --profile output), LCOV version 2.1-beta2.3-g6141622:

without --parallel option:

{"test1": {"overall": 69.693657875061, "html": {"cnt": 884, "min": 0.00220179557800293, "max": 0.715844869613647, "sum": 37.969685792922974, "avg": 0.042952133249912865}, "load": {"cnt": 699, "min": 0.000123023986816406, "max": 0.0527210235595703, "sum": 2.5745861530303955, "avg": 0.0036832419928904086}, "categorize": {"cnt": 699, "min": 0.000234842300415039, "max": 0.246188163757324, "sum": 8.29749059677124, "avg": 0.011870515875209214}, "dir": {"cnt": 184, "min": 0.00324487686157227, "max": 0.0333459377288818, "sum": 1.4431145191192625, "avg": 0.007843013690865557}, "file": {"cnt": 699, "min": 0.00352787971496582, "max": 1.24869990348816, "sum": 56.45982718467713, "avg": 0.08077228495661964}}, "test2": {"overall": 53.7329921722412, "html": {"cnt": 681, "min": 0.00214409828186035, "max": 0.708955049514771, "sum": 28.689658641815186, "avg": 0.04212872047256268}, "load": {"cnt": 533, "min": 0.00011897087097168, "max": 0.0503799915313721, "sum": 1.939279317855835, "avg": 0.003638422735189184}, "categorize": {"cnt": 533, "min": 0.000257968902587891, "max": 0.220337152481079, "sum": 6.480504274368286, "avg": 0.012158544604818548}, "dir": {"cnt": 147, "min": 0.00304913520812988, "max": 0.028817892074585, "sum": 1.0736997127532961, "avg": 0.007304079678593851}, "file": {"cnt": 533, "min": 0.00348591804504395, "max": 1.21855282783508, "sum": 42.94275641441345, "avg": 0.08056802329158246}}, "test3": {"overall": 28.3487250804901, "html": {"cnt": 379, "min": 0.00214195251464844, "max": 0.499402046203613, "sum": 15.58631682395935, "avg": 0.04112484650121201}, "load": {"cnt": 288, "min": 0.000129938125610352, "max": 0.0341091156005859, "sum": 1.0401902198791504, "avg": 0.0036117715968026053}, "categorize": {"cnt": 288, "min": 0.000262975692749023, "max": 0.122973203659058, "sum": 3.4219408035278325, "avg": 0.011881738901138307}, "dir": {"cnt": 90, "min": 0.00344610214233398, "max": 0.0301930904388428, "sum": 0.6219146251678466, "avg": 0.006910162501864962}, "file": {"cnt": 288, "min": 0.00318288803100586, "max": 0.779124021530151, "sum": 23.067707538604736, "avg": 0.08009620673126644}}, "test4": {"overall": 18.4163129329681, "html": {"cnt": 211, "min": 0.00220489501953125, "max": 0.738096952438354, "sum": 9.605504512786867, "avg": 0.04552371807007994}, "load": {"cnt": 166, "min": 0.000113964080810547, "max": 0.0519740581512451, "sum": 0.649959087371826, "avg": 0.003915416188986903}, "categorize": {"cnt": 166, "min": 0.000352144241333008, "max": 0.218640089035034, "sum": 2.2901556491851807, "avg": 0.013796118368585426}, "dir": {"cnt": 44, "min": 0.00253701210021973, "max": 0.0253028869628906, "sum": 0.33654904365539556, "avg": 0.00764884190125899}, "file": {"cnt": 166, "min": 0.0033719539642334, "max": 1.25581407546997, "sum": 14.650397777557371, "avg": 0.08825540829853838}}, "test5": {"overall": 8.58937692642212, "html": {"cnt": 127, "min": 0.00346803665161133, "max": 0.148995161056519, "sum": 4.690181016921997, "avg": 0.03693055918836218}, "load": {"cnt": 106, "min": 0.00026392936706543, "max": 0.0105841159820557, "sum": 0.32061481475830084, "avg": 0.003024668063757555}, "categorize": {"cnt": 106, "min": 0.000406026840209961, "max": 0.0425150394439697, "sum": 1.0230870246887207, "avg": 0.009651764383855855}, "dir": {"cnt": 20, "min": 0.00366497039794922, "max": 0.0290079116821289, "sum": 0.1894917488098145, "avg": 0.009474587440490725}, "file": {"cnt": 106, "min": 0.00561189651489258, "max": 0.248773097991943, "sum": 6.9603416919708225, "avg": 0.06566360086764927}}, "test6": {"overall": 
7.45630288124084, "html": {"cnt": 98, "min": 0.00310301780700684, "max": 0.227667808532715, "sum": 4.2659528255462655, "avg": 0.043530130872921076}, "load": {"cnt": 76, "min": 0.000192880630493164, "max": 0.0157678127288818, "sum": 0.2864232063293456, "avg": 0.0037687263990703367}, "categorize": {"cnt": 76, "min": 0.000235080718994141, "max": 0.0471210479736328, "sum": 0.8142759799957274, "avg": 0.01071415763152273}, "dir": {"cnt": 21, "min": 0.0032801628112793, "max": 0.0259180068969727, "sum": 0.15669584274292, "avg": 0.007461706797281905}, "file": {"cnt": 76, "min": 0.00458002090454102, "max": 0.345099925994873, "sum": 6.142184734344482, "avg": 0.08081822018874318}}}

with --parallel option:

{"test1": {"overall": 139.099180936813, "html": {"cnt": 884, "min": 0.00317001342773438, "max": 0.79960298538208, "sum": 38.1168737411499, "avg": 0.04311863545378948}, "load": {"cnt": 699, "min": 0.000245094299316406, "max": 0.0875239372253418, "sum": 3.3733892440795894, "avg": 0.00482602180841143}, "categorize": {"cnt": 699, "min": 0.000663042068481445, "max": 0.232724905014038, "sum": 11.837536096572876, "avg": 0.01693495865031885}, "directory": {"cnt": 184, "min": 1.12537908554077, "max": 3.34291887283325, "sum": 444.03566646575933, "avg": 2.413237317748692}, "file": {"cnt": 699, "min": 1.06501793861389, "max": 3.74806904792786, "sum": 1511.8968088626862, "avg": 2.1629425019494795}, "child": {"cnt": 884, "min": 0.0118467807769775, "max": 1.38333296775818, "sum": 87.0812258720398, "avg": 0.09850817406339343}}, "test2": {"overall": 87.1324789524078, "html": {"cnt": 681, "min": 0.00324201583862305, "max": 0.801820993423462, "sum": 31.7707302570343, "avg": 0.04665305470930147}, "load": {"cnt": 533, "min": 0.000234127044677734, "max": 0.0558700561523438, "sum": 2.3206124305725098, "avg": 0.004353869475745797}, "categorize": {"cnt": 533, "min": 0.000843048095703125, "max": 0.249961853027344, "sum": 8.851559162139893, "avg": 0.016607052837035445}, "directory": {"cnt": 147, "min": 0.778333902359009, "max": 3.31218600273132, "sum": 280.12529325485224, "avg": 1.9056142398289269}, "file": {"cnt": 533, "min": 0.792023897171021, "max": 3.2999119758606, "sum": 893.5714342594147, "avg": 1.6764942481414908}, "child": {"cnt": 681, "min": 0.0102710723876953, "max": 1.3947548866272, "sum": 66.63386845588684, "avg": 0.097847090243593}}, "test3": {"overall": 27.4819920063019, "html": {"cnt": 379, "min": 0.0034639835357666, "max": 0.458012104034424, "sum": 16.648162126541138, "avg": 0.043926549146546534}, "load": {"cnt": 288, "min": 0.000232934951782227, "max": 0.0298130512237549, "sum": 1.1898078918457031, "avg": 0.0041312774022420245}, "categorize": {"cnt": 288, "min": 0.000811100006103516, "max": 0.153878927230835, "sum": 4.552825689315795, "avg": 0.01580842253234651}, "directory": {"cnt": 90, "min": 0.458359956741333, "max": 1.46185898780823, "sum": 86.64420795440677, "avg": 0.9627134217156308}, "file": {"cnt": 288, "min": 0.456877946853638, "max": 1.90998005867004, "sum": 264.52932047843933, "avg": 0.918504584994581}, "child": {"cnt": 379, "min": 0.00936508178710938, "max": 0.746864080429077, "sum": 32.218116521835334, "avg": 0.08500822301275814}}, "test4": {"overall": 12.3653080463409, "html": {"cnt": 211, "min": 0.00300908088684082, "max": 0.815544128417969, "sum": 9.816606521606444, "avg": 0.046524201524201156}, "load": {"cnt": 166, "min": 0.000207901000976562, "max": 0.0422279834747314, "sum": 0.664856195449829, "avg": 0.004005157803914632}, "categorize": {"cnt": 166, "min": 0.000793933868408203, "max": 0.219452857971191, "sum": 2.8899414539337154, "avg": 0.017409285867070575}, "directory": {"cnt": 44, "min": 0.32911205291748, "max": 1.22095799446106, "sum": 31.903474807739258, "avg": 0.7250789729031649}, "file": {"cnt": 166, "min": 0.310504913330078, "max": 2.15262699127197, "sum": 111.91055321693419, "avg": 0.6741599591381577}, "child": {"cnt": 211, "min": 0.00918817520141602, "max": 1.33370518684387, "sum": 18.484768152236935, "avg": 0.08760553626652576}}, "test5": {"overall": 5.10910105705261, "html": {"cnt": 127, "min": 0.00406289100646973, "max": 0.173205852508545, "sum": 5.150458574295043, "avg": 0.040554791923583015}, "load": {"cnt": 106, "min": 0.000497817993164062, "max": 
0.0174529552459717, "sum": 0.38253927230834966, "avg": 0.0036088610595127325}, "categorize": {"cnt": 106, "min": 0.000803947448730469, "max": 0.0517008304595947, "sum": 1.4405694007873533, "avg": 0.013590277365918427}, "directory": {"cnt": 20, "min": 0.178601980209351, "max": 0.47684907913208, "sum": 7.307308197021484, "avg": 0.36536540985107424}, "file": {"cnt": 106, "min": 0.158262968063354, "max": 0.692163944244385, "sum": 38.51706671714783, "avg": 0.3633685539353569}, "child": {"cnt": 127, "min": 0.00975990295410156, "max": 0.29651403427124, "sum": 9.54466986656189, "avg": 0.07515488083907}}, "test6": {"overall": 3.60685014724731, "html": {"cnt": 98, "min": 0.00437092781066895, "max": 0.213585138320923, "sum": 4.359879493713379, "avg": 0.044488566262381415}, "load": {"cnt": 76, "min": 0.000296115875244141, "max": 0.0146920680999756, "sum": 0.2980520725250243, "avg": 0.003921737796381899}, "categorize": {"cnt": 76, "min": 0.00069880485534668, "max": 0.0452780723571777, "sum": 1.024108409881592, "avg": 0.013475110656336737}, "directory": {"cnt": 21, "min": 0.151760101318359, "max": 0.423412084579468, "sum": 6.200369358062746, "avg": 0.2952556837172736}, "file": {"cnt": 76, "min": 0.127248048782349, "max": 0.550164937973022, "sum": 23.31259274482727, "avg": 0.3067446413793062}, "child": {"cnt": 98, "min": 0.0101768970489502, "max": 0.323693990707397, "sum": 7.506403207778928, "avg": 0.07659595109978498}}}

It seems that the file and directory times increase a lot when --parallel is enabled. As the total number of files increases, the processing time per file seems to increase as well.

henry2cox commented 2 months ago

I fear that is not helpful. If you want me to take a look at the data: please attach the full .json files. If you feel that there is some proprietary information in that data: then munge pathnames to something meaningless (but make sure you retain the full subdirectory structure). You can also use the spreadsheet.py tool from the scripts directory to turn the profile into an Excel spreadsheet, which may be easier to analyze.
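
(A hedged sketch of invoking it - the -o flag and argument order here are assumptions from memory, so check the script's help/usage output for the exact interface:)

# assumption: spreadsheet.py accepts profile JSON file(s) plus an output workbook name
python3 scripts/spreadsheet.py -o stats.xlsx genhtml.json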

If 'file' times are going up: that suggests that your filesystem or network cannot keep up with the demand, and is a bottleneck. Locally: that often means that you need to copy data to compute farm-local (fast) storage - rather than trying to retrieve it from some far-away network share.

icsfy commented 2 months ago

All files are processed locally rather than retrieved over the network. As you suggested, disk storage might be slow, so this time the output directory was mounted as tmpfs, but that didn't help.
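
(The tmpfs mount was roughly along these lines; the size cap is just an example:)

# put the report output directory on RAM-backed tmpfs
sudo mount -t tmpfs -o size=16G tmpfs "$OUTPUT"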

.json files: with_parallel.json without_parallel.json

By the way, spreadsheet.py doesn't seem to work? The output stats.xlsx doesn't seem to contain anything helpful.

henry2cox commented 2 months ago

The data suggests that you are running out of resources/thrashing. We see that the time spent executing in the child process is short (0.11s average, 0.16 std. dev, total of 95.48s), whereas the fork-to-reap time is much longer (2.22s average, 0.66 std. dev - a whopping 1963s sum). I suggest setting a max memory threshold - say, 85% of total - and seeing whether throttling helps.
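
For example (a hedged sketch; --memory takes a limit in MB, and 27000 here assumes roughly 85% of the 32 GB machine described above - adjust for your system):

# throttle forking so that estimated memory use stays under ~27 GB
time genhtml --parallel --memory 27000 --output-directory $OUTPUT --branch-coverage --prefix $PREFIX coverage.info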

Note that we do expect the 'child' processing time to be longer than the single-threaded processing time for the same file or directory object. This reflects the time to fork the process, to serialize/deserialize data, etc. When we have enough work to do and the tasks are sufficiently expensive, then the advantage we see from parallel execution makes up for the additional overhead. When we don't have much to do...it doesn't.

It is also interesting that your data seems to show that lcov 2.0/2.1 is slower than 1.14 in single-threaded mode. I wonder if that is still true, if you enable branch coverage. I had seen that 2.0 was about 2-3X faster than lcov/1.15 - but I haven't tried to benchmark in quite a long time - and older versions would not work in our environment in any case.

icsfy commented 2 months ago

I have written a script to generate a sample project, so the results should be reproducible. gen_sample_proj.tar.gz

# To generate sample project with 800 files, 100 functions per file
./gen_sample_proj.py 800 100
# and then run
cd sample_800_100 && ./build.sh
Results tested on another system with 32 cores, 64 GB memory. I looked at the resource usage, and memory is definitely sufficient:

| LCOV version | 100 files, 100 funcs/file | 200 files, 100 funcs/file | 400 files, 100 funcs/file | 800 files, 100 funcs/file |
| --- | --- | --- | --- | --- |
| 1.14 | 1.14user 0.03system 0:01.18elapsed 99%CPU | 2.15user 0.12system 0:02.28elapsed 100%CPU | 4.44user 0.22system 0:04.67elapsed 99%CPU | 9.31user 0.55system 0:09.88elapsed 99%CPU |
| 1.15 | 1.11user 0.05system 0:01.17elapsed 99%CPU | 2.14user 0.14system 0:02.28elapsed 99%CPU | 4.28user 0.23system 0:04.51elapsed 99%CPU | 8.89user 0.51system 0:09.42elapsed 99%CPU |
| 1.16 | 1.08user 0.07system 0:01.15elapsed 99%CPU | 2.12user 0.11system 0:02.24elapsed 99%CPU | 4.28user 0.16system 0:04.44elapsed 99%CPU | 8.89user 0.41system 0:09.31elapsed 99%CPU |
| 2.1 single-threaded (one "file" time) | 4.91user 0.17system 0:05.09elapsed 99%CPU (0.0318551063537598) | 9.84user 0.25system 0:10.10elapsed 99%CPU (0.03165602684021) | 19.16user 0.55system 0:19.73elapsed 99%CPU (0.0315179824829102) | 40.70user 1.13system 0:41.85elapsed 99%CPU (0.0333621501922607) |
| 2.1 parallel (one "file" time) | 27.17user 11.83system 0:03.59elapsed 1086%CPU (0.367308855056763) | 70.55user 41.25system 0:07.76elapsed 1440%CPU (0.430258989334106) | 342.03user 164.20system 0:24.43elapsed 2071%CPU (1.46309399604797) | 1786.75user 826.82system 1:47.61elapsed 2428%CPU (2.5661039352417) |

We see that in single-threaded mode, 2.1 is slower than 1.14 through 1.16. And for 2.1 in parallel mode, each "file" time increases as the total number of files increases. I don't understand why it is related to the total file count - shouldn't it be the same whether processing 100 files or 800 files?

henry2cox commented 2 months ago

Thanks for the testcase generator...quite helpful. I added a few more timers to the implementation to see if I could understand where time is going in your example.

Serial execution with lcov/2.* is slower than previous versions when generating very simple reports because of new features: the implementation is looking for additional information, branching around code that isn't used in those reports, checking for errors, etc. None of that is free. Based on your example, I made a few small changes to avoid some unnecessary callbacks...but they had a negligible effect on performance.

Parallel execution is not helping in your example for essentially the same reason/the same problem which motivated 40f4b046: when the child process has relatively little to do, then completed children can pile up, waiting in the queue to be reaped. The result is that their idle/wait time is longer than their process time. In that case, the fix was to batch multiple targets into each task passed to the child - so the child active time would be longer (more to do) and so there would be fewer children to merge back to the parent. Something similar could work here as well - but with somewhat higher complexity because we have a dependency tree rather than a simple list.

However: I don't have a strong motivation to implement that at present, as we almost always use more complicated reports - typically, annotated differential coverage reports - and the consequent additional child workload means that completed tasks don't linger on the queue. Your example behaves similarly when I modify it to use some additional features (e.g., annotations). However: if somebody wants to take on improved parallelization - I'm happy to merge the implementation.

If this means that a particular job should use serial processing because parallel execution does not help: I think that is OK. In particular: in a regression environment, we are essentially generating the same report again and again - so it is worth a bit of effort, to benchmark performance periodically to find a good configuration. (Aside: 40f4b046 provides several hooks which can give better performance than the default - for exactly this reason.)
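
(One hedged sketch of such a benchmark, timing the same representative report at a few parallelism levels and keeping whichever wins; %e/%U/%M are GNU time format codes for elapsed time, user time, and max RSS:)

# hypothetical sweep over --parallel levels
for j in 1 2 4 8 16; do
  rm -rf out_bench
  /usr/bin/time -f "--parallel $j: %e s elapsed, %U s user, %M KB max RSS" \
    genhtml --parallel "$j" --output-directory out_bench --branch-coverage --prefix "$PREFIX" coverage.info > /dev/null
done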

Similarly: if this means that some jobs should use an old lcov release: I think that that is OK too. Note that this won't be possible, if you need any features which are not present in the old release. Also note that the .info files generated by recent lcov releases may not be compatible with older releases - so mix-and-match might lead to grief. Two such incompatibilities are function end line numbers and exception branch markers.

My last question is what your goal/your application is. The title of this issue is genhtml is too slow compared to old version. What is your performance requirement? It seems that your actual coverage report generation runs in 1-2 minutes. Is that causing a bottleneck? Rather than trying to make genhtml go faster, it might be better to approach your actual problem from a different direction.

icsfy commented 2 months ago

In my case, several .info files were processed one by one, and --parallel increased the overall time. Generating the coverage reports took almost half of the total time. I noticed that older versions seemed to be faster, so I'm wondering if there is a bug. I may try to process the .info files simultaneously (each without parallelism), which should save time overall.

henry2cox commented 2 months ago

Does "processed one by one" mean "aggregation" - set union? Or a different operation? If "several" is (say) single digit and you aren't doing anything else (filtering, etc): I guess I'm not too surprised. I wouldn't expect parallel to be hugely slower - or hugely faster. If a larger number - and especially if a large number: then that is quite surprising, and I'm curious to see the data. That isn't what we observe - but our use model and project characteristics may be different than yours.

icsfy commented 2 months ago

By "processed one by one" I mean:

genhtml -o output1 --branch --prefix prefix --parallel -- 1.info
genhtml -o output2 --branch --prefix prefix --parallel -- 2.info
genhtml -o output3 --branch --prefix prefix --parallel -- 3.info
...

I need to process 6 .info files in total; the elapsed time is the sum of each file's processing time, which gives me about 276.68 seconds.

Processing the .info files simultaneously (each without parallelism):

genhtml -o output1 --branch --prefix prefix -- 1.info &
genhtml -o output2 --branch --prefix prefix -- 2.info &
genhtml -o output3 --branch --prefix prefix -- 3.info &
...

The elapsed time is determined by the longest processing time among all the .info files processed simultaneously, which gives me about 70.51 seconds.
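
(For reference, a sketch of that simultaneous run, wrapping the same commands in a subshell with wait so that "time" reports the overall elapsed time:)

# launch all reports in the background, then wait for the slowest one
time (
  for i in 1 2 3 4 5 6; do
    genhtml -o output$i --branch --prefix prefix -- $i.info &
  done
  wait
)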