genhtml 2.2 memory usage regression

jschueller commented 4 days ago

hi,

the recent update to version 2.2 absolutely blows up my memory (I can see a quick 70%/32gb spike then my computer stops responding)

I'm using it to generate reports for a medium-large c++/Python library on an archlinux vm

reproducing it requires a bit of time, since the test-suite has to run:

FROM openturns/archlinux-base
MAINTAINER jschueller
WORKDIR /tmp
RUN sudo pacman -Syu swig lapack libxml2 mold lcov ceres-solver nlopt boost eigen --noconfirm
RUN git clone --depth 1 https://github.com/openturns/openturns.git
RUN cd openturns && cmake -DCMAKE_INSTALL_PREFIX=~/.local \
      -DCMAKE_UNITY_BUILD=ON -DCMAKE_UNITY_BUILD_BATCH_SIZE=32 \
      -DCMAKE_C_FLAGS="--coverage" -DCMAKE_CXX_FLAGS="--coverage -fuse-ld=mold" \
      -DSWIG_COMPILE_FLAGS="-O1" -DUSE_TBB=OFF .
RUN cd openturns && make OT -j16 && make install -j12
RUN cd openturns && ctest -E "cppcheck|Factory" -R "Distribution|Function" --output-on-failure -j12 --repeat after-timeout:2 --schedule-random
# RUN sudo pacman -S python swig cmake lapack libxml2 git --noconfirm
RUN cd openturns && gcov `find lib/src/ -name "*.gcno"`
RUN cd openturns && lcov --capture --directory . --output-file coverage.info --include "*.cxx"
RUN cd openturns && genhtml --output-directory coverage coverage.info

this is adapted from my actual CI scripts: https://github.com/openturns/openturns/blob/master/.ci_support/run_docker_coverage.sh

if I revert back to 2.1 the memory stays <6%:

sudo pacman -U https://archive.archlinux.org/packages/l/lcov/lcov-2.1-1-any.pkg.tar.zst --noconfirm

henry2cox commented 4 days ago

I guess it is the lcov --capture ... part where you see the problem. Can you run that with --profile - which will keep track of some performance statistics, so we can see what is happening. I would not expect to see systematically increased memory requirements. What class of machine are you using? How many cores, how much memory?

What is in your .lcovrc? Have you enabled --parallel execution? Are you using any throttling? How large is medium/large? (Amongst other projects, we build llvm - which is about 800K LOC, no idea how many files and not sure how many testsuites.)

jschueller commented 4 days ago

I double checked, it's definitely the genhtml command, and it seems this one has no --profile option
looking at the logs it seems to choke processing a very large generated swig c++ source file with 77k lines, that's probably a good hint
I'm using a linux vm with 4 cpu / 16 gb ram on remote but also tried locally with 16 cpu / 32 gb ram which should be plenty
I don't have any config files, everything is set from the command line, no --parallel but I'm definitely checking this out if it can speed up the build
if you think llvm then it's medium I guess (270k loc for the c++ lib part only)

henry2cox commented 4 days ago

Sorry - missed your genhtml .. command - was fixated on the capture above it.

Can you check which version of lcov you have installed - and specifically which genhtml you are getting? Recent ones all support --profile - so if yours complains about an unknown parameter, you definitely aren't getting the right one. It could be that you have a bad PERL5LIB and are picking up some very old lcovutil.pm - but I would be very surprised if that was happening and you didn't see all kinds of other issues.
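
One way to check this, as a minimal sketch: `command -v` shows which copy is first on PATH and `--version` what it reports. `show_tool` is a made-up helper name, not something lcov ships.

```shell
#!/bin/sh
# show_tool NAME: hypothetical helper. Prints where NAME resolves on PATH
# and the first line of its --version output.
show_tool() {
    tool_path=$(command -v "$1") || { echo "$1: not found on PATH"; return 1; }
    echo "$1 -> $tool_path"
    "$1" --version 2>&1 | head -n 1
}

show_tool lcov || true
show_tool genhtml || true
# A stray PERL5LIB entry could make perl load an older lcovutil.pm.
echo "PERL5LIB=${PERL5LIB:-<unset>}"
```

If lcov and genhtml resolve to different prefixes, or PERL5LIB points somewhere unexpected, that would be worth mentioning here.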

77K LOC is a lot - but not unprecedented. I would not have expected a huge issue. (removed speculation about a possible issue related to tooltips...but tooltip is not generated without annotation - so that is not the issue.)

That said: I guess that this file is generated code, and I seriously doubt that you care about whether it is exercised or not (do you write or generate tests to test all of it?) If this is true, then you probably want to exclude it anyway - as that code is 25% of your project, and is just obscuring the numbers you actually care about.
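
A rough sketch of what excluding it could look like, assuming the generated sources match `*_wrap.cxx` and the tracefile is called coverage.info (both are assumptions, adjust to your layout). `lcov --remove` writes a filtered tracefile that genhtml then renders; the `run` wrapper and `DRY_RUN` switch are only there so the commands can be previewed.

```shell
#!/bin/sh
# Sketch only: file names and the '*_wrap.cxx' pattern are assumptions.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

# Remove SWIG-generated wrappers from the tracefile, then render the rest.
report_without_generated() {
    info=$1 out_dir=$2
    run lcov --remove "$info" '*_wrap.cxx' --output-file filtered.info &&
        run genhtml --output-directory "$out_dir" filtered.info
}

DRY_RUN=1
report_without_generated coverage.info coverage
```

Setting `DRY_RUN=0` runs the two commands for real.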

henry2cox commented 3 days ago

To verify our speculation around the suspect large file, you could run genhtml twice:

genhtml --profile -o includeHuge --include '*hugefile.c' mydata.info
genhtml --profile -o excludeHuge --exclude '*hugefile.c' mydata.info

If the first runs badly and the second better...then we have confirmed our suspicion.

After confirming, if you run yet again after adding the -v flag to increase verbosity: the log might tell us more about what is happening/what is going wrong. Verbosity makes things slower - so might not be viable. It is also worth watching your HTML output directory - where that file's sourceview is written - to see whether the output is generated, is growing, and what it appears to be doing.
If there is no source generated yet...then the issue is likely in some of the preprocessing steps.

Adding parallelism will somewhat complicate the debugging - as the order things get processed will not be deterministic - however, the child that croaks should be the huge file, assuming that our speculation is correct.

If this is an opensource/public project: it would likely help to attach a tarball of that source file and the captured .info for it. Then we could reproduce the issue (or not) and see about fixing it. If it is a proprietary project...would need to think of alternatives.
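
For the "watch the output directory" step, a small sketch that could be run from a second terminal while genhtml is working. `snapshot` is a hypothetical helper and `coverage` an assumed output directory name.

```shell
#!/bin/sh
# snapshot DIR: print a timestamp, the number of files under DIR and its
# total size in KiB, or a note if DIR does not exist yet.
snapshot() {
    d=$1
    if [ -d "$d" ]; then
        printf '%s files=%s kib=%s\n' "$(date +%T)" \
            "$(find "$d" -type f | wc -l | tr -d ' ')" \
            "$(du -sk "$d" | cut -f1)"
    else
        printf '%s %s not created yet\n' "$(date +%T)" "$d"
    fi
}

snapshot "${OUT_DIR:-coverage}"
```

Calling it in a loop, for example `while sleep 5; do snapshot coverage; done`, shows whether the directory appears and whether it keeps growing.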

henry2cox commented 3 days ago

I generated a fake example and ran with my sandbox genhtml

$ /usr/bin/time -f '%E %M' ../../lcov/bin/genhtml -o x output.info --profile
Reading tracefile output.info.
Found 2 entries.
Found common filename prefix "/home/hcox/sample"
Generating output.
Processing file sample/main.c
  lines=1002 hit=1002 functions=1 hit=1
Processing file sample/src/func0.c
  lines=78000 hit=40000 functions=1000 hit=1000
Overall coverage rate:
  source files: 2
  lines.......: 51.9% (41002 of 79002 lines)
  functions...: 100.0% (1001 of 1001 functions)
Message summary:
  no messages were reported
0:11.44 906052 0

Seems not to exhibit problems.

I think I need more information, to reproduce your issue.
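
For anyone who wants to try a synthetic case of roughly that shape: a sketch of how one could be generated. This is not the script used for the run above. `make_sample`, the file names and the function layout are made up, only the tracefile keywords (TN, SF, FN, FNDA, FNF, FNH, DA, LF, LH, end_of_record) come from the tracefile format, and the result has not been run through genhtml here.

```shell
#!/bin/sh
# make_sample DIR NFUNC K: hypothetical generator.
# Writes DIR/src/func0.c with NFUNC functions of K lines each (K >= 2), and
# DIR/output.info marking every line of the even-numbered functions as hit.
make_sample() {
    dir=$1 nfunc=$2 k=$3
    mkdir -p "$dir/src" || return 1
    src=$(cd "$dir/src" && pwd)/func0.c

    # Source file: one header line, K-2 body lines, one return line per function.
    awk -v n="$nfunc" -v k="$k" 'BEGIN {
        for (f = 0; f < n; f++) {
            printf "int f%d(int x) {\n", f
            for (i = 0; i < k - 2; i++) printf "  x += %d;\n", i
            print "  return x; }"
        }
    }' > "$src"

    # Tracefile: function records first, then one DA record per source line.
    awk -v n="$nfunc" -v k="$k" -v sf="$src" 'BEGIN {
        print "TN:"
        print "SF:" sf
        for (f = 0; f < n; f++) printf "FN:%d,f%d\n", f * k + 1, f
        for (f = 0; f < n; f++) printf "FNDA:%d,f%d\n", (f + 1) % 2, f
        printf "FNF:%d\nFNH:%d\n", n, int((n + 1) / 2)
        for (f = 0; f < n; f++)
            for (i = 1; i <= k; i++)
                printf "DA:%d,%d\n", f * k + i, (f + 1) % 2
        printf "LF:%d\nLH:%d\n", n * k, int((n + 1) / 2) * k
        print "end_of_record"
    }' > "$dir/output.info"
}

make_sample sample 1000 78
```

With these arguments the source file has 78000 lines, and `sample/output.info` can then be fed to genhtml the same way as above.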

jschueller commented 2 days ago

ok, I think I have a much quicker way to reproduce with a single generated swig file of 15k lines:

#!/bin/sh
# need to install cmake, swig, numpy
set -xe
cd /tmp
rm -rf nlopt
git clone --depth 1 https://github.com/stevengj/nlopt.git
cd nlopt
cmake -DCMAKE_INSTALL_PREFIX=$PWD/install \
      -DCMAKE_UNITY_BUILD=ON -DCMAKE_UNITY_BUILD_BATCH_SIZE=32 \
      -DCMAKE_C_FLAGS="--coverage" -DCMAKE_CXX_FLAGS="--coverage" \
      -DSWIG_COMPILE_FLAGS="-O1" -DNLOPT_TESTS=ON -DNLOPT_GUILE=OFF -DNLOPT_OCTAVE=OFF .
make install -j8
ctest -R "memoize|python" --output-on-failure -j12 --repeat after-timeout:2 --schedule-random
gcov `find src/ -name "*.gcno"`
echo "gcov OK"
lcov --capture --directory . --output-file coverage.info --include "*_wrap.cxx"
echo "lcov OK"
ulimit -Sv 100000
genhtml --profile --output-directory coverage coverage.info
echo "genhtml OK"

with 2.2 the ram goes above 100mb so the ulimit command here should make it fail
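
A variant of the cap, as a sketch: running the limited command in a subshell keeps the limit from applying to the rest of the script. `run_capped` is a hypothetical helper name. Note that `ulimit -v` limits virtual address space (in KiB), which is not the same number as resident memory.

```shell
#!/bin/sh
# run_capped LIMIT_KIB CMD [ARGS...]: run CMD with a soft virtual-memory
# cap, inside a subshell so the limit does not leak into the caller.
run_capped() {
    limit_kib=$1
    shift
    (
        ulimit -Sv "$limit_kib" || exit 2
        "$@"
    )
}

# Harmless demonstration; the real call would look like
#   run_capped 100000 genhtml --profile --output-directory coverage coverage.info
run_capped 1000000 true && echo "command ran under the cap"
```

The return status is that of the capped command, so a failure under the cap can be told apart from a normal run.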

jschueller commented 2 days ago

seems it would come from 5f97fb44 which ironically tries to reduce memory footprint:

5f97fb4493a5f83b874fdc59d6e57a0bde31dde2 is the first bad commit
#14 160.9 commit 5f97fb4493a5f83b874fdc59d6e57a0bde31dde2
#14 160.9 Author: Henry Cox <henry.cox@mediatek.com>
#14 160.9 Date:   Mon Apr 15 17:18:44 2024 -0400
#14 160.9 
#14 160.9     Improved runtime peformance:
#14 160.9       - memory footprint: don't store redundant data
#14 160.9       - automate retry if child process killed due to out-of-memory
#14 160.9       - segmented HTML generation
#14 160.9     
#14 160.9     Signed-off-by: Henry Cox <henry.cox@mediatek.com>
#14 160.9 
#14 160.9  bin/genhtml                       | 1108 +++++++++++++++++++++++++------------
#14 160.9  bin/geninfo                       |  350 ++++++------
#14 160.9  bin/lcov                          |   14 +-
#14 160.9  bin/perl2lcov                     |   10 +-
#14 160.9  lib/lcovutil.pm                   |  795 ++++++++++++++------------
#14 160.9  man/genhtml.1                     |   18 +
#14 160.9  man/lcovrc.5                      |   23 +-
#14 160.9  scripts/spreadsheet.py            |   29 +-
#14 160.9  tests/Makefile                    |    2 +-
#14 160.9  tests/gendiffcov/simple/script.sh |   58 +-
#14 160.9  tests/lcov/extract/extract.sh     |    6 +-
#14 160.9  11 files changed, 1498 insertions(+), 915 deletions(-)
#14 160.9 bisect found first bad commit

the change looks pretty big so hard to see what's going on
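
For reference, a bisect like this can be automated with `git bisect run`. Below is a sketch of the kind of predicate that could drive it, not the exact script used here: `bisect_step`, the tracefile path and the tag names in the comments are assumptions, and it assumes genhtml can be run straight from the checkout as `bin/genhtml`.

```shell
#!/bin/sh
# bisect_step INFO: hypothetical predicate for `git bisect run`, meant to be
# run from the top of an lcov checkout.
# Exit codes follow git's convention: 0 = good, 1 = bad, 125 = skip.
bisect_step() {
    info=$1
    limit_kib=${LIMIT_KIB:-100000}        # same cap as the reproducer
    [ -x bin/genhtml ] || return 125      # nothing to test at this commit
    [ -r "$info" ] || return 125          # reproducer data is missing
    out=$(mktemp -d) || return 125
    # Subshell keeps the cap local to this one genhtml run.
    ( ulimit -Sv "$limit_kib" && ./bin/genhtml --output-directory "$out" "$info" ) >/dev/null 2>&1
    status=$?
    rm -rf "$out"
    if [ "$status" -eq 0 ]; then return 0; fi
    return 1
}

# Intended use (v2.1 and v2.2 are assumed tag names):
#   git bisect start && git bisect bad v2.2 && git bisect good v2.1
#   git bisect run sh step.sh /tmp/nlopt/coverage.info
# where step.sh holds this function and ends with: bisect_step "$1"
```

Any genhtml failure counts as "bad" in this sketch, so a commit that fails for an unrelated reason would mislead the bisect.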

henry2cox commented 2 days ago

Recipe worked up to ctest - then not. Perhaps pilot error, perhaps missing tool - but doesn't work.

hcox@HC-EL7-6040616:(master *):~/nlopt$ ctest -R 'memoize|python' --output-on-failure -j12 --repeast after-timeout:2 --schedule-random
Test project /home/hcox/nlopt
No tests were found!!!

jschueller commented 2 days ago

maybe swig or numpy headers are not installed, could you show the output of the cmake command ?

henry2cox commented 2 days ago

That was a good clue. Found a swig and loaded that. Now I find some tests, and some .gcda files get generated - and then it gets weird:

I was building with gcc/10.2.0
After running tests, gcov complains that the data found is for the wrong gcov version:

  /home/hcox/nlopt/CMakeFiles/nlopt.dir/src/util/mt19937ar.c.gcno:version '408R', prefer 'B02*'
geninfo: ERROR: (version) Incompatible GCC/GCOV version found while processing /home/hcox/nlopt/CMakeFiles/nlopt.dir/src/util/mt19937ar.c.gcda:
        Your test was built with '408R'.
        You are trying to capture with gcov tool '/mtkoss/gcc/10.2.0-rhel7/x86_64/bin/gcov' which is version '10.2'.

despite that I was using gcc/10.2.0, data was somehow generated for gcc/4.8.* How? Why?
- so I load a gcc/4.8.5 instead - so that gcov/4.8.5 is in my path - so that the gcov version I have matches the data that got generated. That seems to work:

Finished .info-file creation
Summary coverage rate:
  source files: 1
  lines.......: 13.6% (699 of 5156 lines)
  functions...: 18.0% (81 of 449 functions)

Number of lines looks much smaller than you suggested - but OK.

then run genhtml under time -f "time %E mem %M" ...

Overall coverage rate:
  source files: 1
  lines.......: 13.6% (699 of 5156 lines)
  functions...: 18.0% (81 of 449 functions)
Message summary:
  112 warning messages:
    category: 11
    count: 1
    inconsistent: 100
  750 ignore messages:
    inconsistent: 750
time 0:01.91 mem 180488

which seems not excessive.

jschueller commented 2 days ago

the build should have generated a file named nloptPYTHON_wrap.cxx, else it would mean that python was not found, cmake output should look like this (depending on the versions):

-- Found Python: /usr/bin/python3.12 (found suitable version "3.12.7", minimum required is "3.6") found components: Interpreter Development.Module Development.SABIModule
-- Found NumPy: /usr/lib/python3.12/site-packages/numpy/_core/include (found version "2.1.3")
-- Found SWIG: /usr/bin/swig (found suitable version "4.2.1", minimum required is "3")
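
A quick way to confirm the wrapper was generated and see how large it is, as a sketch. `list_wrappers` is a made-up helper and the `*_wrap.cxx` pattern is assumed from the file name above.

```shell
#!/bin/sh
# list_wrappers DIR: print the line count of every SWIG wrapper source
# (*_wrap.cxx) found under DIR (default: current directory).
list_wrappers() {
    find "${1:-.}" -type f -name '*_wrap.cxx' -exec wc -l {} +
}

list_wrappers .
```

No output means no wrapper was generated in that tree.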

henry2cox commented 2 days ago

Yes. Seems to be generated. About 15K lines in my build.

If running exactly this example, exactly this way shows performance issues in your environment - then there is something different between us.

My version of lcov is a bit ahead of the upstream one - so it is possible that something got fixed, at some point.

I'm still (very) confused about why I seem to need to use the 4.8.5 version of gcov: that is very strange.

Which platform, compiler and version, and perl version are you using? (Mine is RHEL, gcc/10.2.0 - but I needed gcov from 4.8.5 for unknown reasons - and perl/5.22.0. None of this may be significant.)
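
One thing that could be worth ruling out for the gcov mismatch (a guess, not verified here): CMake records the compiler it picked at configure time in CMakeCache.txt, and that may differ from the gcc that is first on PATH. A minimal sketch of the check, assuming an in-source build in the current directory. `cached_compiler` is a made-up helper name.

```shell
#!/bin/sh
# cached_compiler DIR: print the C compiler recorded in DIR/CMakeCache.txt.
# Cache entries look like: CMAKE_C_COMPILER:FILEPATH=/usr/bin/cc
cached_compiler() {
    sed -n 's/^CMAKE_C_COMPILER:[A-Z]*=//p' "$1/CMakeCache.txt"
}

build_dir=${BUILD_DIR:-.}     # assumed: in-source build, as in the recipe
if [ -f "$build_dir/CMakeCache.txt" ]; then
    cc_path=$(cached_compiler "$build_dir")
    echo "configured compiler: ${cc_path:-<none recorded>}"
    if [ -n "$cc_path" ]; then "$cc_path" --version | head -n 1; fi
fi
# gcov from PATH is what lcov uses unless --gcov-tool says otherwise.
if command -v gcov >/dev/null 2>&1; then gcov --version | head -n 1; fi
```

If the two versions disagree, reconfiguring with `-DCMAKE_C_COMPILER=...` or passing the matching gcov to lcov via `--gcov-tool` would be the usual ways to line them up.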

jschueller commented 1 day ago

I'm using archlinux with gcc 14.2.1, perl 5.40

henry2cox commented 1 day ago

Can you attach your c++ file and your .info file, after verifying that those two do exhibit excessive memory consumption?

Then I will run exactly your data..and compare.

jschueller commented 19 hours ago

ok:

nloptPYTHON_wrap.cxx.txt

coverage.info.txt

linux-test-project / lcov

genhtml 2.2 memory usage regression #329