Parallel processing throughput keeps decreasing

aallrd commented 1 year ago

Hello,

I am trying to use lcov to process a lot of coverage data from a big project. The server I am using has 56 cores available, I am using --parallel 12 and I see 12 geninfo child processes running as expected. I would expect constant files/s processing but I observe the reported throughput on the stdout going down as time passes, and the remaining time going up.

$ lcov --version
lcov: LCOV version 2.0-1
$ geninfo --version
geninfo: LCOV version 2.0-1
$ lcov \
  --parallel 12 \
  --test-name 639717 \
  --rc geninfo_unexecuted_blocks=1 \
  --output-file coverage.639717.info \
  --gcov-tool $(pwd)/gcov \
  --capture \
  --build-directory SOURCES \
  --directory COVERAGE/639717
Capturing coverage data from COVERAGE/639717
geninfo cmd: 'geninfo COVERAGE/639717 --output-filename coverage.639717.info --test-name 639717 --build-directory SOURCES --gcov-tool /opt/bin/gcov --rc geninfo_unexecuted_blocks=1 --parallel 12 --memory 0'
Found gcov version: 11.2.1
Using intermediate gcov format
Writing temporary data to /tmp/geninfo_datVndA
Scanning COVERAGE/639717 for .gcda files ...
Found 44782 data files in COVERAGE/639717
files remaining: 44582 (101.31 files/s - predict 7.33 minutes remaining)
files remaining: 44382 (74.21 files/s - predict 9.97 minutes remaining)
files remaining: 44182 (60.64 files/s - predict 12.14 minutes remaining)
files remaining: 43982 (53.95 files/s - predict 13.59 minutes remaining)
files remaining: 43782 (49.13 files/s - predict 14.85 minutes remaining)
files remaining: 43582 (45.17 files/s - predict 16.08 minutes remaining)
files remaining: 43382 (41.93 files/s - predict 17.24 minutes remaining)
[...]
files remaining: 39382 (15.89 files/s - predict 41.31 minutes remaining)
[...]

I am not seeing any resource issue / contention on the server. Do you have an idea why the lcov throughput is going down?

Thank you.

henry2cox commented 1 year ago

One thing that is likely to be costing time is that you are using the "testname' feature. This results in a constantly growing DB because the tool has to hold (unique) data for each testcase. If you need to know what test hit which code: then you are kinda screwed - and just have to bite the bullet. You will see the master process size increase steadily - which may ultimately also limit parallelism. If you don't need to keep track of individual test coverage and only want the aggregate, then the --forget-test-names feature might help you.

The other thing you probably want to do is to add --profile to your command line, then take a look at the performance spreadsheet to see if lcov/geninfo is doing anything that is obviously stupid.

henry2cox commented 1 year ago

Incidentally: if you have 56 cores - you might want to use --parallel 0 (to let the tool decide how many cores to use) and --memory someReasonableBound (to give it hints about how to throttle to avoid thrashing). That might help to go faster. We see near linear performance improvement up to close to the number of available cores. YMMV - especially if you get into thrashing range.

aallrd commented 1 year ago

I removed --test-name and added --forget-test-names, the throughput was still going down. It seems I cannot set a memory value, is the below command valid?

$ lcov \
  --parallel 0 \
  --memory 2000 \
  --forget-test-names \
  --rc geninfo_unexecuted_blocks=1 \
  --output-file coverage.639717.info \
  --gcov-tool $(pwd)/gcov \
  --capture \
  --build-directory SOURCES \
  --directory COVERAGE/639717
Capturing coverage data from COVERAGE/639717
geninfo cmd: 'geninfo COVERAGE/639717 --output-filename coverage.639717.info --build-directory SOURCES --gcov-tool /opt/bin/gcov --rc geninfo_unexecuted_blocks=1 --parallel 0 --memory 0.0019073486328125'
geninfo: WARNING: Value "0.0019073486328125" invalid for option memory (number expected)
Use geninfo --help to get usage information

henry2cox commented 1 year ago

Damn. Yes - your command is correct..but you hit a bug in my code :-(

In the process of fixing. In the meantime, the workaround is to multiply the limit you actually want by 2^20.

Sigh. He who makes no mistakes, makes nothing at all.

I'm also a bit skeptical that dropping test names is going to help...that issue is mostly about aggregation, not capture. Could be something else in the code. Unfortunately, there is not a lot of debug/diagnostic print in this area to help us debug. Almost have to wait for process to complete and look at shrapnel. When this happens internally: I just run stuff in the debugger to see what is happening. Can you monitor process size of the master and of the slaves? Will help to know if they are roughly constant size or increasing, and if their total CPU usage is roughly constant or increasing. (Profile will tell us that...but not until the end.)

aallrd commented 1 year ago

I have reduced the number of files to be able to reach the end of the process. I tried your workaround for the memory and managed to set it to 4000M, but the throughput is worse and still decreasing:

$ lcov \
  --parallel 0 \
  --memory 4194304000 \
  --forget-test-names \
  --rc geninfo_unexecuted_blocks=1 \
  --output-file coverage.639717.info \
  --gcov-tool /opt/bin/gcov \
  --capture \
  --build-directory SOURCES \
  --directory COVERAGE/639717
Capturing coverage data from COVERAGE/639717
geninfo cmd: 'geninfo COVERAGE/639717 --output-filename coverage.639717.info --build-directory SOURCES --gcov-tool /opt/bin/gcov --rc geninfo_unexecuted_blocks=1 --parallel 0 --memory 4000'
Found gcov version: 11.2.1
Using intermediate gcov format
Writing temporary data to /tmp/geninfo_datAs4M
Scanning COVERAGE/639717 for .gcda files ...
Found 2894 data files in COVERAGE/639717
files remaining: 2694 (20.38 files/s - predict 2.20 minutes remaining)
files remaining: 2494 (9.35 files/s - predict 4.45 minutes remaining)
files remaining: 2294 (9.63 files/s - predict 3.97 minutes remaining)
files remaining: 2094 (9.29 files/s - predict 3.76 minutes remaining)
files remaining: 1894 (8.72 files/s - predict 3.62 minutes remaining)
files remaining: 1694 (8.12 files/s - predict 3.48 minutes remaining)
files remaining: 1494 (7.54 files/s - predict 3.30 minutes remaining)
[...]

I am trying to get the profile you requested but it seems I have an issue of my own with source paths and the process fails with an error preventing I guess the report from being generated:

$ lcov \
  --parallel 0 \
  --profile profile.json \
  --forget-test-names \
  --rc geninfo_unexecuted_blocks=1 \
  --output-file coverage.639717.info \
  --gcov-tool /opt/bin/gcov \
  --capture \
  --build-directory SOURCES  \
  --directory COVERAGE/639717
Capturing coverage data from COVERAGE/639717
geninfo cmd: 'geninfo COVERAGE/639717 --output-filename coverage.639717.info --build-directory SOURCES --gcov-tool /opt/bin/gcov --rc geninfo_unexecuted_blocks=1 --profile profile.json --parallel 0 --memory 0'
Found gcov version: 11.2.1
Using intermediate gcov format
Writing temporary data to /tmp/geninfo_datTx2r
Scanning COVERAGE/639717 for .gcda files ...
Found 2894 data files in COVERAGE/639717
files remaining: 2694 (127.46 files/s - predict 0.35 minutes remaining)
files remaining: 2494 (110.98 files/s - predict 0.37 minutes remaining)
files remaining: 2294 (90.99 files/s - predict 0.42 minutes remaining)
files remaining: 2094 (80.45 files/s - predict 0.43 minutes remaining)
files remaining: 1894 (72.26 files/s - predict 0.44 minutes remaining)
files remaining: 1694 (67.84 files/s - predict 0.42 minutes remaining)
files remaining: 1494 (64.25 files/s - predict 0.39 minutes remaining)
files remaining: 1294 (61.36 files/s - predict 0.35 minutes remaining)
files remaining: 1094 (55.86 files/s - predict 0.33 minutes remaining)
files remaining: 894 (53.68 files/s - predict 0.28 minutes remaining)
files remaining: 694 (50.69 files/s - predict 0.23 minutes remaining)
files remaining: 494 (48.71 files/s - predict 0.17 minutes remaining)
files remaining: 294 (46.41 files/s - predict 0.11 minutes remaining)
files remaining: 94 (44.53 files/s - predict 0.04 minutes remaining)
geninfo: ERROR: geninfo: ERROR: geninfo: ERROR: geninfo: ERROR: unable to open /proc/self/cwd/cpp/source.cpp: No such file or directory
        (use "geninfo --ignore-errors source ..." to bypass this error)

What can I do to remove/replace the prefix /proc/self/cwd from the source file path? I have seen a --substitute REGEXP option but I don't know how to use it, or if it would be useful.

The resulting profile.json content from the above failing command just in case:

{
  "config": {
    "uname": "Linux foo 4.18.0-372.9.1.el8.x86_64 #1 SMP Fri Apr 15 22:12:19 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux",
    "tool_dir": "/usr/bin",
    "bin": "/usr/bin",
    "date": "Wed Jul 12 17:55:19 CEST 2023",
    "version": "LCOV version 2.0-1",
    "cmdLine": "geninfo COVERAGE/639717 --output-filename coverage.639717.info --build-directory SOURCES --gcov-tool /opt/bin/gcov --rc geninfo_unexecuted_blocks=1 --substitute /proc/self/cwd --profile profile.json --parallel 0 --memory 0",
    "cores": 56,
    "maxParallel": 56,
    "tool": "geninfo",
    "url": "https://github.com//linux-test-project/lcov"
  },
  "total": 0.0400650501251221,
  "find": {
    "COVERAGE/639717": 0.0244309902191162
  },
  "gen_info": {
    "COVERAGE/639717": 0.0268959999084473
  }
}

henry2cox commented 1 year ago

You can add --keep-going to the command line - so that the tool will ignore all the issues that it can (you may still hit a fatal error that cannot be overlooked).

You are correct WRT pathname munging. If the paths that got encoded into your coverage data (during build) are not what you want/need - then you can use --substitute regexp to change them (see the man page for examples - but it is basically Perl syntax regexps). Fundamentally, this is no different than just running sed on the data after you generated it. However, if the tool is looking for source code so that it can do some filtering, then you either need to use --substitute (so the paths are correct and the sources can be found) or to skip filtering in this step and do it later (after sed).

Profile looks wrong/truncated - probably due to the error that was not skipped.

henry2cox commented 1 year ago

Fix for the --memory issue in fe6cbb884094 (Performance issue - if any - is not resolved.)

aallrd commented 1 year ago

I manages to get the profile.json:

$ lcov \
  --parallel 0 \
  --profile profile.json \
  --capture \
  --gcov-tool $(which gcov) \
  --forget-test-names \
  --rc geninfo_unexecuted_blocks=1 \
  --output-file coverage.639717.info \
  --build-directory SOURCES   \
  --directory COVERAGE/639717   \
  --substitute 's#/proc/self/cwd/#SOURCES/#g' \
  --exclude '/opt/*' \
  --exclude '/usr/*' \
  --keep-going
Found gcov version: 11.2.1
Using intermediate gcov format
Writing temporary data to /tmp/geninfo_datcWmu
Scanning COVERAGE/639717 for .gcda files ...
Found 2894 data files in COVERAGE/639717
files remaining: 2694 (197.62 files/s - predict 0.23 minutes remaining)
files remaining: 2494 (169.77 files/s - predict 0.24 minutes remaining)
files remaining: 2294 (140.87 files/s - predict 0.27 minutes remaining)
files remaining: 2094 (123.27 files/s - predict 0.28 minutes remaining)
files remaining: 1894 (116.29 files/s - predict 0.27 minutes remaining)
files remaining: 1694 (92.70 files/s - predict 0.30 minutes remaining)
files remaining: 1494 (87.59 files/s - predict 0.28 minutes remaining)
files remaining: 1294 (82.76 files/s - predict 0.26 minutes remaining)
files remaining: 1094 (75.72 files/s - predict 0.24 minutes remaining)
files remaining: 894 (70.73 files/s - predict 0.21 minutes remaining)
files remaining: 694 (65.40 files/s - predict 0.18 minutes remaining)
files remaining: 494 (62.14 files/s - predict 0.13 minutes remaining)
files remaining: 294 (58.00 files/s - predict 0.08 minutes remaining)
files remaining: 94 (53.88 files/s - predict 0.03 minutes remaining)
Excluded data for 118 files due to include/exclude options
geninfo: WARNING: Error: unknown line '70' in SOURCES/lib/awesome/file.cpp: there are only 68 lines in file.
Finished .info-file creation

The profile is a really big file with many keys, do you have hints at what I should look at first?

$ jq keys < profile.json
[
  "append",
  "child",
  "config",
  "data",
  "emit",
  "exec",
  "find",
  "gen_info",
  "merge",
  "parse",
  "process",
  "total",
  "undump"
]

henry2cox commented 1 year ago

One possibility (that may not be doable, depending on your company policies) - is to mail me a compressed version of the file. It isn't too hard to deduce a good address.

The other thing to do if you have not already is to generate a spreadsheet so you can see some basic statistics: $LCOV_HOME/share/lcov/support-scripts/spreadsheet.py -o foo.xlsx profile.json

One of the basic questions is whether the distribution of CPU times and memory is relatively uniform or if there are substantial outliers - then to figure out why the outliers exist. Another basic question is whether some particular step is taking more time or memory than we expect.

Have you been able to extract from your full set of files..and how long did that take (wall clock) compared to this subset? Are there any unusual features of that data that we don't see in the smaller subset?

You may already be doing it, but for large examples, we typically fragment two ways: split into N subsets and send each of those to the compute farm - then aggregate the result at the end. One particular job went from about 17 hours to just under 20 minutes (combination of tricks + at least one or two bug fixes). YMMV.

aallrd commented 1 year ago

I have managed to "anonymize" the reports so I can attach them here: anonymized-reports.tar.gz These files are from the above subset run on 2894 files, where we can already see the throughput decreasing over time. I'll try to get the same reports with with our target 44782 files during the night, and measure how long it takes.

If I understood your trick, it would mean:

splitting my directory COVERAGE/639717 containing all the gcda files into several directories COVERAGE/639717-X
run as many lcov capture commands as X with --build-directory SOURCES --directory COVERAGE/639717-X
run as many lcov add-tracefile commands as X on the collection of coverage.639717.X.infofiles

Does that sound right?

henry2cox commented 1 year ago

Right - except that the last step is to run lcov --add once, to merge all the .info files generated in step 2. (You can use glob matching for this...one of the reasons that that option exists.)

The spreadsheet seems to indicate that it took about 80s wall clock time to process about 2900 files. Observed parallelism speedup was ~27 on your 56-core machine. You didn't enable memory throttling (just an observation - I have no reason to think that enabling or disabling would have any effect). It takes 1.1s to for a slave to process 1 file on average; stddev is 0.65 (so pretty high - not sure why) Does this look correct?

I don't collect timestamps - so I can't easily see whether performance degrades as we go along. (But I do have a small patch which emits a bit more information in the interval callback, which might help.)

aallrd commented 1 year ago

The full run on 44782 files was not able to complete and seems stuck, it is still "running" after ~12h.

$ date
Thu Jul 13 20:55:02 CEST 2023
$ lcov [...] &
Capturing coverage data from COVERAGE/639717
geninfo cmd: 'geninfo COVERAGE/639717 --output-filename coverage.639717.info --build-directory SOURCES --gcov-tool /opt/bin/gcov --keep-going --rc geninfo_unexecuted_blocks=1 --exclude /opt/* --exclude /usr/* --substitute s#/proc/self/cwd/#SOURCES/#g --profile profile.json --parallel 0 --memory 0'
Found gcov version: 11.2.1
Using intermediate gcov format
Writing temporary data to /tmp/geninfo_dat6s1X
Scanning COVERAGE/639717 for .gcda files ...
Found 44782 data files in COVERAGE/639717
files remaining: 44582 (127.36 files/s - predict 5.83 minutes remaining)
files remaining: 44382 (114.02 files/s - predict 6.49 minutes remaining)
[...]
files remaining: 3782 (0.99 files/s - predict 63.66 minutes remaining)
files remaining: 3582 (0.99 files/s - predict 60.18 minutes remaining)
files remaining: 3382 (0.99 files/s - predict 56.71 minutes remaining)
files remaining: 3182 (0.99 files/s - predict 53.33 minutes remaining)
$ date
Fri Jul 14 09:17:34 CEST 2023

The top output shows:

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
466502 foo       20   0    0.1g   0.0g   0.0g S   0.0   0.0   0:00.42 lcov
466534 foo       20   0    3.3g   0.4g   0.0g S   0.0   0.1 145:09.77 geninfo
188253 foo       20   0    3.3g   2.6g   0.0g D   2.3   0.7   0:25.43 geninfo
188419 foo       20   0    3.3g   2.6g   0.0g D   2.6   0.7   0:25.37 geninfo
188456 foo       20   0    3.3g   2.6g   0.0g D   2.3   0.7   0:25.31 geninfo
[+ ~50 others]

In the beginning of the run the geninfo processes used to take ~400MB in RES memory before completing in a few seconds, as time passes they seem to take longer and consume way more memory. Also, the X geninfo processes seem to be called with the same args as the lcov command, is that expected for these processing commands?

UID         PID   PPID  C STIME TTY          TIME CMD
foo      306717 466534  3 09:27 ?        00:00:00 geninfo COVERAGE/639717 --output-filename coverage.639717.info --build-directory SOURCES --gcov-tool /opt/bin/gcov --keep-going --rc geninfo_unexecuted_blocks=1 --exclude /opt/* --exclude /usr/* --substitute s#/proc/self/cwd/#SOURCES/#g --profile profile.json --parallel 0 --memory 0

The server activity seems okay, I cannot see any contention on I/O, CPU or memory.

I'll try to split in batches as you suggested to see if I can get it to complete.

I was not expecting this coverage processing operation to be so long and resource greedy, but I have no reference. Would it be possible for you to share data from jobs that you have seen successfully processed, like number of files / elapsed time / resources ? I would like to get an idea if I can expect this to complete in a few minutes or if hours will always be required.

henry2cox commented 1 year ago

Your data is somewhat unexpected - but I am not sure if it is a bug or just a large data set. Execution seems rather different than other projects that I'm familiar with. In particular, the small experiment you ran yesterday appeared to process 4K files in about 1.5 minutes. Naively, I would expect to see your 40K files take about 15 minutes - and specifically, the remaining 3K files we see in your list above to also take about another minute - but that is not what what we observe.

Would be good to understand why this happens - and to fix it, if it is a bug.

Not sure whether it matters or not but:

how many source files in your project/how many compilation units? This will be the same as the number of .gcno files.
do we have 40K .gcda files because you have (say) 10 individual data sets, each containing data for 4K compilation units? Or are there really 40K compilation units and only one test suite/one set of merged runtime coverage data?
you are using gcc 11
there is nothing particularly strange or funky about your project/your code - as far as we know.

It is possible that the number of compilation units is affecting performance - but I am not entirely sure why that would be. I have a speculation :-) - but I would rather not look foolish until we have more data.

I just re-ran an experiment to extract coverage data from the LLVM source code (llvm 11 + some local changes). I don't have data for any of the larger projects at hand - and probably could not share, even if I did (sorry).

2058 files
22 core/32Gb linux VM
180.79s total time (start to finish). (Ran it about 5 minutes ago and got 155.21s; in this case, I attribute the difference to inconsistent notwork and filesystem performance).
average time per file: 1.98s, stddev 0.81
effective parallelism: 14.29 (saw 16.11 in the other attempt)

Found LLVM gcov version 11.0.0, which emulates gcov version 4.2.0 Using intermediate gcov format Writing temporary data to /tmp/geninfo_datU75C Scanning . for .gcda files ... Found 2058 data files in . files:1838 elapsed:0.1m predict 0.7m to go: 44.99 files/s 0.12 s/file (interval:44.99 f/s 0.12 s/f) files:1618 elapsed:0.3m predict 1.1m to go: 24.89 files/s 0.29 s/file (interval:17.20 f/s 0.46 s/f) files:1398 elapsed:0.6m predict 1.2m to go: 19.22 files/s 0.47 s/file (interval:13.20 f/s 0.84 s/f) files:1178 elapsed:0.9m predict 1.2m to go: 16.10 files/s 0.68 s/file (interval:10.84 f/s 1.29 s/f) files:958 elapsed:1.3m predict 1.1m to go: 14.54 files/s 0.90 s/file (interval:10.47 f/s 1.79 s/f) files:738 elapsed:1.7m predict 0.9m to go: 13.26 files/s 1.09 s/file (interval:9.22 f/s 2.07 s/f) files:518 elapsed:2.1m predict 0.7m to go: 12.20 files/s 1.26 s/file (interval:8.22 f/s 2.29 s/f) Finished .info-file creation

Predicted completion time is computed from throughput seen to date. Also printing throughput seen in each print interval so that we can see whether it changing. Interval length is now set such that each thread is expected to process ~10 files. The above data shows the performance degrading - but the earlier one did not (and the last 2 intervals were actually slightly faster than any of the others).

There are some more experiments and some additional things we can do to try to figure out what is happening with your job. You may or may not want to share that data, though - and you may or may not want to contact (more) privately.

aallrd commented 1 year ago

Here are some details about the files I am trying to process. I have a SOURCES folder containing the source code of the application and the associated gcno files:

SOURCES/lib/awesome/file.cpp
SOURCES/output/lib/awesome/o/awesome/file.gcno

When executing my application, the folder COVERAGE/<PID> is created containing the gcda files:

COVERAGE/<PID>/output/lib/awesome/o/awesome/file.gcda

In both these folders there are:

C++ source files: 172021
gcno files: 52372
gcda files: 44782

All these gcda files were generated from a single execution of the application (single \<PID> folder) with dummy args (like calling --version).

The full lcov run started yesterday was finally able to complete in ~15h, the final full.coverage.info file is 423MB. I am attaching the reports here: anonymized-full-reports.tar.gz.

I took up your suggestion and divided the COVERAGE/<PID> directory into N COVERAGE/<PID>-N folders of 2000 files each. I ran the same lcov command as before, sequentially on each of these directories, the full run completed in ~13min. Merging the files took then 3min and the resulting merged.coverage.info file is 422MB.

  add=()
  for f in output/coverage.*.info; do
    add+=(--add "${f}")
  done
  lcov --output-file merged.coverage.info --ignore-errors format ${add[@]}

So it seems to match exactly what you described in https://github.com/linux-test-project/lcov/issues/228#issuecomment-1633938044. I guess it means there is some kind of issue in lcov with handling a lot of files in a single run.

henry2cox commented 1 year ago

Quick clarification:

when you fragment into ~20 chunks of 2000 elements each, did you process them sequentially (on one host) is 13 minutes total, OR
did you send the fragments off to N machines in the compute farm (to execute in parallel) - and got a result back in 13 minutes?

If the former case is what you observe, then it should be straightforward for the tool to fragment in exactly the same way - and improve performance quite markedly. If the latter: then an external script to dispatch to the farm is needed.

Either way: this does look like a bug that I am not entirely sure how to diagnose or fix. I would expect monolithic and fragmented sequential performance to be rather similar to each other.

When you get further down the road, I am curious what you see WRT coverage categorization, binning, etc. - but I suspect that you cannot tell me. I doubt that a single HTML report for 170K files is going to be terribly useful or usable, though. Presumably, the project consists of multiple subsystems and modules such that a split report for each of the various teams would be what is desired.

As an aside: if you name the output files to end in .gz ("anonomized-full.info.gz") - then the data will be compressed. That probably doesn't actually save you anything as your FS is likely compressed and de-duped (but 'quota' seems to look at the claimed size - not the actual storage).
Similarly, the HTML results can also be compressed - but as mentioned above, I doubt that those will be useful in any case.

aallrd commented 1 year ago

It was sequentially processing each batch on the same server:

SECONDS=0
mkdir -p "${OUTPUT_DIR}"
echo "START: $(date)"
for BATCH_DIR in ${COVERAGE_DIR}/${PID}-*; do
  BATCH_ID="${BATCH_DIR#"${COVERAGE_DIR}/${PID}-"}". 
  echo "BATCH ${BATCH_ID} STARTED: $(date)"
  (
    $(which gcov) --version
    lcov --version
    set -x
    time lcov \
      --parallel 0 \
      --capture \
      --gcov-tool $(which gcov) \
      --forget-test-names \
      --rc geninfo_unexecuted_blocks=1 \
      --output-file "${OUTPUT_DIR}/coverage.${PID}.${BATCH_ID}.info" \
      --build-directory "${SOURCE_DIR}" \
      --directory "${BATCH_DIR}" \
      --substitute "s#/proc/self/cwd/#${SOURCE_DIR}/#g" \
      --exclude '/opt/*' \
      --exclude '/usr/*' \
      --keep-going
    echo "END BATCH ${BATCH_ID}: $(date)"
  ) 2>&1 > "${OUTPUT_DIR}/lcov.${PID}.${BATCH_ID}.log" || {
    >&2 echo "Failed to process batch: ${BATCH_ID} [${BATCH_DIR}]"; return 1;
  }
  echo "BATCH ${BATCH_ID} ENDED: $(date)"
  done
echo "END: $(date)"
echo "Completed: $((SECONDS / 60))min $((SECONDS % 60))sec"

The mean batch duration was ~30sec (with --parallel 0 == 56 cores) and I had 23 batches of 2000 files to run. A few batches were 50% faster/slower but overall it was the same duration for all of them. Dispatching all these batches to X servers for parallel processing could make it faster, but 13min is good enough at the moment compared to 15hours :).

You are correct that there is still some down processing to categorize/split/assign the coverage data to make it usable. What mostly interests me (provided the gathered coverage data is usable), is to compare coverage between runs A & B and extract exclusive coverage for B compared to A.

henry2cox commented 1 year ago

To do A/B compare, differential coverage may be your friend. Especially if you have code changes and/or build environment changes between A and B. If source is unchanged and it is just a case of checking to see what B adds, then this is pretty straightforward. While there are some utilities to do that in the lcov release, I think that you will see some substantial usability issues if you are trying to process data 170K source files. You may also hit capacity and performance issues. Interesting set of problems, though.

henry2cox commented 1 year ago

I think that this issue is fixed in d006ca0d65e4c In my example, it cuts the overhead quite substantially - and I suspect yours will be the same. Could you give it a shot, and let me know what happens? Thanks

aallrd commented 12 months ago

I am using the Perl PAR Packager to compile the lcov binaries to make their deployment easier. I was able to successfully compile the project until 380e007fd2772884489c2a1a8f546ca4e5694f86, where it started to throw this error:

$ pp -M PAR -M Capture::Tiny -M DateTime -M Devel::Cover -M Digest::MD5 -M File::Spec -M JSON::XS -M Memory::Process -M Time::HiRes -x --xargs="--help" -o bin/geninfo /opt/lcov/bin/geninfo
# Use of runtime loader module Module::Runtime detected.  Results of static scanning may be incomplete.
# Use of runtime loader module Module::Implementation detected.  Results of static scanning may be incomplete.
Variable "$files" is not imported at /opt/lcov/lib/lcov/lcovutil.pm line 4141.
        (Did you mean &files instead?)
Variable "$files" is not imported at /opt/lcov/lib/lcov/lcovutil.pm line 4146.
        (Did you mean &files instead?)
Global symbol "$files" requires explicit package name at /opt/lcov/lib/lcov/lcovutil.pm line 4141.
Global symbol "$files" requires explicit package name at /opt/lcov/lib/lcov/lcovutil.pm line 4146.
BEGIN not safe after errors--compilation aborted at /opt/lcov/lib/lcov/lcovutil.pm line 5340.
Compilation failed in require at /opt/lcov/bin/geninfo line 94.
BEGIN failed--compilation aborted at /opt/lcov/bin/geninfo line 94.
SYSTEM ERROR in executing /opt/lcov/bin/geninfo --help: 65280 at /usr/local/share/perl5/Module/ScanDeps.pm line 1523.

Is there an obvious error here?

henry2cox commented 12 months ago

Sorry about that. Merge error on my part that I failed to notice. Should be fixed in bad6da5a1b8931f0.

aallrd commented 12 months ago

The compilation was successful after your fix. I used the same command as before, the output is massively spammed with this kind of lines:

geninfo: WARNING: Lost precision when incrementing 36427319439181856882913133562127226609923927494963553862218199854691873692373644088989484743590607784103430178799091712.000000 by 1
geninfo: WARNING: Lost precision when incrementing 36427319439181856882913133562127226609923927494963553862218199854691873692373644088989484743590607784103430178799091712.000000 by 1

I am not able to read much unfortunately, is there a way to make it less verbose?

henry2cox commented 12 months ago

That looks like the message we see in https://github.com/Perl/perl5/issues/18333 - but I am not sure where in the lcov code this is happening, or where that huge number is coming from. My example doesn't seem to do it - but mine is also many, many fewer files.

Can you add --debug to your command line? That will tell lcov to not remove the error location from the message...so at least we might see where it is coming from. Might provide a clue for how to fix it.

aallrd commented 12 months ago

It tells me it comes from:https://github.com/linux-test-project/lcov/blob/bad6da5a1b8931f0aab358eca3f2e93446b60936/lib/lcovutil.pm#L4110

geninfo: WARNING: Lost precision when incrementing 63519115537784056853898933660673437962600448.000000 by 1 at /tmp/par-6175746f656e67696e65/cache-1abb756504c6853a65ea726f3a43c25eab204841/inc/lib/lcovutil.pm line 4110.

henry2cox commented 12 months ago

That is very strange. This code is just keeping a count of the number of times a particular exclude pattern is used. That count is intended to be initialized to zero and then incremented as required. There is no way that we could have incremented that number of times - so the counter must have been corrupted somehow.
I don't know how.

Not what you want - but you cold remove your exclude patterns for the moment, see if this runs to completion correctly, and then (separately - then exclude them later.

Can you run this in a debugger, stop on that line - and see what type perl thinks that $p is? perl> p ref($p) perl> x $p

Which version of perl are you using?

aallrd commented 12 months ago

Please disregard the previous issues, I must have messed up my compilation/deployment. I bootstrapped a new development environment, re-compiled/re-deployed the binaries and the issue is gone.

$ perl --version
This is perl 5, version 38, subversion 0 (v5.38.0) built for x86_64-linux

I am now testing lcov at 8ac179accde71bd4b10bc6a0e016d2259bdd8f6d:

$ lcov --version
lcov: LCOV version 2.0-30.g8ac179a

Running the same command as previously, I still see the throughput going down:

$ lcov \
  --parallel 0 \
  --capture \
  --gcov-tool $(which gcov) \
  --forget-test-names \
  --rc geninfo_unexecuted_blocks=1 \
  --output-file coverage.413948.info \
  --build-directory SOURCES \
  --directory COVERAGE/413948 \
  --substitute "s#/proc/self/cwd/#SOURCES/#g" \
  --exclude '/opt/*' \
  --exclude '/usr/*' \
  --keep-going
Capturing coverage data from COVERAGE/413948
geninfo cmd: 'geninfo COVERAGE/413948 --output-filename coverage.413948.info --build-directory SOURCES --gcov-tool /opt/bin/gcov --keep-going --rc geninfo_unexecuted_blocks=1 --exclude /opt/* --exclude /usr/* --substitute s#/proc/self/cwd/#SOURCES/#g --parallel 0 --memory 0'
Found gcov version: 11.2.1
Using intermediate gcov format
Writing temporary data to /tmp/geninfo_dat6QrV
Scanning COVERAGE/413948 for .gcda files ...
Found 45375 data files in COVERAGE/413948
elapsed:0.0m: remaining:44975 files 5.6m: 133.68 files/s 0.11 s/file (interval:133.68 f/s 0.11 s/f)
elapsed:0.1m: remaining:44575 files 7.6m: 97.39 files/s 0.13 s/file (interval:76.60 f/s 0.15 s/f)
elapsed:0.2m: remaining:44175 files 8.9m: 83.11 files/s 0.14 s/file (interval:64.27 f/s 0.16 s/f)
elapsed:0.4m: remaining:43775 files 10.9m: 66.67 files/s 0.17 s/file (interval:41.85 f/s 0.25 s/f)
elapsed:0.6m: remaining:43375 files 12.8m: 56.34 files/s 0.20 s/file (interval:34.78 f/s 0.33 s/f)
elapsed:0.8m: remaining:42975 files 14.9m: 48.04 files/s 0.23 s/file (interval:27.67 f/s 0.36 s/f)
elapsed:1.1m: remaining:42575 files 17.0m: 41.69 files/s 0.25 s/file (interval:23.24 f/s 0.42 s/f)
elapsed:1.5m: remaining:42175 files 19.2m: 36.65 files/s 0.28 s/file (interval:19.85 f/s 0.46 s/f)
elapsed:1.9m: remaining:41775 files 21.7m: 32.13 files/s 0.31 s/file (interval:16.18 f/s 0.55 s/f)
elapsed:2.4m: remaining:41375 files 24.8m: 27.79 files/s 0.34 s/file (interval:12.54 f/s 0.63 s/f)
elapsed:2.9m: remaining:40975 files 27.4m: 24.91 files/s 0.36 s/file (interval:12.23 f/s 0.59 s/f)
elapsed:3.5m: remaining:40575 files 29.9m: 22.59 files/s 0.39 s/file (interval:11.16 f/s 0.61 s/f)
elapsed:4.2m: remaining:40175 files 32.7m: 20.48 files/s 0.41 s/file (interval:9.67 f/s 0.65 s/f)
elapsed:5.0m: remaining:39775 files 35.5m: 18.68 files/s 0.43 s/file (interval:8.72 f/s 0.71 s/f)
elapsed:5.8m: remaining:39375 files 38.2m: 17.16 files/s 0.46 s/file (interval:8.02 f/s 0.86 s/f)

I'll let it run until the end to see how it goes.

henry2cox commented 12 months ago

In a word: damn. Not to mention: somewhat disappointing. (Yeah..that is 3 words.) But I may know what is happening. Would need your .json profile from the current job to verify (either attach or email directly).

Glad that the really large integer went away, though. That was puzzling.

I think you are the pioneer WRTis trying the tools on a very large dataset...and we all know what happens to pioneers :-) I do think that, if we can get your example to work with reasonable efficiency, then pretty much everyone will be handled.

I'm concerned that HTML generation is going to be the next bottleneck. OTOH: you probably don't want a giant unified HTML in any case - as it is not likely to be helpful to most users. May also be a security issue if not all users are permitted to see all code.

aallrd commented 12 months ago

Sorry for the late reply, I was facing the lost precision issue again. I have pinpointed its origin to c93f41d7a5a616da2df22254f26e5cc04c9e0787, if I use lcov just before this commit I don't have this warning. I have removed the exclude patterns from my command for the moment to be able to move forward and generate the profile report. Thank you for the continuous help. 👍

henry2cox commented 12 months ago

Yeah - that's the commit that added counters so users could verify whether their patterns were correct or not (and yes - it was motivated by an issue that came up internally, and took a while to diagnose).

Could you try backing down to an earlier version of Perl - to see if that issue is introduced by some perl change. I have never seen it (which is no guarantee of anything, of course) - and I can't reproduce it.

A lot of internal users use 5.22.0 (don't ask).

Another thing to do is to add print("pointer is ", ref($p->[-1]), "\n"); before the increment - so we can see what type perl thinks it is. Similarly, could print ref($p as well (but that is extremely likely to just show that it is an array.

aallrd commented 11 months ago

I added the print("pointer is ", ref($p->[-1]), "\n"); line but nothing was printed on the output, besides the precision warning. I made sure the line was indeed there in lib/lcovutil.pm at the correct location, I don't know what's wrong. I'll try to downgrade the Perl version as you suggested but this could take some time to setup.

In the meantime, I managed to get the profile from 8ac179accde71bd4b10bc6a0e016d2259bdd8f6d by removing the exclusion patterns. I still observe the same throughput decrease overtime, and I have these messages after a while:

elapsed:339.7m: remaining:11375 files 113.6m: 1.67 files/s 0.80 s/file (interval:0.79 f/s 1.44 s/f)
  geninfo: WARNING: Error: geninfo: error 'child 142981 returned non-zero code 9: ignoring data in COVERAGE/output/lib/awesome/o/awesome/file.gcda' waiting for child (try removing the '--parallel' option
  Processing COVERAGE/output/lib/awesome/o/awesome/file.gcda in child 142981
  geninfo: WARNING: Error: unable to create link DNpvDzOverride_serializer.gcda: Not a tty!
    (use "geninfo --ignore-errors parallel ..." to bypass this error)

I am attaching the reports here: anonymized-8ac179-reports.tar.gz

aallrd commented 11 months ago

I am trying to manually debug lcov using the interactive perl debugger but whatever I try I cannot make it break where we need it. I have edited the lcov file shebang to add the debug option #!/usr/bin/perl -d, then I am calling it as usual:

$ lcov ...
Loading DB routines from perl5db.pl version 1.77
Editor support available.
Enter h or 'h h' for help, or 'man perldebug' for more help.
main::(/opt/lcov/bin/lcov:104): our $gcov_dir;
  DB<1> f lcovutil
Choosing /opt/lcov/lib/lcov/lcovutil.pm matching 'lcovutil':
  DB<2> b 4110
  DB<3> L
/opt/lcov/lib/lcov/lcovutil.pm:
 4110:              ++$p->[-1];
    break if (1)
  DB<3> c
<NOT BREAKING>

I added a $DB::single = 1; line just before we want it to stop without any effect either. Can you please share how you are able to set an interactive breakpoint in lcov ?

henry2cox commented 11 months ago

Sorry for the delays. To debug, the easiest (for me) is just to run the corresponding script via perl -d .... So: perl -d /path/to/geninfo -o foo.info .... Perl knows how to debug forking processes - but you do need to be running in an xterm.

I also have a somewhat improved chunky geninfo implementation which I will push shortly. It improves performance (at least for local examples) by making a better tradeoff between work done in each child and queueing overhead. I'm curious to see what you observe - as there may be more improvements.

I also wanted to clarify a slightly different question:

in your project, we are attempting to process about 40K GCDA files.
How many (unique) source files are we looking at?
In the local example I am looking at right now: we have about 1000 source files and roughly 10 suites - so ~10K GCDA files. Things will scale differently as the number of GCDA files increase (due to parallel suites) vs the number of source files increase (as the project gets bigger).

aallrd commented 11 months ago

From what I observed:

Directly calling geninfo with and without the --parallel flag: I don't have the precision warning
Directly calling lcov without the --parallel flag: I don't have the precision warning
- I have the debug print for $p showing on the stdout (nothing useful)
Directly calling lcov with the --parallel flag: I have the precision warning
- I don't have any debug print for $p showing on the stdout

I am running this test from a container, even with TERM=xterm set it seems that Perl is not happy and does not know how to create a new TTY to debug the forking processes.

My use case is a single run (test suite) of our application, something like ./app --version. I don't know how many source files are involved, it's a big monolithic C++ application so probably a lot of them. As a reminder, here are the files I have to process the coverage:

C++ source files: 172021
gcno files: 52372
gcda files: 44782

henry2cox commented 11 months ago

forgot to mention that I pushed 40f4b046fe23e which contains a geninfo performance improvement (or, at least makes some of our examples go noticeably faster). There is an lcovrc knob that you can adjust to try to improve on the guesses that that the tool makes. See the lcovrc and geninfo man pages - and/or ask, when things are unclear.

aallrd commented 11 months ago

I quickly manage to run lcov at 8b1aa600b3ed94cad6ed4f15a300ceac8849793f today:

The throughput is going up, and it does seem noticeably faster (see below)
However the output file is now empty
An error is reported: geninfo: ERROR: Can't locate object method "add_comments" via package "TraceFile"

BATCH 000 STARTED: Tue Jul 25 16:52:48 UTC 2023
$ lcov --version
lcov: LCOV version 2.0-32.g8b1aa60
$ lcov [...]
Found gcov version: 11.2.1
Using intermediate gcov format
Writing temporary data to /tmp/geninfo_datLn6V
Scanning BATCHES/000 for .gcda files ...
Found 45376 data files in BATCHES/000
using: chunkSize: 3024, nchunks:16, intervalLength:2268
elapsed:3.3m: remaining:42352 files 46.5m: 15.17 files/s 0.06 s/file (interval:15.17 f/s 0.06 s/f)
elapsed:3.4m: remaining:39328 files 21.8m: 30.02 files/s 0.06 s/file (interval:1446.82 f/s 0.06 s/f)
elapsed:3.4m: remaining:36304 files 13.6m: 44.48 files/s 0.06 s/file (interval:1210.24 f/s 0.07 s/f)
elapsed:3.4m: remaining:33280 files 9.5m: 58.53 files/s 0.07 s/file (interval:1121.86 f/s 0.07 s/f)
geninfo: ERROR: Can't locate object method "add_comments" via package "TraceFile"
elapsed:3.5m: remaining:30256 files 7.0m: 72.11 files/s 0.07 s/file (interval:993.86 f/s 0.07 s/f)
elapsed:3.5m: remaining:27232 files 5.3m: 85.19 files/s 0.07 s/file (interval:918.75 f/s 0.07 s/f)
elapsed:3.6m: remaining:24208 files 4.1m: 97.86 files/s 0.07 s/file (interval:908.24 f/s 0.07 s/f)
elapsed:3.7m: remaining:21184 files 3.2m: 110.07 files/s 0.07 s/file (interval:871.12 f/s 0.07 s/f)
elapsed:3.7m: remaining:18160 files 2.5m: 121.87 files/s 0.07 s/file (interval:857.45 f/s 0.07 s/f)
elapsed:3.8m: remaining:15136 files 1.9m: 133.20 files/s 0.07 s/file (interval:813.31 f/s 0.07 s/f)
elapsed:3.9m: remaining:12112 files 1.4m: 143.88 files/s 0.07 s/file (interval:726.82 f/s 0.07 s/f)
elapsed:3.9m: remaining:9088 files 1.0m: 154.26 files/s 0.07 s/file (interval:747.98 f/s 0.07 s/f)
elapsed:8.7m: remaining:6048 files 1.3m: 75.28 files/s 0.07 s/file (interval:10.59 f/s 0.11 s/f)
elapsed:10.8m: remaining:3024 files 0.8m: 65.48 files/s 0.08 s/file (interval:24.31 f/s 0.15 s/f)
elapsed:12.9m: remaining:0 files 0.0m: 58.61 files/s 0.08 s/file (interval:23.74 f/s 0.19 s/f)
Finished GDCA file processing
$ du -skh OUTPUT/coverage.000.info
0   OUTPUT/coverage.000.info
END BATCH 000: Tue Jul 25 17:07:53 UTC 2023

What specific parameter in lcovrc did you have in mind for tweaking ?

aallrd commented 11 months ago

Another quick question not related to this performance topic: I am using --exclude '/opt/*' to exclude paths starting by /opt but it seems like it also matches paths like /whatever/opt/file that I want to keep. I tried using ^/opt/* but it did not work, do you have a suggestion?

henry2cox commented 11 months ago

Good news: throughput looks OK. We can claim success. Bad news: another merge issue on my part. Tha's why you see no output. Will push the rest of that feature - or you can just delete the offending line in the meantime.

WRT parameter tweaking: --rc geninfo_chunk_size int[%] If given an integer: process that number of files per chunk. If given a percentage: use that percentage of total files per chunk. "chunk" is the number of GCDA files processed sequentially in the child before returning the result to the parent. A larger number reduces the overhead of merging back to parent, but potentially makes the child run more slowly because it is holding on to more data and also increases vulnerability to a particularly long-running chunk which merrily processes alone for a long time after all the other children have finished (i.e., so we see reduced parallelism). The default is 80% * total_gcda_files / number_of_cores. This was picked kind of randomly after running a bunch of experiments with different values, and eyeballing an apparent minimum.

WRT the second question: includes/excludes are glob patterns - not regexps - which is why carat didn't work. You can specify both --include and --exclude. If some path matches both, then 'exclude' wins. Sadly: that is exactly backward of what you want in this case. A hacky workaround is to aggregate your result - but including only /whatever/opt - into temp.info, then aggregate again - excluding /opt - into temp2.info, then merge temp1 and temp2:

lcov -o temp1.info --include '/whatever/opt/' -a data.info lcov -o temp2.info --exclude '/opt/ -a data.info lcov -o hacky.info -o temp1.info temp2.info

A slightly different hack is to substitute /whatever/opt to something else, then exclude, then substitute back. Two steps, rather than 3.

henry2cox commented 11 months ago

Sorry for the delay. Network issues. I pushed the rest of the feature - so the error you saw should go away.

One more thought WRT exclusion: you could use the --substitute regexp option to make the name of the /opt/ stuff that you want to remove something recognizable that you can then exclude. For example: --substitute 's#^/opt/#REMOVE_ME/#' --exclude 'REMOVE_ME/*'

aallrd commented 11 months ago

I have successfully tested lcov @ 2694692d420fa258cc32c574dccd7e3943d3a910 and I can confirm the success claim :) I used your suggestion --substitute 's#^/opt/#REMOVE_ME/#' --exclude 'REMOVE_ME/*', it also worked as expected. I'll try to run more tests tweaking --rc geninfo_chunk_size int[%] to check its impact as well. Thank you for your help and the fixes !

henry2cox commented 11 months ago

Glad to hear that it worked. I suspect that the bigger challenge will be integrating into your development flow and getting users to adopt.

For curiousity: how long did the capture job run, and how big is the generated .info file? If you happen to have captured the profile: how long did the 'filter' step take?

henry2cox commented 11 months ago

I pushed another small performance enhancement in 5c788308a99. In some local examples, it saves about 50% of the 'filter' time; this is typically a small percentage of overall runtime - but is noticeable in our environment. There are also some RC flags that you can use to turn the feature off or on (on by default) or to tweak parameters to better reflect your environment.

I think that we can close this issue now - unless you see any other problems that I have failed to spot.

Thanks for finding the issue, and for performing the various experiments over the last couple of weeks.

Henry

henry2cox commented 10 months ago

No updates and no additional data on this issue in a few weeks - so I am guessing that this is addressed to everybody's satisfaction. If so: please go ahead and close this issue. If not: please describe the problems you observe - and ideally include either a testcase, profile data, or both. Thanks Henry

aallrd commented 10 months ago

Hello, Sorry for the (very) late reply, I was off in holidays :) I tested https://github.com/linux-test-project/lcov/commit/5c788308a993bd869e705e6bdc98960f4cf5cd75 and noticed an improvement, thank you for that. I'll close the issue, thank you again for the support !

linux-test-project / lcov

Parallel processing throughput keeps decreasing #228