Extreme memory usage when running it on the linux kernel repo with cliff.toml from this project

Byron commented 3 years ago

Describe the bug

When running it on https://github.com/torvalds/linux with the cliff.toml from this repository, the git-cliff process takes a long time and consumes more and more memory. I had to stop it at 12GB.

To Reproduce

Steps to reproduce the behavior:

git clone https://github.com/torvalds/linux
cp cliff.toml ./linux/
cd linux
git cliff

Expected behavior

A log is produced in reasonable time.

System (please complete the following information):

Ran f1b495d7b1aeb016911150faa0d49f847cc7b17c on macOS with 8GB of RAM and M1

orhun commented 3 years ago

Thanks for reporting this.

It turns out this high memory usage happens at the following line:

https://github.com/orhun/git-cliff/blob/2b8b4d3535f29231e05c3572e919634b9af907b6/git-cliff/src/lib.rs#L100

Which calls this function:

https://github.com/orhun/git-cliff/blob/2b8b4d3535f29231e05c3572e919634b9af907b6/git-cliff-core/src/repo.rs#L39-L51

In conclusion I'd say this is most likely caused by git2.

So this issue basically boils down to:

use git2::{Commit, Repository, Sort};
use std::env;

fn main() {
    // Path to the repository, taken from the environment.
    let repo_path = env::var("LINUX_KERNEL_REPO").expect("repo path is not specified");
    let repo = Repository::open(repo_path).expect("cannot open repo");
    // Walk the history starting at HEAD, sorted by time and topology.
    let mut revwalk = repo.revwalk().unwrap();
    revwalk.set_sorting(Sort::TIME | Sort::TOPOLOGICAL).unwrap();
    revwalk.push_head().unwrap();
    // Look up every commit and keep all of them in memory at once.
    let commits: Vec<Commit> = revwalk
        .filter_map(|id| id.ok())
        .filter_map(|id| repo.find_commit(id).ok())
        .collect();
    println!("{}", commits.len());
}

To reproduce:

cargo new --bin repro && cd repro/
# add `git2 = "0.13.21"` to [dependencies] in Cargo.toml
# save the code above as src/main.rs
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux
LINUX_KERNEL_REPO="$(pwd)/linux" cargo run

I think you should also report this to git2; there is not much I can do here.
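As an aside, the reproduction above collects every commit into a Vec before counting them, so all of them are alive at the same time. The sketch below contrasts that with consuming the iterator lazily. It uses only the standard library, and `FakeCommit` and `walk` are hypothetical stand-ins for `git2::Commit` and the revwalk, not real git2 items. It illustrates the general pattern only and is not a description of any change made in git-cliff.

```rust
/// Hypothetical stand-in for `git2::Commit`; not a real git2 type.
struct FakeCommit {
    message: String,
}

/// Hypothetical stand-in for the revwalk: lazily yields `n` fake commits.
fn walk(n: u64) -> impl Iterator<Item = FakeCommit> {
    (0..n).map(|id| FakeCommit {
        message: format!("commit {}", id),
    })
}

/// Materializes every commit first, as the reproduction does,
/// then sums the message lengths.
fn total_bytes_collected(n: u64) -> usize {
    let commits: Vec<FakeCommit> = walk(n).collect();
    commits.iter().map(|c| c.message.len()).sum()
}

/// Sums the message lengths while streaming: each commit is dropped
/// as soon as its length has been read.
fn total_bytes_streamed(n: u64) -> usize {
    walk(n).map(|c| c.message.len()).sum()
}

fn main() {
    let n = 100_000;
    // Both strategies compute the same value; only peak memory differs.
    assert_eq!(total_bytes_collected(n), total_bytes_streamed(n));
    println!("{}", total_bytes_streamed(n));
}
```

Whether this difference matters for the memory growth reported here is not established by the sketch; it only shows that the result does not require holding every item at once.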

Byron commented 3 years ago

Thanks for investigating this.

It's interesting that running the above I see this:

➜  git-cliff-core git:(main) LINUX_KERNEL_REPO=~/dev/github.com/torvalds/linux/.git /usr/bin/time -lp cargo run --release --example reproduce
    Finished release [optimized] target(s) in 0.09s
     Running `/Users/byron/dev/github.com/orhun/git-cliff/target/release/examples/reproduce`
1015172
real 18.35
user 17.55
sys 0.61
          2273345536  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              199192  page reclaims
                 645  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   1  signals received
                 589  voluntary context switches
                1596  involuntary context switches
        176305556133  instructions retired
         57509393673  cycles elapsed
          1513012480  peak memory footprint

Maybe the real memory explosion happens elsewhere when processing more than a million commits.
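To put the figures above in perspective, here is a small conversion sketch. It assumes that `time -l` on macOS reports "maximum resident set size" and "peak memory footprint" in bytes; the helper name `bytes_to_gib` is made up for this example.

```rust
/// Converts a byte count to gibibytes (1 GiB = 1024^3 bytes).
fn bytes_to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    // Values copied from the `time -lp` output above
    // (assumed to be in bytes).
    let max_rss: u64 = 2_273_345_536;
    let peak_footprint: u64 = 1_513_012_480;

    println!("maximum resident set size: {:.2} GiB", bytes_to_gib(max_rss));
    println!("peak memory footprint:     {:.2} GiB", bytes_to_gib(peak_footprint));
}
```

Under that assumption the reproduction peaks at roughly 2.1 GiB resident, far below the 12GB at which the original git-cliff run was stopped.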

orhun commented 3 years ago

> It's interesting that running the above I see this:

Ah, I just got a similar result, though it took longer, I guess due to my lower specs.

> Maybe the real memory explosion happens elsewhere when processing more than a million commits.

I'm re-investigating this issue. 👍🏼

alerque commented 3 years ago

Sadly libgit2 is missing some significant optimizations that the git CLI tooling has. I've run into resource issues like this on much smaller repos than the Linux kernel where the CLI tooling flies right along and the equivalent calls to the library sink the ship.

orhun commented 3 years ago

I pushed f85974761be11e0ecc85575bc4b6d5a02e438fd2 and it should improve the performance dramatically. In fact, I was able to generate a changelog from the linux kernel repository this time:

$ cargo run --release -- -r ~/gh/linux/ -c cliff.toml -o LINUX_CHANGELOG

results in:

# Changelog
All notable changes to this project will be documented in this file.

## [unreleased]

### ALSA

- Pcm: Fix mmap breakage without explicit buffer setup
- Hda/realtek: fix mute/micmute LEDs for HP ProBook 650 G8 Notebook PC

### MAINTAINERS

- Update Vineet's email address
- Fix Microchip CAN BUS Analyzer Tool entry typo
- Switch to my OMP email for Renesas Ethernet drivers

### Security

- Igmp: fix data-race in igmp_ifc_timer_expire()

[...]

Can you try it out to see if it's any better?

Byron commented 3 years ago

Fantastic, the fix is probably one of the most effective one-line changes I have ever seen!

Here is the tail of my cliff run on the linux kernel:

real 31.97
user 25.32
sys 3.68
          2934489088  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
             1154355  page reclaims
                  59  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
               25376  voluntary context switches
               16458  involuntary context switches
            32094006  instructions retired
            20068389  cycles elapsed
             2786048  peak memory footprint

I think that's quite alright :).

In case you are interested in being even faster, here is another tool to estimate the hours it would take to implement the commits of a repository.

➜  linux git:(master) ✗ /usr/bin/time -lp gix tools estimate-hours
 9:49:55 Traverse commit graph done 1.0M commits in 7.55s (134.5k commits/s)
total hours: 979612.44
total 8h days: 122451.55
total commits = 1015172
total authors: 28234
total unique authors: 21359 (24.35% duplication)
 9:49:56                  find Extracted and organized data from 1015172 commits in 807.375125ms (1257373 commits/s)
real 8.45
user 8.21
sys 1.13
          1743454208  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
              117714  page reclaims
               11193  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   3  voluntary context switches
                9337  involuntary context switches
         54347066220  instructions retired
         28594401548  cycles elapsed
           976183360  peak memory footprint
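The summary numbers in that output can be cross-checked with a few lines of arithmetic. This is only a sanity check of the log: the function names are made up for this example, and the duplication formula, (total - unique) / total, is an assumption that happens to match the reported percentage rather than something taken from the tool's source.

```rust
/// Traversal throughput in commits per second.
fn commits_per_second(commits: u64, seconds: f64) -> f64 {
    commits as f64 / seconds
}

/// Share of author entries that are duplicates, in percent.
/// Assumed formula: (total - unique) / total.
fn duplication_percent(total: u64, unique: u64) -> f64 {
    (total - unique) as f64 / total as f64 * 100.0
}

/// Converts hours to 8-hour working days.
fn eight_hour_days(hours: f64) -> f64 {
    hours / 8.0
}

fn main() {
    // All inputs are copied from the output above.
    println!("{:.1}k commits/s", commits_per_second(1_015_172, 7.55) / 1000.0);
    println!("{:.2}% duplication", duplication_percent(28_234, 21_359));
    println!("{:.2} 8h days", eight_hour_days(979_612.44));
}
```

The throughput and duplication values agree with the reported 134.5k commits/s and 24.35%, and the day count lands within a rounding step of the reported 122451.55.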
