Closed Byron closed 3 years ago
Thanks for reporting this.
It turns out this high memory usage happens at the following line:
Which calls this function:
In conclusion I'd say this is most likely caused by git2.
So this issue basically boils down to:
use git2::{Commit, Repository, Sort};
use std::env;

fn main() {
    let repo_path = env::var("LINUX_KERNEL_REPO").expect("repo path is not specified");
    let repo = Repository::open(repo_path).expect("cannot open repo");
    let mut revwalk = repo.revwalk().unwrap();
    revwalk.set_sorting(Sort::TIME | Sort::TOPOLOGICAL).unwrap();
    revwalk.push_head().unwrap();
    let commits: Vec<Commit> = revwalk
        .filter_map(|id| id.ok())
        .filter_map(|id| repo.find_commit(id).ok())
        .collect();
    println!("{}", commits.len());
}
To reproduce:
cargo new --bin repro && cd repro/
# add `git2 = "0.13.21"` to [dependencies] in Cargo.toml
# save the code above as src/main.rs
git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git linux
LINUX_KERNEL_REPO="$(pwd)/linux" cargo run
I think you should also report this to git2; there is not much I can do here.
Thanks for investigating this.
It's interesting that running the above I see this:
➜ git-cliff-core git:(main) LINUX_KERNEL_REPO=~/dev/github.com/torvalds/linux/.git /usr/bin/time -lp cargo run --release --example reproduce
Finished release [optimized] target(s) in 0.09s
Running `/Users/byron/dev/github.com/orhun/git-cliff/target/release/examples/reproduce`
1015172
real 18.35
user 17.55
sys 0.61
2273345536 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
199192 page reclaims
645 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
1 signals received
589 voluntary context switches
1596 involuntary context switches
176305556133 instructions retired
57509393673 cycles elapsed
1513012480 peak memory footprint
Maybe the real memory explosion happens elsewhere when processing more than a million commits.
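For scale, the figures above can be converted into GiB. This is only a minimal sketch: it assumes that `time -l` on macOS reports `maximum resident set size` and `peak memory footprint` in bytes, and the helper name `bytes_to_gib` is made up for illustration.

```rust
/// Convert a byte count to GiB (2^30 bytes).
/// Hypothetical helper, not part of git-cliff or git2.
fn bytes_to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1024.0 * 1024.0 * 1024.0)
}

fn main() {
    // Values copied from the `time -l` output above (assumed to be bytes).
    let max_rss: u64 = 2_273_345_536;
    let peak_footprint: u64 = 1_513_012_480;

    println!("max RSS:        {:.2} GiB", bytes_to_gib(max_rss));
    println!("peak footprint: {:.2} GiB", bytes_to_gib(peak_footprint));
}
```

Under that assumption the reproduction peaks at roughly 2 GiB resident, well below the 12GB mentioned in the original report.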
> It's interesting that running the above I see this:
Ah, I just got a similar result, though it took longer, probably because of my lower-spec machine.
> Maybe the real memory explosion happens elsewhere when processing more than a million commits.
I'm re-investigating this issue. 👍🏼
Sadly, libgit2 is missing some significant optimizations that the git CLI tooling has. I've run into resource issues like this on much smaller repos than the Linux kernel, where the CLI tooling flies right along and the equivalent calls to the library sink the ship.
I pushed f85974761be11e0ecc85575bc4b6d5a02e438fd2 and it should improve performance dramatically. In fact, I was able to generate a changelog from the Linux kernel repository this time:
$ cargo run --release -- -r ~/gh/linux/ -c cliff.toml -o LINUX_CHANGELOG
results in:
# Changelog
All notable changes to this project will be documented in this file.
## [unreleased]
### ALSA
- Pcm: Fix mmap breakage without explicit buffer setup
- Hda/realtek: fix mute/micmute LEDs for HP ProBook 650 G8 Notebook PC
### MAINTAINERS
- Update Vineet's email address
- Fix Microchip CAN BUS Analyzer Tool entry typo
- Switch to my OMP email for Renesas Ethernet drivers
### Security
- Igmp: fix data-race in igmp_ifc_timer_expire()
[...]
Can you try it out to see if it's any better?
Fantastic, the fix is probably one of the most effective one-line changes I have ever seen!
Here is the tail of my cliff run on the Linux kernel:
real 31.97
user 25.32
sys 3.68
2934489088 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
1154355 page reclaims
59 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
25376 voluntary context switches
16458 involuntary context switches
32094006 instructions retired
20068389 cycles elapsed
2786048 peak memory footprint
I think that's quite alright :).
In case you are interested in being even faster, here is another tool to estimate the hours it would take to implement the commits of a repository.
➜ linux git:(master) ✗ /usr/bin/time -lp gix tools estimate-hours
9:49:55 Traverse commit graph done 1.0M commits in 7.55s (134.5k commits/s)
total hours: 979612.44
total 8h days: 122451.55
total commits = 1015172
total authors: 28234
total unique authors: 21359 (24.35% duplication)
9:49:56 find Extracted and organized data from 1015172 commits in 807.375125ms (1257373 commits/s)
real 8.45
user 8.21
sys 1.13
1743454208 maximum resident set size
0 average shared memory size
0 average unshared data size
0 average unshared stack size
117714 page reclaims
11193 page faults
0 swaps
0 block input operations
0 block output operations
0 messages sent
0 messages received
0 signals received
3 voluntary context switches
9337 involuntary context switches
54347066220 instructions retired
28594401548 cycles elapsed
976183360 peak memory footprint
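As a rough sanity check on the numbers in this thread, throughput can be computed from the commit count and the elapsed times. This is a sketch only: the helper `commits_per_sec` is hypothetical, and the two runs do different work (the git2 reproduction also loads every commit object and its wall time includes `cargo run` overhead), so the comparison is indicative at best.

```rust
/// Commits processed per second of elapsed time.
/// Hypothetical helper for illustration only.
fn commits_per_sec(commits: u64, seconds: f64) -> f64 {
    commits as f64 / seconds
}

fn main() {
    let commits: u64 = 1_015_172; // commit count reported by both tools

    // 18.35 s: `real` time of the git2 reproduction earlier in the thread.
    let git2_rate = commits_per_sec(commits, 18.35);
    // 7.55 s: commit graph traversal time reported by gix above.
    let gix_rate = commits_per_sec(commits, 7.55);

    println!("git2 repro:    {:.1}k commits/s", git2_rate / 1000.0);
    println!("gix traversal: {:.1}k commits/s", gix_rate / 1000.0);
}
```

The second figure agrees with the 134.5k commits/s that gix prints for its traversal step.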
Describe the bug
When running it on https://github.com/torvalds/linux with the cliff.toml from this repository, the git-cliff process will take a lot of time and consume more and more memory. I had to stop it at 12GB.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
A log is produced in reasonable time.
System (please complete the following information):
Ran f1b495d7b1aeb016911150faa0d49f847cc7b17c on macOS with 8GB of RAM and M1