rust-lang / git2-rs

libgit2 bindings for Rust
https://docs.rs/git2
Apache License 2.0

Is there any way to get last commit of a certain file? #588

Open kvzn opened 4 years ago

kvzn commented 4 years ago

It should be something like the command git log --follow FILENAME. revwalk might work, but it would require a lot of computation. Is there another way? Thank you!

alexcrichton commented 4 years ago

This might be best to ask libgit2 itself since this library only wraps libgit2. I'm not personally familiar myself with an API to do this, but I don't have an encyclopedic knowledge of the API.

extrawurst commented 4 years ago

@kevinzheng I was looking for something similar, and it turns out revwalk is the way to go. It's how TortoiseGit does it as well: https://github.com/TortoiseGit/TortoiseGit/blob/master/src/TortoiseShell/GITPropertyPage.cpp#L369

Here is a good reference issue in libgit2: https://github.com/libgit2/libgit2/issues/495

extrawurst commented 4 years ago

@kevinzheng although I am kind of intrigued to benchmark this approach against the blame functions (https://libgit2.org/libgit2/#HEAD/type/git_blame): each blame_hunk contains the commit and a git_signature, which in turn contains a git_time.

kvzn commented 4 years ago

@extrawurst @alexcrichton would you please take a look at my implementation? It seems to work, but I haven't tested the performance, and I don't know how to handle commits with multiple parents. Thank you!

#[derive(Debug, Deserialize, Serialize, PartialEq, Clone)]
pub struct Commit {
    pub commit_id: String,
    pub message: String,
    pub time: NaiveDateTime,
    pub author: Signature,
    pub committer: Signature,
}

pub fn last_commit_of_file_or_dir(
    repo: &Repository,
    file_path: &str,
    from_commit_id: Option<&str>,
) -> Result<crate::beans::Commit, AppError> {
    let mut revwalk = repo.revwalk()?;
    revwalk.set_sorting(git2::Sort::TIME)?;

    match from_commit_id {
        Some(from_cid) => match Oid::from_str(from_cid) {
            Ok(oid) => revwalk.push(oid)?,
            Err(e) => return Err(AppError::Git2Error(e)),
        },
        None => revwalk.push_head()?,
    }

    for oid in revwalk {
        let cmt = repo.find_commit(oid?)?;
        let tree = cmt.tree()?;

        // Diff against the first parent only; a root commit has no parent,
        // so it is diffed against the empty tree (old_tree = None).
        // TODO: multiple parents???
        let old_tree = if cmt.parent_count() > 0 {
            Some(cmt.parent(0)?.tree()?)
        } else {
            None
        };

        let mut opts = DiffOptions::new();
        let diff = repo.diff_tree_to_tree(old_tree.as_ref(), Some(&tree), Some(&mut opts))?;

        // Path::starts_with compares whole components, so this matches the
        // file itself as well as anything under a directory of that name.
        let contains = diff.deltas().any(|dd| {
            dd.new_file()
                .path()
                .map_or(false, |p| p.starts_with(file_path))
        });

        if contains {
            return git2_commit_to_our_commit(&cmt);
        }
    }
    return Err(AppError::CommandError(format!(
        "Failed to get last commit of file {}!",
        &file_path
    )));
}

fn git2_commit_to_our_commit(commit: &git2::Commit) -> Result<crate::beans::Commit, AppError> {
    let message = commit.message().unwrap_or("").to_string();

    let author = crate::beans::Signature {
        user_id: None,
        name: commit.author().name().unwrap_or("".as_ref()).to_string(),
        email: commit.author().email().unwrap_or("".as_ref()).to_string(),
    };

    let committer = crate::beans::Signature {
        user_id: None,
        name: commit.committer().name().unwrap_or("".as_ref()).to_string(),
        email: commit
            .committer()
            .email()
            .unwrap_or("".as_ref())
            .to_string(),
    };

    let time = git2_time_to_chrono_time(commit.time());

    Ok(crate::beans::Commit {
        commit_id: commit.id().to_string(),
        message,
        time,
        committer,
        author,
    })
}
Shnatsel commented 3 years ago

It appears that this is a widely requested feature - nearly every language wrapper has a feature request for it - e.g. https://github.com/libgit2/pygit2/issues/231. However, it's not implemented in git2 - here's the upstream feature request: https://github.com/libgit2/libgit2/issues/495.

Someone has contributed a custom implementation for the C# bindings, although I haven't looked at it in detail: https://github.com/libgit2/libgit2sharp/pull/963

I've rolled my own implementation, but it reports different timestamps compared to git log for half of the files in the repo I care about.

For ease of testing I list the timestamps for all the files that ever existed in the repository, rather than attempting to filter further. Here's my code:

// Copyright 2021 Google, inc.
// SPDX-License-identifier: Apache-2.0

use std::{collections::HashMap, path::PathBuf};
use git2::{Commit, Repository, Tree, Error};

fn main() -> Result<(), Error> {
    let mut mtimes: HashMap<PathBuf, i64> = HashMap::new();
    let repo = Repository::open(".")?;
    let mut revwalk = repo.revwalk()?;
    revwalk.set_sorting(git2::Sort::TIME)?;
    revwalk.push_head()?;
    let mut newer_commit: Option<Commit> = None;
    let mut newer_commit_tree: Option<Tree> = None;
    for commit_id in revwalk {
        let commit_id = commit_id?;
        let commit = repo.find_commit(commit_id)?;
        if commit.parent_count() > 1 {
            // ignore merge commits because they touch lots of files
            // without any of them being actually modified
            continue;
        }
        let tree = commit.tree()?;
        // check if this is not the very first commit, then we have nothing to diff
        if let Some(newer_commit_tree) = newer_commit_tree {
            let diff = repo.diff_tree_to_tree(Some(&tree), Some(&newer_commit_tree), None)?;
            for delta in diff.deltas() {
                let file_path = delta.new_file().path().unwrap();
                let file_mod_time = newer_commit.as_ref().unwrap().time();
                let unix_time = file_mod_time.seconds();
                mtimes.entry(file_path.to_owned()).or_insert(unix_time);
            }
        }
        newer_commit = Some(commit);
        newer_commit_tree = Some(tree);
    }
    for (path, time) in mtimes.iter() {
        println!("{:?}: {}", path, time);
    }
    Ok(())
}

Here's a (slower) reference Bash implementation using git log that outputs the data in the same format for ease of comparison:

#!/bin/bash
git ls-files | while IFS= read -r FILENAME; do
    TIME=$( git log -1 --format="%ct" -- "$FILENAME" )
    echo "\"${FILENAME#./}\": $TIME"
done

The Bash version aligns with the output of git whatchanged --pretty='%ct', but my git2-based implementation does not: it tends to report newer dates than those in git whatchanged.

Fixes I've attempted:

Edit: Ah, that's probably because I'm walking the commit log chronologically using git2::Sort::TIME. If I instead walk them by parent links, it should work better.

Shnatsel commented 3 years ago

Okay, this works:

// Copyright 2021 Google, inc.
// SPDX-License-identifier: Apache-2.0

use std::{cmp::max, collections::HashMap, path::PathBuf};
use git2::{Repository, Error};

fn main() -> Result<(), Error> {
    let mut mtimes: HashMap<PathBuf, i64> = HashMap::new();
    let repo = Repository::open(".")?;
    let mut revwalk = repo.revwalk()?;
    revwalk.set_sorting(git2::Sort::TIME)?;
    revwalk.push_head()?;
    for commit_id in revwalk {
        let commit_id = commit_id?;
        let commit = repo.find_commit(commit_id)?;
        // Ignore merge commits (2+ parents) because that's what 'git whatchanged' does.
        // Ignore commit with 0 parents (initial commit) because there's nothing to diff against
        if commit.parent_count() == 1 {
            let prev_commit = commit.parent(0)?;
            let tree = commit.tree()?;
            let prev_tree = prev_commit.tree()?;
            let diff = repo.diff_tree_to_tree(Some(&prev_tree), Some(&tree), None)?;
            for delta in diff.deltas() {
                let file_path = delta.new_file().path().unwrap();
                let file_mod_time = commit.time();
                let unix_time = file_mod_time.seconds();
                mtimes
                    .entry(file_path.to_owned())
                    .and_modify(|t| *t = max(*t, unix_time))
                    .or_insert(unix_time);
            }
        }
    }
    for (path, time) in mtimes.iter() {
        println!("{:?}: {}", path, time);
    }
    Ok(())
}

An MIT/Apache-licensed version can be found here.

Edit: although it looks like this code will miss files only touched in the initial commit. A solution can be found here.