lean-dojo / LeanDojo

Tool for data extraction and interacting with Lean programmatically.
https://leandojo.org
MIT License
536 stars 80 forks source link

Incorporate Recent Commits #192

Closed yangky11 closed 1 month ago

yangky11 commented 1 month ago

Still a draft. Don't merge.

RexWzh commented 1 month ago

About the Cache

Here use relative paths in cache.get and cache.store, determining the cache path by LeanGitRepo.format_dirname rather than within the cache functions.

Original Method:

def get(self, url: str, commit: str) -> Optional[Path]:
    _, repo_name = _split_git_url(url)
    dirname = _format_dirname(url, commit)

def store(self, src: Path) -> Path:
    url, commit = get_repo_info(src)
    dirpath = self.cache_dir / _format_dirname(url, commit)
    _, repo_name = _split_git_url(url)

# Cache the traced repo
src_dir = tmp_dir / repo.name
path = cache.store(src_dir)

Modified Version:

def get(self, rel_cache_dir: Path) -> Optional[Path]:
    dirname = rel_cache_dir.parent
    dirpath = self.cache_dir / dirname
    cache_path = self.cache_dir / rel_cache_dir

def store(self, src: Path, rel_cache_dir: Path) -> Path:
    """Store a repo at path ``src``. Return its path in the cache.

    Args:
        src (Path): Path to the repo.
        rel_cache_dir (Path): The relative path of the stored repo in the cache.
    """
    dirpath = self.cache_dir / rel_cache_dir.parent
    cache_path = self.cache_dir / rel_cache_dir

# Cache the traced repo
src_dir = tmp_dir / repo.name
rel_cache_dir = repo.format_dirname / repo.name
path = cache.store(src_dir, rel_cache_dir)

The function get_repo_info in utils.py can be removed.

Moreover, cache.store is used for storing repo cache for non-github repo types. The LeanGitRepo.repo is a git.Repository object, with the working directory in CACHE_DIR/repos.

$HOME/.cache/lean_dojo
├── aesop-79fb157c6a5061190d169535f8e5cb007914a82e
│   └── aesop
├── batteries-e7897807913fafdab31b01b9f627550bcc96cff2
│   └── batteries
...
├── repos
│   ├── gitpython-aesop-79fb157c6a5061190d169535f8e5cb007914a82e
│   ├── gitpython-batteries-e7897807913fafdab31b01b9f627550bcc96cff2

For git repo, it uses a fixed prefix (gitpython) rather than the second-to-last directory name to avoid introducing temporary file names or duplicating the repo cache.


btw, the repo_type_of_url function in utils.py is duplicated and can therefore be deleted.

RexWzh commented 1 month ago

A note about the cache that might be helpful for the review process. Feel free to @ me if you get any questions😃