lean-dojo / ReProver

Retrieval-Augmented Theorem Provers for Lean
https://leandojo.org
MIT License
208 stars 44 forks source link

Deduplicate (url, commits) pairs before checking LeanGitRepo in cache to speed up trace_repos.py #14

Closed antonkov closed 1 year ago

antonkov commented 1 year ago

Following README.md for python scripts/download_data.py followed by python scripts/trace_repos.py trace_repos takes significant time to find repos to trace from 285690 entries in jsons with 42.73it/s with no visual progress report. (~6330 seconds)

Deduplicating (URL, commits) before creating repo object and checking in cache filesystem speeds it up to ~10s

antonkov commented 1 year ago

Especially as the result after deduplication is just 4 repos (mathlib, std, lean4, and qq)

yangky11 commented 1 year ago

Hi @antonkov,

Thank you for catching this! Could you please check if simply changing if not is_available_in_cache(repo) to if repo not in repos and not is_available_in_cache(repo) is sufficient, since LeanGitRepo is hashable?

antonkov commented 1 year ago

It gets to ~10 seconds when the cache is empty because it just checks each entry once and adds it to repos therefore short-circuiting is_available_in_cache for future calls. However it degrades to the worst case (hangs for ~minutes) when most of the entries are already in cache since is_available_in_cache returns true and repo is not added to repos which forces is_available_in_cache to be called again and again (which seems to be the bottleneck not the LeanGitRepo creation

yangky11 commented 1 year ago

Thanks!