open-sauced / pizza

This is an engine that sources git commits and turns them to insights
Apache License 2.0

Feature: investigate and spike on cloning repos to disk #14

Closed jpmcb closed 1 year ago

jpmcb commented 1 year ago

Type of feature

🍕 Feature

Current behavior

The current POC of the pizza-oven contains a critical bottleneck: it depends on significant amounts of memory in order to reliably clone large repos (or multiple smaller repos at the same time) directly into memory. If memory runs out, the entire server crashes.

Suggested solution

Instead, a likely better (albeit more complex) solution would be to clone repos to disk. A single pizza-oven instance could be connected to many terabytes of disk space, which would dramatically expand its scalability as a single instance.


On the disk itself, this would lend itself very nicely to an LRU cache architecture:

if git repo already on disk:
    pull most recent changes for that repo on disk
else:
    evict least recently used repos on disk until enough disk space is available for cloning the new repo
    clone new repo

process git commits for repo on disk

This would help ensure:

  1. we're not hitting git clone rate limits, since we can simply poll for new changes to repos already on disk
  2. for "busy" repos, an on-disk cache ensures there's always a fast way to get new changes
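As a rough illustration of the flow in the pseudocode above, here is a minimal Go sketch. Everything in it is an assumption made for the example: the `Ops` hooks, the `DiskCache` type, the byte budget, and the idea that the caller can supply a size estimate before cloning are not part of the current pizza-oven code.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBadSize is returned when a size estimate is not positive or can never fit.
var ErrBadSize = errors.New("size estimate must be positive and within the disk budget")

// Ops bundles the side effects so the cache logic can run without git or a
// real filesystem. All four hooks are hypothetical stand-ins.
type Ops struct {
	Clone   func(url string) error // e.g. run `git clone`
	Pull    func(url string) error // e.g. run `git pull` in the checkout
	Remove  func(url string) error // delete the checkout from disk
	Process func(url string) error // process git commits for the repo
}

type entry struct {
	url  string
	size int64
}

// DiskCache tracks which repos are on disk; repos[0] is the most recently used.
type DiskCache struct {
	ops    Ops
	budget int64 // bytes of disk the cache may use
	used   int64 // bytes currently accounted for
	repos  []entry
}

func NewDiskCache(budget int64, ops Ops) *DiskCache {
	return &DiskCache{ops: ops, budget: budget}
}

// Bake follows the pseudocode: pull on a hit, otherwise evict and clone,
// then process. estimatedSize is only consulted on a miss.
func (c *DiskCache) Bake(url string, estimatedSize int64) error {
	hit := -1
	for i, e := range c.repos {
		if e.url == url {
			hit = i
			break
		}
	}
	if hit >= 0 {
		if err := c.ops.Pull(url); err != nil {
			return err
		}
		// Move the repo to the most recently used position.
		e := c.repos[hit]
		c.repos = append(c.repos[:hit], c.repos[hit+1:]...)
		c.repos = append([]entry{e}, c.repos...)
	} else {
		if estimatedSize <= 0 || estimatedSize > c.budget {
			return ErrBadSize
		}
		// Evict from the least recently used end until the new repo fits.
		for c.used+estimatedSize > c.budget {
			last := c.repos[len(c.repos)-1]
			if err := c.ops.Remove(last.url); err != nil {
				return err
			}
			c.repos = c.repos[:len(c.repos)-1]
			c.used -= last.size
		}
		if err := c.ops.Clone(url); err != nil {
			return err
		}
		c.repos = append([]entry{{url, estimatedSize}}, c.repos...)
		c.used += estimatedSize
	}
	return c.ops.Process(url)
}

func main() {
	log := func(verb string) func(string) error {
		return func(url string) error { fmt.Println(verb, url); return nil }
	}
	c := NewDiskCache(100, Ops{log("clone"), log("pull"), log("remove"), log("process")})
	c.Bake("open-sauced/pizza", 60)
	c.Bake("open-sauced/insights", 30)
	c.Bake("open-sauced/pizza", 60) // hit: pull only
	c.Bake("open-sauced/ai", 40)    // miss: evicts insights, the LRU repo
}
```

Passing the git and filesystem operations in as hooks is just one way to keep the eviction logic testable on its own; the real service could call git directly.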

Implementing the LRU cache on the filesystem would make this service more complex, since tracking cache hits and the positions of the different repos across goroutines would couple the processing more tightly. But I think it's the right direction to ensure a single pizza-oven service is usable by a sole open source contributor.
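One possible way to contain that coupling is a lock per repo, so that goroutines working on the same repo take turns while different repos proceed in parallel. This is only a sketch: `RepoLocks` and `ConcurrentBakes` are hypothetical names for illustration, not existing pizza-oven code.

```go
package main

import (
	"fmt"
	"sync"
)

// RepoLocks hands out one mutex per repo URL (a hypothetical helper).
// Entries are never removed here; a real service would drop them on eviction.
type RepoLocks struct {
	mu    sync.Mutex             // guards the map itself
	locks map[string]*sync.Mutex // one lock per repo URL
}

func NewRepoLocks() *RepoLocks {
	return &RepoLocks{locks: make(map[string]*sync.Mutex)}
}

// Lock blocks until the caller owns url's lock and returns the release func.
func (r *RepoLocks) Lock(url string) (unlock func()) {
	r.mu.Lock()
	l, ok := r.locks[url]
	if !ok {
		l = &sync.Mutex{}
		r.locks[url] = l
	}
	r.mu.Unlock() // release the map lock before waiting on the repo lock
	l.Lock()
	return l.Unlock
}

// ConcurrentBakes starts n goroutines that all "bake" the same repo and
// returns how many updates were applied. The counter is deliberately not
// atomic: only the per-repo lock keeps it consistent.
func ConcurrentBakes(locks *RepoLocks, url string, n int) int {
	applied := 0
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			unlock := locks.Lock(url)
			defer unlock()
			applied++ // stands in for "fetch, process, update cache position"
		}()
	}
	wg.Wait()
	return applied
}

func main() {
	locks := NewRepoLocks()
	fmt.Println(ConcurrentBakes(locks, "https://github.com/open-sauced/pizza", 100)) // prints 100
}
```

The LRU bookkeeping itself would still need its own mutex, but holding it only for the brief position update keeps long-running fetches from blocking unrelated repos.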

There'd also be a few questions to consider around edge cases as well:

Additional context

No response


brandonroberts commented 1 year ago

I think it's a valid concern for sure. Like you said, figuring out how long to keep a repo on disk is also a challenge if there's no reasonable way to check its activity other than pinging GitHub to see if it's been updated recently.

Maybe there is a threshold of days without new commits that initiates some notification to us or someone monitoring that repo.

jpmcb commented 1 year ago

> Maybe there is a threshold of days without new commits that initiates some notification to us or someone monitoring that repo.

Keeping repos on disk would more or less be a function of LRU cache hits. So, if there were 3 repos on disk that had last been used in this order:

last used: 0
https://github.com/open-sauced/insights

last used: 1
https://github.com/open-sauced/pizza

last used: 2
https://github.com/open-sauced/pizza-cli

and a new request came into the /bake route with:

{
    "url": "https://github.com/open-sauced/pizza-cli"
}

the pizza-oven would use the pizza-cli repo already on disk, fetch new updates, process those new changes, and update the cache to reflect that:

last used: 1
https://github.com/open-sauced/insights

last used: 2
https://github.com/open-sauced/pizza

last used: 0
https://github.com/open-sauced/pizza-cli

If another request came in for a repo not on disk and there was only capacity for 3 repos, the least recently used repo on disk would be evicted:

{
    "url": "https://github.com/open-sauced/ai"
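}

Here https://github.com/open-sauced/pizza is the least recently used repo, so it would be evicted, leaving:

last used: 2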
https://github.com/open-sauced/insights

last used: 0
https://github.com/open-sauced/ai

last used: 1
https://github.com/open-sauced/pizza-cli
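The walk-through above can be reproduced with a small count-bounded LRU. This is a sketch for illustration only: `RepoCache`, `Use`, and `Ranked` are hypothetical names, and it tracks URLs in memory rather than touching the disk.

```go
package main

import (
	"container/list"
	"fmt"
)

// RepoCache is a hypothetical count-bounded LRU of repo URLs.
type RepoCache struct {
	capacity int
	order    *list.List               // front = most recently used
	entries  map[string]*list.Element // URL -> its node in order
}

func NewRepoCache(capacity int) *RepoCache {
	return &RepoCache{
		capacity: capacity,
		order:    list.New(),
		entries:  make(map[string]*list.Element),
	}
}

// Use marks url as most recently used, inserting it if needed.
// It returns the URL evicted to make room, or "" if nothing was evicted.
func (c *RepoCache) Use(url string) (evicted string) {
	if el, ok := c.entries[url]; ok {
		c.order.MoveToFront(el)
		return ""
	}
	if c.order.Len() > 0 && c.order.Len() >= c.capacity {
		oldest := c.order.Back()
		evicted = oldest.Value.(string)
		c.order.Remove(oldest)
		delete(c.entries, evicted)
	}
	c.entries[url] = c.order.PushFront(url)
	return evicted
}

// Ranked returns URLs from most to least recently used ("last used: 0" first).
func (c *RepoCache) Ranked() []string {
	out := make([]string, 0, c.order.Len())
	for el := c.order.Front(); el != nil; el = el.Next() {
		out = append(out, el.Value.(string))
	}
	return out
}

func main() {
	const org = "https://github.com/open-sauced/"
	c := NewRepoCache(3)
	// Seed the starting state: insights most recent, pizza-cli least recent.
	for _, name := range []string{"pizza-cli", "pizza", "insights"} {
		c.Use(org + name)
	}
	c.Use(org + "pizza-cli")       // hit: pizza-cli moves to the front
	fmt.Println(c.Use(org + "ai")) // miss on a full cache: pizza is evicted
	fmt.Println(c.Ranked())        // ai, pizza-cli, insights
}
```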

Right now, evicting would just mean deleting the repo from disk, but in the future we may consider offloading those repos to cold storage (like S3) and querying cold storage for them when a new request comes in. This way, we don't always have to fall back to git clone (especially for large repos). But cold storage probably deserves its own issue and is a bit out of scope for this one. I only mention it since offloading files to cold storage is much easier from disk than from the in-memory representation.

There's also probably a list of a couple hundred repos that we'd always want on disk no matter their last usage in the cache, so we could also provide a yaml file that gets read in as configuration for which repos to always keep around:

config:
    cache:
        neverEvict:
            - https://github.com/open-sauced/insights
            - https://github.com/open-sauced/pizza
            - https://github.com/open-sauced/pizza-cli
            - ...
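A sketch of how a neverEvict list might change eviction: the cache walks from the least recently used end and skips pinned repos. The type and function names are hypothetical, and reading the YAML file is left out; the list is simply passed in as a slice.

```go
package main

import (
	"container/list"
	"errors"
	"fmt"
)

// ErrNoVictim is returned when the cache is full and nothing may be evicted.
var ErrNoVictim = errors.New("cache full and no evictable repo")

// PinnedLRU is a hypothetical LRU that never evicts the pinned URLs.
type PinnedLRU struct {
	capacity int
	pinned   map[string]bool          // from config.cache.neverEvict
	order    *list.List               // front = most recently used
	index    map[string]*list.Element // URL -> its node in order
}

func NewPinnedLRU(capacity int, neverEvict []string) *PinnedLRU {
	pinned := make(map[string]bool, len(neverEvict))
	for _, url := range neverEvict {
		pinned[url] = true
	}
	return &PinnedLRU{
		capacity: capacity,
		pinned:   pinned,
		order:    list.New(),
		index:    make(map[string]*list.Element),
	}
}

// Use marks url as most recently used. On a miss with a full cache it evicts
// the least recently used repo that is not pinned and returns its URL.
func (c *PinnedLRU) Use(url string) (evicted string, err error) {
	if el, ok := c.index[url]; ok {
		c.order.MoveToFront(el)
		return "", nil
	}
	if c.order.Len() >= c.capacity {
		victim := c.order.Back()
		for victim != nil && c.pinned[victim.Value.(string)] {
			victim = victim.Prev() // skip pinned repos
		}
		if victim == nil {
			return "", ErrNoVictim
		}
		evicted = victim.Value.(string)
		c.order.Remove(victim)
		delete(c.index, evicted)
	}
	c.index[url] = c.order.PushFront(url)
	return evicted, nil
}

func main() {
	const org = "https://github.com/open-sauced/"
	c := NewPinnedLRU(3, []string{org + "insights", org + "pizza"})
	for _, name := range []string{"insights", "pizza", "pizza-cli"} {
		c.Use(org + name)
	}
	evicted, err := c.Use(org + "ai")
	fmt.Println(evicted, err) // pizza-cli is the only unpinned candidate
}
```

One consequence worth noting: if the pinned list ever fills the whole cache, new repos cannot be admitted at all, so the pinned set would need to stay well below capacity.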

open-sauced[bot] commented 1 year ago

:tada: This issue has been resolved in version 1.0.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket: