Closed jpmcb closed 1 year ago
I think it's a valid concern for sure. Like you said, figuring out how long to keep a repo on disk is also a challenge if there's no reasonable way to check its activity outside of pinging GitHub to see if it's been updated recently.
Maybe there is a threshold of days without new commits that initiates some notification to us or someone monitoring that repo.
Keeping repos on disk would more or less be an operation of LRU cache hits. So, if there were 3 repos on disk that had last been used in this order:

```
last used: 0 - https://github.com/open-sauced/insights
last used: 1 - https://github.com/open-sauced/pizza
last used: 2 - https://github.com/open-sauced/pizza-cli
```
and a new request came into the `/bake` route with:

```json
{
  "url": "https://github.com/open-sauced/pizza-cli"
}
```

the `pizza-oven` would use the `pizza-cli` repo already on disk, fetch new updates, process those new changes, and update the cache to reflect that:
```
last used: 1 - https://github.com/open-sauced/insights
last used: 2 - https://github.com/open-sauced/pizza
last used: 0 - https://github.com/open-sauced/pizza-cli
```
If another request came in for a repo not on disk and there was only capacity for 3 repos, the least recently used repo on disk would be evicted:

```json
{
  "url": "https://github.com/open-sauced/ai"
}
```

```
last used: 2 - https://github.com/open-sauced/insights
last used: 0 - https://github.com/open-sauced/ai
last used: 1 - https://github.com/open-sauced/pizza-cli
```
Right now, evicting would just mean deleting the repo from disk, but in the future we may consider offloading those repos to cold storage (like S3) and querying cold storage for those repos when a new request comes in. This way, we don't always have to fall back to `git clone` (especially for large repos). But cold storage probably deserves its own issue and is a bit out of scope for this specific one - I only mention it since having repos on disk makes offloading files to cold storage much easier compared to getting them out of the in-memory representation.
There's also probably a list of a couple hundred repos that we'd always want on disk no matter their last usage in the cache, so we could also provide a YAML file that gets read in as configuration for which repos to always keep around:

```yaml
config:
  cache:
    neverEvict:
      - https://github.com/open-sauced/insights
      - https://github.com/open-sauced/pizza
      - https://github.com/open-sauced/pizza-cli
      - ...
```
:tada: This issue has been resolved in version 1.0.0 :tada:
The release is available on GitHub release
Your semantic-release bot :package::rocket:
Type of feature
🍕 Feature
Current behavior
The current POC of the pizza-oven contains a critical bottleneck: it depends on significant amounts of memory in order to reliably clone large repos (or multiple smaller repos at the same time) directly into memory. If memory runs out, the entire server crashes.
Suggested solution
Instead, a likely better (albeit more complex) solution would be to clone repos to disk. A single `pizza-oven` instance could be connected to many terabytes of disk space, which would expand its singleton scalability dramatically. On the disk itself, this would lend itself very nicely to an LRU cache architecture:
This would help ensure:

- no `git clone` rate limiting, since we can simply poll for new changes on repos already on disk

Implementing the LRU cache filesystem on the disk would make this service more complex, since cache hits and calculating positions of the different repos across different goroutines would couple the different processing more tightly, but I think it's the right direction to go to ensure a single `pizza-oven` service is usable by a sole open source contributor.
There'd also be a few questions to consider around edge cases as well:
Additional context
No response