I also made a Figma board that shows the different flows; we can use it in the reports: https://www.figma.com/board/CXNpHkeO8oTvorDGhY4hP0/PFE---ETS---Github-visualization-tool?node-id=2-70&node-type=code_block&t=YIgNsUtGaGi3F8bI-0
Ok yeah, downloading the whole repository every time we want stats about it is surely slow. Still confused as to why it's that slow right now though. It makes no sense to take 10+ minutes to clone the repo and fetch commit data. Was it because it parses all the commit data to JSON? Anyway, I like the changes. We should add an expiration date to the cache so it doesn't stay there indefinitely. Maybe 1 day should suffice. It would also be nice to have a way to force-refresh the cache for a stats request, because very active repositories could have multiple commits in a day. p.s. data return arrows should be dotted on a sequence diagram :)
Still confused as to why it's that slow right now though. It makes no sense to take 10+ minutes to clone the repo and fetch commit data. Was it because it parses all the commit data to JSON?
Yoo @syw1-art!
I profiled the /commits endpoint here: https://github.com/zergov/GihubVisualisation/pull/1
The reason it's so slow is that Git gets invoked once per commit. If you're dealing with a project with tens of thousands of commits, that takes a while, because Git has to be invoked tens of thousands of times.
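For illustration, here's a rough sketch of the per-commit pattern next to a batched one; the paths, format string, and function names are my assumptions, not the project's actual code:

```python
import subprocess

REPO = "/tmp/rails"  # hypothetical local clone

# Slow pattern: one git process per commit -> tens of thousands of forks.
def commit_stats_slow(shas):
    for sha in shas:
        out = subprocess.run(
            ["git", "-C", REPO, "show", "--no-patch", "--format=%H|%an|%aI", sha],
            capture_output=True, text=True, check=True,
        ).stdout
        yield out.strip().split("|")

# Fast pattern: a single `git log` invocation streams every commit at once.
def commit_stats_fast():
    out = subprocess.run(
        ["git", "-C", REPO, "log", "--format=%H|%an|%aI"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        yield line.split("|")
```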
We should add an expiration date to the cache so it doesn't stay there indefinitely. Maybe 1 day should suffice.
Which cache are you talking about? If you mean the TinyDB cache (the JSON we have right now), then I'll go further and say that we don't need this JSON at all.
If you mean the SQLite databases, those are not a cache and we should never delete them. Hitting a SQLite database is soooooo much faster than invoking git and parsing its output. We should always read from our db.
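To make the comparison concrete, here's a minimal sketch of a read against one of those databases; the `commits` table, its columns, and the file path are assumptions:

```python
import sqlite3

def commits_per_author(db_path="repos/rails.sqlite3"):
    # One indexed query replaces thousands of git invocations and parses.
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT author, COUNT(*) FROM commits GROUP BY author"
        ).fetchall()
    finally:
        con.close()
```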
It would be nice to also have a way to force refresh the cache for a stat request because there could be very active repositories that have multiple commits in a day.
Yes! I was thinking we could add a refresh button on the frontend to trigger a flow that will:
I was not talking about the current way of storing things. It can go once we implement this newly suggested system. The reason I say we should put expiration dates on git repo data is twofold (see the two points quoted below):
Correct me if I've misunderstood your proposal.
I was not talking about the current way of storing things. It can go once we implement this newly suggested system.
Gotcha!
If we inspect stats for many git repos, to my understanding we would be storing them all in the db, so it would take up a lot of space if we never delete them.
The space it takes is reasonable. My local database for a big project like rails is 50 MB, and that's a large, popular repo that was started 20 years ago. That's very small, so I don't think storage is an issue here. Also, storage is the cheapest resource; I will gladly trade storage for speed.
Since we will read git data directly from the db when we find it there, we will miss more recent commits that we never fetch from git. E.g. I request data about a repo today, commits are added to that repo tomorrow, and I request the data again 2 days later: we don't get the commits made tomorrow because we only hit the db. Unless you want us to have to click the refresh button to fetch from git?
Very gooood observation! I think we can come up with a strategy that syncs the database whenever a user asks for stats about a repo. Some solutions (sketched in code after this list):
If a user asks for stats about github.com/rails/rails and the repo was synced 5 minutes ago, then we serve whatever we have in our database. We should show the user when the last sync happened; if the user knows it's out of sync, they can just hit the "refresh" button.
If a user asks for stats about github.com/rails/rails and the repo was synced yesterday, then we should automatically enqueue a re-indexing job for the missing data and let the user know that we're syncing the repo in the background. The frontend can be smart enough to poll the API to know when the sync is complete. This should be fairly quick because we only have to sync the delta between the latest commit in our database and the latest commit in the repo.
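As promised above, a minimal sketch of that strategy, assuming a 5-minute freshness threshold; `read_stats_from_db` and `enqueue_sync_job` are hypothetical helpers, not existing code:

```python
from datetime import datetime, timedelta, timezone

FRESH = timedelta(minutes=5)  # assumed threshold, tune as needed

def serve_stats(repo):
    """Hypothetical read path implementing the two scenarios above."""
    age = datetime.now(timezone.utc) - repo.synced_at
    if age <= FRESH:
        return {"stats": read_stats_from_db(repo), "synced_at": repo.synced_at}
    # Stale: enqueue a delta sync (latest sha in our db -> HEAD) and tell
    # the frontend we're syncing so it can poll until the job completes.
    enqueue_sync_job(repo)
    return {
        "stats": read_stats_from_db(repo),  # serve what we have meanwhile
        "synced_at": repo.synced_at,
        "syncing": True,
    }
```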
p.s. data return arrows should be dotted on a sequence diagram :)
Thanks @syw1-art, I updated all the sequence diagrams!
Wow, ok, if it takes that little space then it's fine; it basically makes no difference.
Another solution: maybe, on every repo stats request, we could ask GitHub for the date of the last commit and compare it to the latest commit date we have stored in the db, so we know whether it's synced or not?
This process seems good overall, but what about the other members of the team?
Another solution: maybe, on every repo stats request, we could ask GitHub for the date of the last commit and compare it to the latest commit date we have stored in the db, so we know whether it's synced or not?
Yoo I love it!
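For reference, a sketch of that check; the endpoint and response fields are the real GitHub REST API, but the function and how it's wired in are assumptions:

```python
import requests

def is_synced(owner, repo, latest_in_db):
    """Compare GitHub's newest commit date to the newest one in our db.

    `latest_in_db` is an ISO-8601 UTC string, e.g. "2024-05-01T12:00:00Z",
    so a plain string comparison is enough.
    """
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits",
        params={"per_page": 1},  # only the most recent commit
        timeout=10,
    )
    resp.raise_for_status()
    latest_on_github = resp.json()[0]["commit"]["committer"]["date"]
    return latest_on_github <= latest_in_db
```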
Closing since we agreed we're going in that direction.
Proposal: Extract repository information in an async job
Today, this is how the API serves a user requesting information about a repository:
The orange section of the sequence diagram is very expensive. We shouldn't do this work on every read request: we should clone and invoke Git to extract information about the repo only once. The extracted data should be written to a database, and the application should read from that database.
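One possible shape for that database, sketched below; the table and column names are assumptions, not a spec:

```python
import sqlite3

# A "repositories" registry plus one row per commit. The status column
# drives the async flow described below ('indexing' until the background
# job finishes, then 'ready').
SCHEMA = """
CREATE TABLE IF NOT EXISTS repositories (
    id        INTEGER PRIMARY KEY,
    url       TEXT UNIQUE NOT NULL,  -- e.g. github.com/rails/rails
    status    TEXT NOT NULL,         -- 'indexing' or 'ready'
    synced_at TEXT                   -- timestamp of the last sync
);
CREATE TABLE IF NOT EXISTS commits (
    sha           TEXT PRIMARY KEY,
    repository_id INTEGER NOT NULL REFERENCES repositories(id),
    author        TEXT NOT NULL,
    committed_at  TEXT NOT NULL
);
"""

def init_db(path="repos.sqlite3"):
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    con.close()
```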
This is the new flow I propose. Scenario: the user requests information about the github.com/rails/rails repository. In this model, the first request registers the github.com/rails/rails repository in our database and enqueues a background job to process that repo. The background job will clone the repo, extract its data, and write that data into our database. Once it's done, it will mark the repo as "ready".
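As a sketch, that job could look something like this, reusing the hypothetical schema above; queue integration and error handling are omitted:

```python
import sqlite3
import subprocess

def index_repository(url, workdir="/tmp/clones", db="repos.sqlite3"):
    """Hypothetical background job: clone once, extract, mark ready."""
    clone_path = f"{workdir}/{url.rstrip('/').split('/')[-1]}"
    subprocess.run(
        ["git", "clone", "--bare", f"https://{url}", clone_path], check=True
    )
    # One git invocation for the whole history (assumes no '|' in author names).
    log = subprocess.run(
        ["git", "-C", clone_path, "log", "--format=%H|%an|%aI"],
        capture_output=True, text=True, check=True,
    ).stdout
    con = sqlite3.connect(db)
    with con:  # single transaction
        for line in log.splitlines():
            sha, author, date = line.split("|", 2)
            con.execute(
                "INSERT OR IGNORE INTO commits (sha, repository_id, author, committed_at)"
                " SELECT ?, id, ?, ? FROM repositories WHERE url = ?",
                (sha, author, date, url),
            )
        con.execute(
            "UPDATE repositories SET status = 'ready', synced_at = datetime('now')"
            " WHERE url = ?",
            (url,),
        )
    con.close()
```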
Once that's done, future requests for the github.com/rails/rails repository will use this flow: the request can be served to the user without having to invoke Git or clone the repository.
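Sketched as a framework-agnostic handler returning a (body, HTTP status) pair; `register_and_enqueue` is a hypothetical helper covering the first-request case:

```python
def handle_stats_request(url, con):
    """Hypothetical read path once the async flow above is in place."""
    row = con.execute(
        "SELECT id, status FROM repositories WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        register_and_enqueue(url)  # first ever request for this repo
        return {"status": "indexing"}, 202
    repo_id, status = row
    if status != "ready":
        return {"status": status}, 202  # still indexing; frontend polls
    stats = con.execute(
        "SELECT author, COUNT(*) FROM commits"
        " WHERE repository_id = ? GROUP BY author",
        (repo_id,),
    ).fetchall()
    return {"status": "ready", "stats": stats}, 200  # no Git involved
```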
References
Benchmarks I ran to see how Git performs: https://github.com/zergov/git-benchmarking