I also made a Figma board that shows the different flows; we can use it in the reports: https://www.figma.com/board/CXNpHkeO8oTvorDGhY4hP0/PFE---ETS---Github-visualization-tool?node-id=2-70&node-type=code_block&t=YIgNsUtGaGi3F8bI-0
Ok yeah, downloading the whole repository every time we want stats about it is surely slow. Still confused as to why it's that slow right now though. It makes no sense to take 10+ minutes to clone the repo and fetch commit data. Was it because it parses all the commit data to JSON? Anyway, I like the changes. We should add an expiration date to the cache so it doesn't stay there indefinitely. Maybe 1 day should suffice. It would also be nice to have a way to force-refresh the cache for a stats request, because very active repositories could have multiple commits in a day. p.s. data return arrows should be dotted on a sequence diagram :)
Still confused as to why it's that slow right now though. It makes no sense to take 10+ minutes to clone the repo and fetch commit data. Was it because it parses all the commit data to JSON?
Yoo @syw1-art!
I profiled the /commits endpoint here: https://github.com/zergov/GihubVisualisation/pull/1
The reason it's so slow is that Git gets invoked once per commit. If you're dealing with a project with tens of thousands of commits, that takes a while, because Git has to be invoked tens of thousands of times.
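For illustration, here's a rough sketch of the per-commit pattern next to a batched one; the paths, format string, and function names are my assumptions, not the project's actual code:

```python
import subprocess

REPO = "/tmp/rails"  # hypothetical local clone

# Slow pattern: one git process per commit -> tens of thousands of forks.
def commit_stats_slow(shas):
    for sha in shas:
        out = subprocess.run(
            ["git", "-C", REPO, "show", "--no-patch", "--format=%H|%an|%aI", sha],
            capture_output=True, text=True, check=True,
        ).stdout
        yield out.strip().split("|")

# Fast pattern: a single `git log` invocation streams every commit at once.
def commit_stats_fast():
    out = subprocess.run(
        ["git", "-C", REPO, "log", "--format=%H|%an|%aI"],
        capture_output=True, text=True, check=True,
    ).stdout
    for line in out.splitlines():
        yield line.split("|")
```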
We should add an expiration date to the cache so it doesn't stay there indefinitely. Maybe 1 day should suffice.
Which cache are you talking about? If you mean the TinyDB cache (the JSON we have right now), then I'll go further and say that we don't need this JSON at all.
If you mean the SQLite databases, those are not a cache and we should never delete them. Hitting a SQLite database is soooooo much faster than invoking git and parsing its output. We should always read from our db.
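To make the comparison concrete, here's a minimal sketch of a read against one of those databases; the `commits` table, its columns, and the file path are assumptions:

```python
import sqlite3

def commits_per_author(db_path="repos/rails.sqlite3"):
    # One indexed query replaces thousands of git invocations and parses.
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT author, COUNT(*) FROM commits GROUP BY author"
        ).fetchall()
    finally:
        con.close()
```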
It would be nice to also have a way to force refresh the cache for a stat request because there could be very active repositories that have multiple commits in a day.
Yes! I was thinking we could add a refresh button on the frontend to trigger a flow that will:
I was not talking about the current way of storing things. It can go once we implement this newly suggested system. The reason I say we should put expiration dates on git repo data is twofold (see the two points quoted below):
Correct me if I've misunderstood your proposal.
I was not talking about the current way of storing things. It can go once we implement this newly suggested system.
Gotcha!
If we inspect stats for many git repos, to my understanding we would be storing them all in the db, so it would take up a lot of space if we never delete them.
The space it takes is reasonable. My local database for a big project like rails is 50 MB, and that's a large, popular repo that was started 20 years ago. That's very small, so I don't think storage is an issue here. Also, storage is the cheapest resource; I will gladly trade storage for speed.
Since we will read git data directly from the db when we find it there, we will miss more recent commits that we never fetch from git. E.g. I request data about a repo today, commits are added to that repo tomorrow, and I request the data again 2 days later: we don't get the commits made tomorrow because we only hit the db. Unless you want us to have to click the refresh button to fetch from git?
Very gooood observation! I think we can come up with a strategy that syncs the database whenever a user asks for stats about a repo. Some solutions (sketched in code after this list):
If a user asks for stats about github.com/rails/rails and the repo was synced 5 minutes ago, then we serve whatever we have in our database. We should show the user when the last sync happened; if the user knows it's out of sync, they can just hit the "refresh" button.
If a user asks for stats about github.com/rails/rails and the repo was synced yesterday, then we should automatically enqueue a re-indexing job for the missing data and let the user know that we're syncing the repo in the background. The frontend can be smart enough to poll the API to know when the sync is complete. This should be fairly quick because we only have to sync the delta between the latest commit in our database and the latest commit in the repo.
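As promised above, a minimal sketch of that strategy, assuming a 5-minute freshness threshold; `read_stats_from_db` and `enqueue_sync_job` are hypothetical helpers, not existing code:

```python
from datetime import datetime, timedelta, timezone

FRESH = timedelta(minutes=5)  # assumed threshold, tune as needed

def serve_stats(repo):
    """Hypothetical read path implementing the two scenarios above."""
    age = datetime.now(timezone.utc) - repo.synced_at
    if age <= FRESH:
        return {"stats": read_stats_from_db(repo), "synced_at": repo.synced_at}
    # Stale: enqueue a delta sync (latest sha in our db -> HEAD) and tell
    # the frontend we're syncing so it can poll until the job completes.
    enqueue_sync_job(repo)
    return {
        "stats": read_stats_from_db(repo),  # serve what we have meanwhile
        "synced_at": repo.synced_at,
        "syncing": True,
    }
```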
p.s. data return arrows should be dotted on a sequence diagram :)
Thanks @syw1-art, I updated all the sequence diagrams!
Wow, ok, if it takes that little space then it's fine; it basically makes no difference.
Another solution: maybe, on every repo stats request, we could ask GitHub for the date of the last commit and compare it to the latest commit date we have stored in the db, so we know whether it's synced or not?
This process seems good overall, but what about the other members of the team?
Another solution: maybe, on every repo stats request, we could ask GitHub for the date of the last commit and compare it to the latest commit date we have stored in the db, so we know whether it's synced or not?
Yoo I love it!
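For reference, a sketch of that check; the endpoint and response fields are the real GitHub REST API, but the function and how it's wired in are assumptions:

```python
import requests

def is_synced(owner, repo, latest_in_db):
    """Compare GitHub's newest commit date to the newest one in our db.

    `latest_in_db` is an ISO-8601 UTC string, e.g. "2024-05-01T12:00:00Z",
    so a plain string comparison is enough.
    """
    resp = requests.get(
        f"https://api.github.com/repos/{owner}/{repo}/commits",
        params={"per_page": 1},  # only the most recent commit
        timeout=10,
    )
    resp.raise_for_status()
    latest_on_github = resp.json()[0]["commit"]["committer"]["date"]
    return latest_on_github <= latest_in_db
```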
Closing since we agreed we're going in that direction.
Proposal: Extract repository information in an async job
Today, this is how the API serves a user requesting information about a repository:
The orange section of the sequence diagram is very expensive. We shouldn't do this work on every read request: we should clone and invoke Git to extract information about the repo only once. The extracted data should be written to a database, and the application should read from that database.
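One possible shape for that database, sketched below; the table and column names are assumptions, not a spec:

```python
import sqlite3

# A "repositories" registry plus one row per commit. The status column
# drives the async flow described below ('indexing' until the background
# job finishes, then 'ready').
SCHEMA = """
CREATE TABLE IF NOT EXISTS repositories (
    id        INTEGER PRIMARY KEY,
    url       TEXT UNIQUE NOT NULL,  -- e.g. github.com/rails/rails
    status    TEXT NOT NULL,         -- 'indexing' or 'ready'
    synced_at TEXT                   -- timestamp of the last sync
);
CREATE TABLE IF NOT EXISTS commits (
    sha           TEXT PRIMARY KEY,
    repository_id INTEGER NOT NULL REFERENCES repositories(id),
    author        TEXT NOT NULL,
    committed_at  TEXT NOT NULL
);
"""

def init_db(path="repos.sqlite3"):
    con = sqlite3.connect(path)
    con.executescript(SCHEMA)
    con.close()
```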
This is the new flow I propose. Scenario: the user requests information about the github.com/rails/rails repository. In this model, the first request registers the github.com/rails/rails repository in our database and enqueues a background job to process that repo. The background job will clone the repo, extract its data, and write that data into our database. Once it's done, it will mark the repo as "ready".
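As a sketch, that job could look something like this, reusing the hypothetical schema above; queue integration and error handling are omitted:

```python
import sqlite3
import subprocess

def index_repository(url, workdir="/tmp/clones", db="repos.sqlite3"):
    """Hypothetical background job: clone once, extract, mark ready."""
    clone_path = f"{workdir}/{url.rstrip('/').split('/')[-1]}"
    subprocess.run(
        ["git", "clone", "--bare", f"https://{url}", clone_path], check=True
    )
    # One git invocation for the whole history (assumes no '|' in author names).
    log = subprocess.run(
        ["git", "-C", clone_path, "log", "--format=%H|%an|%aI"],
        capture_output=True, text=True, check=True,
    ).stdout
    con = sqlite3.connect(db)
    with con:  # single transaction
        for line in log.splitlines():
            sha, author, date = line.split("|", 2)
            con.execute(
                "INSERT OR IGNORE INTO commits (sha, repository_id, author, committed_at)"
                " SELECT ?, id, ?, ? FROM repositories WHERE url = ?",
                (sha, author, date, url),
            )
        con.execute(
            "UPDATE repositories SET status = 'ready', synced_at = datetime('now')"
            " WHERE url = ?",
            (url,),
        )
    con.close()
```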
Once that's done, future requests for the github.com/rails/rails repository will use this flow: the request can be served to the user without having to invoke Git or clone the repository.
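Sketched as a framework-agnostic handler returning a (body, HTTP status) pair; `register_and_enqueue` is a hypothetical helper covering the first-request case:

```python
def handle_stats_request(url, con):
    """Hypothetical read path once the async flow above is in place."""
    row = con.execute(
        "SELECT id, status FROM repositories WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        register_and_enqueue(url)  # first ever request for this repo
        return {"status": "indexing"}, 202
    repo_id, status = row
    if status != "ready":
        return {"status": status}, 202  # still indexing; frontend polls
    stats = con.execute(
        "SELECT author, COUNT(*) FROM commits"
        " WHERE repository_id = ? GROUP BY author",
        (repo_id,),
    ).fetchall()
    return {"status": "ready", "stats": stats}, 200  # no Git involved
```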
References
Benchmarks I ran to see how Git performs: https://github.com/zergov/git-benchmarking