paulodiovani / hacktoberrank

Hacktoberfest Rank
https://hacktoberrank-challenge.herokuapp.com/
MIT License
6 stars 17 forks

Fetch GitHub in a worker to update database/cache #7

Closed paulodiovani closed 4 years ago

paulodiovani commented 4 years ago

These issues must be done first:

Create a worker to run once an hour to update our database/cache.

arabyalhomsi commented 4 years ago

What about this? Since we can make up to 30 requests per minute (if authenticated), we can configure our worker to make 30 requests every minute. At 100 results per request, this is 3000 pull requests per minute. From the start of October until this moment, there were about 753529 pull requests. That is 753529/3000 ≈ 251 minutes, or roughly 4 hours, to fetch all of them. I think this is good for a start. I can create a worker that does only that until it retrieves all data from the start of October. The other part of the worker (or maybe a different worker) would be the one to update the database every hour. @paulodiovani
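The estimate above can be checked with a short calculation. This is only a sketch: the 100 results per request is an assumption implied by the 3000-per-minute figure, and the function name is made up for illustration.

```python
REQUESTS_PER_MINUTE = 30       # authenticated search rate quoted in the comment
RESULTS_PER_REQUEST = 100      # assumption: implied by 3000 / 30
TOTAL_PULL_REQUESTS = 753_529  # count quoted in the comment


def backfill_minutes(total: int,
                     requests_per_minute: int = REQUESTS_PER_MINUTE,
                     results_per_request: int = RESULTS_PER_REQUEST) -> float:
    """Minutes needed to fetch `total` pull requests at full rate."""
    per_minute = requests_per_minute * results_per_request
    return total / per_minute


minutes = backfill_minutes(TOTAL_PULL_REQUESTS)
print(f"{minutes:.0f} minutes ~ {minutes / 60:.1f} hours")  # 251 minutes ~ 4.2 hours
```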

paulodiovani commented 4 years ago

> I can create a worker that does only that until it retrieves all data from start of October. The other part of the worker (or maybe a different worker) would be the one to update the database every hour.

I think we can go straight to updating the database every [hour] (interval suggested below). Since each update fills in the blanks, it will add every missing pull request, so it already handles the initial load.

There is no need to have two different jobs/workers.

> we can make up to 30 requests per minute (if authenticated), we can configure our worker to make 30 requests every minute. This is 3000 pull requests per minute. From the start of October until this moment, there were about 753529 pull requests. That is 753529/3000=251 minutes ~ 4 hours to fetch all of them.

I'm worried that GitHub may ban our IP or token if we (ab)use it too much. So I suggest the update worker run just often enough to fill the database at the start (say, every 5 minutes -- this should get all existing pull requests in a single day) and after that switch to running 4 times a day.
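As a rough check of the "single day" claim, here is a sketch under one loud assumption that the thread does not state: each run issues one minute's worth of requests (30 requests of 100 results each). The helper names are hypothetical.

```python
import math

PER_RUN = 30 * 100  # assumption: one run = 30 requests x 100 results


def runs_per_day(interval_minutes: int) -> int:
    """How many times a worker fires in 24 hours at a fixed interval."""
    return (24 * 60) // interval_minutes


def days_to_backfill(total: int, interval_minutes: int,
                     per_run: int = PER_RUN) -> int:
    """Whole days needed to fetch `total` pull requests."""
    daily_capacity = runs_per_day(interval_minutes) * per_run
    return math.ceil(total / daily_capacity)


print(runs_per_day(5))               # 288 runs per day
print(days_to_backfill(753_529, 5))  # 1 day for the quoted backlog
```

Under the same assumption, running 4 times a day covers up to 12000 new pull requests per day.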

paulodiovani commented 4 years ago

Did you check whether the created search filter allows a full timestamp?

If so, we can load/update incrementally by starting each batch of requests from the created time of the last PR we got.

example:

arabyalhomsi commented 4 years ago

created does not allow a full timestamp; the format has to be YYYY-MM-DD

paulodiovani commented 4 years ago

It says it allows:

You can also add optional time information THH:MM:SS+00:00 after the date, to search by the hour, minute, and second. -- (https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)

Note: It is important to add sort=created and order=asc to the search for this to work.
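A minimal sketch of how such a search URL could be built, assuming the GitHub search endpoint `https://api.github.com/search/issues`. The `type:pr` qualifier, the function name, and the example date are assumptions for illustration and not necessarily what the project uses. No request is sent; the code only builds the URL.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/issues"


def build_search_url(created_after: datetime, page: int = 1,
                     per_page: int = 100) -> str:
    """Build a search URL for PRs created at or after `created_after`."""
    # isoformat() on a UTC-aware datetime yields YYYY-MM-DDTHH:MM:SS+00:00
    timestamp = created_after.isoformat(timespec="seconds")
    params = {
        "q": f"type:pr created:>={timestamp}",  # assumed qualifiers
        "sort": "created",   # oldest first, so the last result
        "order": "asc",      # becomes the next starting point
        "per_page": per_page,
        "page": page,
    }
    # urlencode escapes the "+" in the UTC offset as %2B
    return f"{SEARCH_URL}?{urlencode(params)}"


# Example date only; in practice this would be the created time of the last stored PR.
last_seen = datetime(2019, 10, 1, tzinfo=timezone.utc)
url = build_search_url(last_seen)
print(url)
```

Sorting by created time in ascending order means the last pull request in each response marks where the next run should start.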

arabyalhomsi commented 4 years ago

Oh, great! I must have been too sleepy to read it.

paulodiovani commented 4 years ago

done in #31 :tada: