paulodiovani closed 4 years ago
What about this? Since we can make up to 30 requests per minute (if authenticated), we can configure our worker to make 30 requests every minute. At 100 results per request, that is 3000 pull requests per minute. From the start of October until now, there have been about 753529 pull requests. That is 753529/3000 ≈ 251 minutes, or about 4 hours, to fetch all of them. I think this is good for a start. I can create a worker that does only that until it retrieves all data from the start of October. The other part of the worker (or maybe a different worker) would be the one to update the database every hour.
@paulodiovani
> I can create a worker that does only that until it retrieves all data from start of October. The other part of the worker (or maybe a different worker) would be the one to update the database every hour.
I think we can go straight to updating the database every [hour] (interval suggested below). Because the update fills in the blanks, it will add every missing pull request and so already handles the initial load.
There is no need to have two different jobs/workers.
> we can make up to 30 requests per minute (if authenticated), we can configure our worker to make 30 requests every minute. This is 3000 pull requests per minute. From the start of October until this moment, there were about 753529 pull requests. That is 753529/3000=251 minutes ~ 4 hours to fetch all of them.
I'm worried that GitHub may ban our IP or token if we (ab)use it too much. So I suggest the update worker run just often enough to fill the database at the start (say, every 5 minutes; this should get all existing pull requests in a single day) and after that be changed to run 4 times a day.
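To compare the two schedules discussed above, here is a rough sketch of the arithmetic. It assumes 100 results per request (which is what 3000 pull requests from 30 requests implies) and that each run of the worker makes one batch of 30 requests; `backfill_minutes` and the constant names are made up for this illustration and are not part of the project.

```python
import math

# Figures taken from the discussion above.
TOTAL_PRS = 753_529       # pull requests since the start of October
REQUESTS_PER_RUN = 30     # authenticated search requests per run
# Assumption: each request returns 100 pull requests (3000 / 30).
PER_PAGE = 100


def backfill_minutes(total_prs, interval_minutes,
                     requests_per_run=REQUESTS_PER_RUN, per_page=PER_PAGE):
    """Estimate how long the initial load takes, in minutes.

    Assumes every run fetches requests_per_run full pages and that
    runs are spaced interval_minutes apart.
    """
    runs_needed = math.ceil(total_prs / (requests_per_run * per_page))
    return runs_needed * interval_minutes


if __name__ == "__main__":
    for interval in (1, 5):
        hours = backfill_minutes(TOTAL_PRS, interval) / 60
        print(f"one run every {interval} min: about {hours:.1f} hours")
```

Under these assumptions, one run per minute finishes in a bit over 4 hours, and one run every 5 minutes finishes in 21 hours, which is consistent with the "single day" estimate.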
Did you check whether the `created` search filter allows a full timestamp?
If so, we can load/update by splitting the requests, starting each new one from the created time of the last PR we got.
example:
created=2019-10-01 00:00:00.000..2019-10-31 23:59:59.999
created=LAST_TIMESTAMP..2019-10-31 23:59:59.999
`created` does not allow a full timestamp; the format has to be YYYY-MM-DD.
It says it allows:
You can also add optional time information THH:MM:SS+00:00 after the date, to search by the hour, minute, and second. -- (https://help.github.com/en/articles/searching-issues-and-pull-requests#search-by-when-an-issue-or-pull-request-was-created-or-last-updated)
Note: It is important to add `sort=created` and `order=asc` to the search for this to work.
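A minimal sketch of how the next request could be built from the created time of the last PR we stored. It only assembles the URL and makes no network call. The `type:pr` qualifier, `per_page=100`, the end of the window, and the names `stamp` and `next_search_url` are assumptions for illustration; the project's real query may use other qualifiers.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/issues"
# Assumption: the window of interest ends on the last second of October 2019.
WINDOW_END = datetime(2019, 10, 31, 23, 59, 59, tzinfo=timezone.utc)


def stamp(moment):
    """Format a datetime as YYYY-MM-DDTHH:MM:SS+00:00 (UTC)."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")


def next_search_url(last_created, window_end=WINDOW_END):
    """Build the search URL that resumes from the last PR we stored."""
    # Hypothetical query: only the created range comes from this thread.
    query = f"type:pr created:{stamp(last_created)}..{stamp(window_end)}"
    params = {
        "q": query,
        "sort": "created",  # oldest first, so the last result is the
        "order": "asc",     # point to resume from on the next run
        "per_page": 100,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    start = datetime(2019, 10, 1, tzinfo=timezone.utc)
    print(next_search_url(start))
```

Because the next range starts at the last timestamp we saw, the boundary PR will probably come back again, so the worker should skip pull requests it has already stored.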
Oh! great! must have been too sleepy to read it.
done in #31 :tada:
These issues must be done first:
Create a worker to run once an hour to update our database/cache.