src-d / metadata-retrieval


Handle rate limit errors #3

Closed carlosms closed 5 years ago

carlosms commented 5 years ago

Right now the DownloadRepository method exits if the rate limit is exceeded. Maybe we want to implement a similar strategy to the ghsync client: when we find this error, wait until the rate limit gets reset.

I say maybe because there might be cases where it makes sense for the library to just return immediately. For example, if the app using the lib has other auth tokens to try.
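
A minimal sketch of what that wait-until-reset strategy could look like, assuming a hypothetical `ErrRateLimit` error type and a `WaitOnRateLimit` option (neither is the actual metadata-retrieval API); when the option is off, the error is returned immediately so the caller can switch to another token:

```go
// A hypothetical sketch, not the actual metadata-retrieval API:
// ErrRateLimit, WaitOnRateLimit and Downloader are assumed names.
package downloader

import (
	"context"
	"errors"
	"time"
)

// ErrRateLimit is an assumed error type carrying the time at which
// the GitHub rate limit window resets.
type ErrRateLimit struct {
	Reset time.Time
}

func (e *ErrRateLimit) Error() string { return "github rate limit exceeded" }

// Downloader wraps the download call. When WaitOnRateLimit is set it sleeps
// until the reported reset time and retries; otherwise it returns the error
// so the caller can, for example, switch to another auth token.
type Downloader struct {
	WaitOnRateLimit bool
	download        func(ctx context.Context, owner, name string) error
}

func (d *Downloader) DownloadRepository(ctx context.Context, owner, name string) error {
	for {
		err := d.download(ctx, owner, name)
		var rl *ErrRateLimit
		if err == nil || !errors.As(err, &rl) || !d.WaitOnRateLimit {
			return err
		}

		select {
		case <-time.After(time.Until(rl.Reset)):
			// the rate limit window has reset, retry the download
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```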

se7entyse7en commented 5 years ago

Linking here the multi-client version used for the demo, just in case it might help. That version was only meant as a working prototype and is not clean enough to be included as-is.

se7entyse7en commented 5 years ago

BTW, that version is not parallel; it just switches to another client once the current one is exhausted.

AFAIR, using 5 tokens serially never made ghsync sleep: once a client is exhausted, it recovers while the others are being used. So whether adding parallelization is worth it depends on the order of magnitude of the number of tokens we plan to use. That said, using 5 tokens in parallel will probably lead the application to sleep, since all the clients will be exhausted at nearly the same time, making the effort of writing parallel code useless.
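
For reference, a rough sketch of that serial switching strategy; `clientPool` and its methods are made-up names, not the prototype's code:

```go
// Hypothetical names, not the prototype's code: a pool of authenticated
// clients (one per token) used serially, switching whenever the current
// one is exhausted.
package pool

import "net/http"

type clientPool struct {
	clients []*http.Client // one authenticated client per token
	current int
}

// active returns the client currently in use.
func (p *clientPool) active() *http.Client {
	return p.clients[p.current]
}

// next switches to the following client in round-robin order; by the time
// we cycle back, the previously exhausted token has usually recovered.
func (p *clientPool) next() *http.Client {
	p.current = (p.current + 1) % len(p.clients)
	return p.clients[p.current]
}
```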

smacker commented 5 years ago

@se7entyse7en do you think it makes sense to analyze the distribution of organizations by the number of entities? If we see that with one token we can actually fetch 50% of organizations without hitting the limit, it would definitely make sense to have parallelization; right now, on the office network, I hit the limit only after ~30 minutes. But if only ~1% of orgs can be fetched this way and the rest will hit the limit anyway, it wouldn't make much sense.

Also we could write a blog post using the data :)

smacker commented 5 years ago

small research: https://gist.github.com/smacker/13452c3c2b839c5bdf31656fa7cb5e0f

se7entyse7en commented 5 years ago

This is really cool! I didn't know about the GHTorrent project 👍.

BTW, thinking more about this issue, I believe parallelization is actually worthwhile independently of the number of tokens and resources. Even if parallelization can lead the application to sleep, it still guarantees that we use the tokens as soon as possible. Running them serially, on the other hand, could leave some tokens unused for a while when there is actually no reason to wait.
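
A sketch of that parallel alternative, with one worker per token draining a shared queue of repositories; all names below are illustrative rather than taken from ghsync or this repository:

```go
// Illustrative sketch only: one worker per token, all draining a shared
// queue of repositories, so every token is used as soon as work exists.
package fetch

import "sync"

func downloadAll(repos []string, tokens []string, download func(token, repo string) error) {
	jobs := make(chan string)
	var wg sync.WaitGroup

	for _, token := range tokens {
		wg.Add(1)
		go func(token string) {
			defer wg.Done()
			for repo := range jobs {
				// download is expected to sleep internally when this token
				// hits its rate limit (see the wait-until-reset sketch above).
				_ = download(token, repo)
			}
		}(token)
	}

	for _, r := range repos {
		jobs <- r
	}
	close(jobs)
	wg.Wait()
}
```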

smola commented 5 years ago

You can check https://github.com/src-d/identity-matching. They handled quite a few corner cases there.

smacker commented 5 years ago

@smola ghsync handles all the cases correctly; we just need to port the logic from there.

se7entyse7en commented 5 years ago

This is related: #18.

> we just need to port the logic from there.

Yup, and we also need to remember that it is a bit different, as we're not using go-github here anymore.
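
Without go-github, the rate-limit check would have to look at the raw HTTP response. A possible sketch, assuming GitHub's `X-RateLimit-Remaining` / `X-RateLimit-Reset` response headers; the helper below is illustrative only, not code from this repository:

```go
// A sketch, assuming GitHub's X-RateLimit-Remaining / X-RateLimit-Reset
// response headers; rateLimited is an illustrative helper, not code from
// this repository.
package client

import (
	"net/http"
	"strconv"
	"time"
)

// rateLimited reports whether the response signals an exhausted rate limit
// and, if so, when the limit resets.
func rateLimited(resp *http.Response) (bool, time.Time) {
	if resp.StatusCode != http.StatusForbidden {
		return false, time.Time{}
	}
	if resp.Header.Get("X-RateLimit-Remaining") != "0" {
		return false, time.Time{}
	}
	secs, err := strconv.ParseInt(resp.Header.Get("X-RateLimit-Reset"), 10, 64)
	if err != nil {
		return false, time.Time{}
	}
	return true, time.Unix(secs, 0)
}
```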

se7entyse7en commented 5 years ago

BTW, I guess we want to handle both rate limit and abuse errors in the same way, right? I mean, this and #18 would be closed together. Or am I missing something?
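
If both kinds of errors are handled the same way, one possible shape (purely a sketch, assuming the `Retry-After` header on abuse responses and the `X-RateLimit-*` headers on primary rate-limit responses) is a single helper that answers "how long should we wait":

```go
// A sketch only, assuming the Retry-After header on abuse responses and the
// X-RateLimit-* headers on primary rate-limit responses; waitDuration is an
// illustrative helper, not code from this repository.
package client

import (
	"net/http"
	"strconv"
	"time"
)

// waitDuration returns how long to sleep before retrying, or zero if the
// response does not indicate any kind of throttling.
func waitDuration(resp *http.Response) time.Duration {
	// Abuse (secondary rate limit) responses carry Retry-After in seconds.
	if s := resp.Header.Get("Retry-After"); s != "" {
		if secs, err := strconv.Atoi(s); err == nil {
			return time.Duration(secs) * time.Second
		}
	}
	// Primary rate limit responses expose the reset time as a Unix timestamp.
	if resp.Header.Get("X-RateLimit-Remaining") == "0" {
		if secs, err := strconv.ParseInt(resp.Header.Get("X-RateLimit-Reset"), 10, 64); err == nil {
			return time.Until(time.Unix(secs, 0))
		}
	}
	return 0
}
```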