
bug: github-search rate limiting #409

Open pnadolny13 opened 1 year ago

pnadolny13 commented 1 year ago

I added PRs and issues streams to the github search EL and we're starting to hit the rate limits, so the Airflow job won't complete. We do a search for taps and targets, and every result gets a bunch of follow-on requests for PRs/Issues/Readme content, so I think our result set is too big to ever complete within the rate limit ranges. Even with incremental loads we still hit every repo every time for updates, and if we have 5k repos then our 5k request limit is used up quickly.

Some options I can think of:

  1. Limit search query further
  2. Add more auth tokens
  3. Split EL jobs somehow
  4. Increase Airflow's retry time to 1 hr
  5. Add a configurable feature to tap-github, like throttle_requests, to stay within the rate limit. If the limit is reached it should sleep instead of hard failing.

Challenges with each:

  1. I can't figure out how to do this. Github's search seems to be very inexact. For example, our non-fork target search criteria brings back https://github.com/andabi/deep-voice-conversion as a top result and includes many taps. I'm only retrieving non-forks for now until this issue is resolved.
  2. The tap accepts a list of auth tokens, which would help, but it's a user-level rate limit so we'd need auth tokens from multiple accounts. I don't know how we'd do that and manage them.
  3. This seems like a hack: we'd expect it to keep failing every run and rely on Airflow's retries to eventually let it finish.
  4. Same as above.
  5. It would work, but then there's wasted compute just hanging around waiting for the rate limit to reset, so that's not ideal. (Rough sketch of the idea below.)
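
For illustration, the throttle_requests idea in option 5 could look roughly like this, using GitHub's rate limit response headers. This is just a sketch, not tap-github's actual code, and the function name is made up:

```python
import time

import requests


def get_with_rate_limit_wait(session: requests.Session, url: str, **kwargs) -> requests.Response:
    """Sketch of the throttle_requests idea: sleep through a rate limit reset
    instead of hard failing. Not tap-github code; the name is hypothetical."""
    resp = session.get(url, **kwargs)
    remaining = int(resp.headers.get("X-RateLimit-Remaining", "1"))
    if resp.status_code == 403 and remaining == 0:
        # GitHub reports when the current window resets as a unix timestamp.
        reset_at = int(resp.headers.get("X-RateLimit-Reset", str(int(time.time()) + 3600)))
        wait_seconds = max(reset_at - int(time.time()), 0) + 1
        time.sleep(wait_seconds)  # this is the "wasted compute" part
        resp = session.get(url, **kwargs)  # retry once after the window resets
    return resp
```

The downside called out above is exactly that time.sleep: the worker can sit idle for close to an hour waiting for the window to roll over.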

@aaronsteers any thoughts on this since you've worked with this tap before and run into similar problems?

aaronsteers commented 1 year ago

@pnadolny13 - We're working with hourly rate limits, correct?

Can we split the streams and queries that we need to run, so that they run in alternating hours?
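
Something like this in meltano.yml, for example. The plugin/stream names, loader, and cron expressions below are just placeholders to show the idea: two inherited extractors selecting different streams, scheduled on alternating hours so each run gets its own rate limit window:

```yaml
plugins:
  extractors:
    # Hypothetical split: one extractor for the repo/search streams,
    # one for the heavier PR/issue follow-on streams.
    - name: tap-github-search-repos
      inherit_from: tap-github
      select:
        - repositories.*
    - name: tap-github-search-activity
      inherit_from: tap-github
      select:
        - issues.*
        - pull_requests.*

schedules:
  - name: github-search-repos
    extractor: tap-github-search-repos
    loader: target-snowflake      # whichever loader the project already uses
    transform: skip
    interval: "0 */2 * * *"       # even hours
  - name: github-search-activity
    extractor: tap-github-search-activity
    loader: target-snowflake
    transform: skip
    interval: "0 1-23/2 * * *"    # odd hours
```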

I think that's the first thing I'd consider, and then (if we think this can finish within 2-3 hours' limits) I'd lean towards trying to add sleep/wait like the throttle_requests approach. The compute cost for 3 hours of execution should be pretty minimal, and is on par with other larger data sources at some companies.

Neither is a silver bullet and I think there may be other good options as well. Maybe a good topic for Data Office Hours?