orhun / git-cliff

A highly customizable Changelog Generator that follows Conventional Commit specifications ⛰️
https://git-cliff.org
Apache License 2.0

API rate limit exceeded #422

Open wookayin opened 6 months ago

wookayin commented 6 months ago

First of all, congratulations and thanks for the great work being done on GitHub integration (#363).

Describe the bug

On a large repository with a lot of commits, the API rate limit is always exceeded, so I can't generate the release notes.

My understanding from #363 is that ALL of the commits in the repository are fetched, so every git-cliff attempt is turned down for excessive API usage. Even if the number of commits to be included in a release note is small (e.g. between two minor releases), git-cliff does not fetch PR information for just those commits; it tries to fetch everything.

Also, it's not clear how many requests are (unknowingly) being made through git-cliff. Can we make it a bit more verbose?
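To put rough numbers on this (a back-of-envelope sketch, not measured git-cliff traffic; the 100-items-per-page cap is GitHub's documented maximum):

```python
import math

# Back-of-envelope estimate (assumed numbers, not git-cliff's actual code):
COMMITS = 28_000   # approximate neovim/neovim commit count as of this report
PER_PAGE = 100     # GitHub REST API's maximum page size

pages = math.ceil(COMMITS / PER_PAGE)
print(pages)  # → 280 paginated requests just to list every commit
```

That many requests in a quick burst is exactly the pattern that trips GitHub's secondary rate limits.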

To reproduce

neovim/neovim has ~28k commits as of today.

git clone https://github.com/neovim/neovim && cd neovim

Apply the patch:

diff --git scripts/cliff.toml scripts/cliff.toml
index 3fc10e5d1..bc1f22c56 100644
--- scripts/cliff.toml
+++ scripts/cliff.toml
@@ -1,5 +1,9 @@
 # configuration file for git-cliff (0.1.0)

+[remote.github]
+owner = "neovim"
+repo = "neovim"
+
 [changelog]
 # changelog header
 header = """
@@ -18,12 +22,12 @@ body = """
     ### {{ group | upper_first }}
     {% for commit in commits%}\
         {% if not commit.scope %}\
-            - {{ commit.message | upper_first }}
+            - {{ commit.message | upper_first }} (#{{ commit.github.pr_number }})
         {% endif %}\
     {% endfor %}\
     {% for group, commits in commits | group_by(attribute="scope") %}\
         {% for commit in commits %}\
-            - **{{commit.scope}}**: {{ commit.message | upper_first }}
+            - **{{commit.scope}}**: {{ commit.message | upper_first }} (#{{ commit.github.pr_number }})
         {% endfor %}\
     {% endfor %}
 {% endfor %}\n

Then run:

git cliff --config scripts/cliff.toml v0.9.4..v0.9.5  --github-token <TOKEN>

Expected behavior

No errors.

Screenshots / Logs

A bunch of API rate limit errors (possibly one log line per page request):

▹▹▸▹▹ Retrieving data from GitHub... (neovim/neovim)

 ERROR git_cliff_core::github    > Request error: {
  "documentation_url": "https://docs.github.com/en/free-pro-team@latest/rest/overview/rate-limits-for-the-rest-api#about-secondary-rate-limits",
  "message": "You have exceeded a secondary rate limit. Please wait a few minutes before you try again. If you reach out to GitHub Support for help, please include the request ID D7CB:1218:1333433:289A206:65902213."
}

Software information

Additional Information

If I may suggest, fetching PR information only for the relevant commits, e.g. via https://api.github.com/repos/neovim/neovim/commits/COMMIT/pulls, would be an alternative, more efficient approach.

https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api
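A minimal sketch of that per-commit lookup (a hypothetical helper, not git-cliff code; the endpoint itself is documented by GitHub):

```python
import json
import urllib.request

API = "https://api.github.com"

def commit_pulls_url(owner: str, repo: str, sha: str) -> str:
    # Documented endpoint: GET /repos/{owner}/{repo}/commits/{commit_sha}/pulls
    return f"{API}/repos/{owner}/{repo}/commits/{sha}/pulls"

def pulls_for_commit(owner: str, repo: str, sha: str, token: str) -> list:
    # One request per commit; returns the PRs associated with that commit.
    req = urllib.request.Request(commit_pulls_url(owner, repo, sha), headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

This costs one request per commit of interest rather than one per page of the whole history.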

welcome[bot] commented 6 months ago

Thanks for opening your first issue at git-cliff! Be sure to follow the issue template! ⛰️

orhun commented 6 months ago

Hello! 🐻

First of all, congratulations and thanks for the great work being done on GitHub integration

Thank you! I'm glad it is being tested already before the release.

Even if the number of commits to be included in a release note is small (e.g. between two minor releases), git-cliff does not fetch PR information for just those commits; it tries to fetch everything.

Yes, I mentioned the reasoning behind this briefly in #363: the problem is generating the list of "new contributors". GitHub doesn't provide an API for easily fetching this, so we need all the data to figure out whether someone has contributed to the repository in the past.

I don't know how to solve this yet. One thing we can do for large repos is to use date filters for the GitHub API requests so that we only request the relevant part of the history. But then you won't be able to generate the list of first-time contributors - you can still list the contributors of the release, though.

Do you think that would be something feasible?
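The date-filter idea could look roughly like this (an illustrative sketch with made-up dates, not the actual implementation; the `since`/`until` parameters are documented on the commits endpoint):

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

def commits_url(owner: str, repo: str, since: datetime, until: datetime) -> str:
    # The commits endpoint accepts ISO 8601 `since`/`until` query parameters,
    # so a release's time window can bound how much history is requested.
    query = urlencode({
        "since": since.isoformat(),
        "until": until.isoformat(),
        "per_page": 100,  # GitHub's maximum page size
    })
    return f"https://api.github.com/repos/{owner}/{repo}/commits?{query}"

# Illustrative window only -- the real tag dates would come from git.
url = commits_url(
    "neovim", "neovim",
    datetime(2023, 10, 1, tzinfo=timezone.utc),
    datetime(2023, 12, 31, tzinfo=timezone.utc),
)
```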

If I may suggest, fetching PR information only for the relevant commits, e.g. via api.github.com/repos/neovim/neovim/commits/COMMIT/pulls, would be an alternative, more efficient approach.

This means fetching the pull requests for each commit, which adds up to a lot of requests as well.

wookayin commented 6 months ago

Thanks for the reply.

For my use cases, I don't need "first-time contributors". It would be great if there were a knob for this; if it is disabled, we might be able to save a lot of unnecessary API calls.

orhun commented 6 months ago

Yes, I agree. I think I can send fewer API requests based on whether the "first-time contributors" variable is used in the template, and document this behavior.

orhun commented 6 months ago

I took a stab at this in #431; however, it doesn't provide a significant speed-up for the neovim repository. Although git-cliff only requests a time frame of the commits/PRs, it still takes a lot of time due to the number of pull requests. Maybe it makes more sense to use the commit/pulls API instead, as you mentioned. But that approach might also take some time in cases where there are not a lot of PRs but many commits.

Any ideas?

Also, it's not clear how many requests are (unknowingly) being made through git-cliff. Can we make it a bit more verbose?

You can see the network requests by passing the `-v` flag while running `git-cliff`.

wookayin commented 6 months ago

Thanks, `-v` is helpful. #431 reduces the number of API calls, but if I understood correctly it still tries to fetch all the PRs in a given time range, right? However, for me it still gives the secondary API limit error at 11-12 pages (which were requested in under a second). I guess this might be a throttling issue?

Fetching pull information for each of the commits can also hit limits, because the number of commits in a big release can be in the thousands, and we probably don't want to make thousands of pulls API calls.
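If the burst rate is the problem, a simple client-side throttle might help (a sketch of the general idea only, not anything git-cliff does today; GitHub's docs attribute secondary rate limits partly to request bursts):

```python
import time

# Minimal client-side throttle sketch (an assumption, not git-cliff code):
# space requests at least `min_interval` seconds apart to avoid bursts.
class Throttle:
    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self.clock = clock                # injectable for testing
        self.sleep = sleep
        self.last = None

    def wait(self):
        # Block until at least `min_interval` has passed since the last call.
        if self.last is not None:
            elapsed = self.clock() - self.last
            if elapsed < self.min_interval:
                self.sleep(self.min_interval - elapsed)
        self.last = self.clock()
```

Calling `wait()` before each page request would slow the 11-12 pages down from sub-second bursts to a steadier pace.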

orhun commented 6 months ago

it still tries to fetch all the PRs in a given time range, right? However, for me it still gives the secondary API limit error at 11-12 pages (which were requested in under a second). I guess this might be a throttling issue?

Yes, it needs to fetch all the PRs in the given time range (and there are a lot 😕). There is probably a throttling issue.

Fetching pull information for each of the commits can be also limited, because the number of commits in a big release can be thousands and we probably don't want to make thousands of pulls API calls.

Exactly. :/