mozilla / agithub

Agnostic Github client API -- An EDSL for connecting to REST servers
MIT License
419 stars · 63 forks

github pagination support #10

Closed JensTimmerman closed 5 years ago

JensTimmerman commented 10 years ago

GitHub returns at most 100 items per request; the rest of the response can be retrieved via the links in the response headers, in particular the link to the next page with the rest of the results. https://developer.github.com/guides/traversing-with-pagination/

It would be nice if there were some general way of getting all results from a call to the GitHub API (or an option to specify this), so that we could get all results instead of only the first X, with no indication in the headers of how to get to the rest.

I was thinking of a PaginatedClient class?

toolforger commented 9 years ago

This would be nice to have, but if you retrieved a large-ish commit log (say, that of the Linux kernel), you'd instantly blow your rate limit. So you have to limit retrieval anyway, and work in chunks; 100 items per chunk sounds reasonable, so GitHub is already giving you what you need.

wpoely86 commented 9 years ago

@toolforger The idea is not to grab all commits in one go but to do it in chunks, like the api suggests. This should however be transparent to the user, hence the PaginatedClient class suggestion.

toolforger commented 9 years ago

Follow the next URL. It's often available in the request body. Where it is not, it is in the Link header of the HTTP response; agithub does not expose it currently, but I happen to be working on exactly that.

wpoely86 commented 9 years ago

Yes, that's the point of this issue. Have agithub do this for us so that it is transparent for the user.

If you require a tester, give a shout :wink:

JensTimmerman commented 9 years ago

@toolforger iirc the problem is that the link isn't in the request body, but in one of the headers, which is not exposed by agithub at the moment. But if you're working on that, great, indeed, please let us know if you need some testing.

toolforger commented 9 years ago

I'll just submit a PR :-)

toolforger commented 9 years ago

Actually the headers are already available through g.getheaders(). Its semantics are questionable (and I'm going to submit a PR for that), but it works.

toolforger commented 9 years ago

Meh. Turns out I can't change the API without breaking all callers, so getheaders it is and remains. So no PR is going to come forth.

BTW doing pagination support would mean multiple HTTP requests, but getheaders would be able to return only the headers from the last request, with no way to recover earlier headers. And these can be quite relevant (e.g. etag and date, needed for saving on the rate limit by caching previous results).

wpoely86 commented 9 years ago

@toolforger can you elaborate a bit more on what the problem is exactly?

So, realistically, the only option is to parse the header ourselves and make the 'next' request.

toolforger commented 9 years ago

The problem is that I wanted to return the headers as an additional tuple element, so we could write status, response, headers = g.rest.path.get(), but this would break existing status, response = g.rest.path.get() code (Python adapts tuples for comparisons, but not for assignments it seems). So... incompatible change, can't be done before 2.0.

The header doesn't need parsing, it's already a list of key-value pairs, which is nice because I can do dict(g.getheaders()) to turn this into something useful (the list form is better if you handle all items anyway and need performance, but that's not my use case). Unfortunately, GitHub seems to return a single links header with next and prev (and maybe more values) all lumped into a single string that still needs parsing. But at least it's just one parse :-) (Update: See https://developer.github.com/v3/#link-header for details; letter case seems to be Link, not link.)
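That one parse is small; a minimal sketch of splitting such a Link header value into a rel-to-URL dict (the header string below is a made-up example in the shape the GitHub docs describe, and parse_link_header is a hypothetical helper, not agithub API):

```python
def parse_link_header(value):
    """Parse an RFC 5988-style Link header into a {rel: url} dict."""
    links = {}
    for part in value.split(","):
        url_part, _, rel_part = part.partition(";")
        url = url_part.strip().lstrip("<").rstrip(">")
        rel = rel_part.strip()
        if rel.startswith('rel="') and rel.endswith('"'):
            links[rel[5:-1]] = url
    return links

# Example value shaped like GitHub's documented Link header:
header = ('<https://api.github.com/user/repos?page=3&per_page=100>; rel="next", '
          '<https://api.github.com/user/repos?page=50&per_page=100>; rel="last"')
links = parse_link_header(header)
# links["next"] is then the URL to request for the following page
```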

jpaugh commented 9 years ago

@toolforger We can start a 2.0 branch anytime. As far as returning the headers with each request: the reason I didn't do it this way was to simplify the case where the headers aren't needed. I'm still ambivalent, but wouldn't mind changing it.

jpaugh commented 9 years ago

Concerning links parsing/Pagination: This is out of scope for the API proper, but putting it in a contrib package or something would be okay. If it requires adding hooks to the main classes, that's fine, too; as long as they don't do anything by default, of course.

AGitHub was designed expressly to be agnostic to the idiosyncrasies of the upstream protocol, without introducing a new fuzz-layer atop it. AGitHub acts just like upstream, whether your upstream is GitHub or Facebook. That is its raison d'être, and its best feature, IMO.

(In fact, the name AGitHub is now something of a misnomer, since it supports other upstream servers equally well; I had planned to separate out the generic parts of it into a new library, but haven't gotten around to it.)

toolforger commented 9 years ago

On accessing headers, I'd say that returning (status, response, headers) is not really a problem for 2.0; callers can still say status, response, _ = g. ... .get() and ignore the headers.

On making the library adaptable, the request results should be passed through a hookable function. (Subclassing would work too, but it's usually better to put adapter logic into a separate response transformer function, it's easier to combine multiple transformers that way.)
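The composability argument can be sketched in a few lines; none of these names exist in agithub, this just shows transformers as plain functions chained together (as opposed to stacking subclasses):

```python
# Hypothetical sketch: adapter logic as composable response transformers.

def compose(*transformers):
    """Chain transformers so each receives the previous one's output."""
    def run(response):
        for t in transformers:
            response = t(response)
        return response
    return run

def unwrap_items(response):
    # e.g. pull the payload out of an envelope object
    return response.get("items", response)

def drop_private(response):
    return [r for r in response if not r.get("private")]

pipeline = compose(unwrap_items, drop_private)
result = pipeline({"items": [{"name": "a"}, {"name": "b", "private": True}]})
```

With subclassing, combining two independent adaptations would require a new class per combination; with functions it is just another compose call.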

On being agnostic, you should probably add a mission statement of what the library does. A library that's agnostic about everything is just HTTP, after all... so, what does it do above and beyond what HTTP offers? One thing that comes to mind is the attribute-to-HTTP-path transformation; there's more code in the class, but I didn't look at that too closely. And maybe defining the library's mission by enumerating its services is too low-level a perspective.

gene1wood commented 6 years ago

Here are some other projects that use agithub and solve pagination in their own code. Dropping links here as possible inspiration for adding this functionality to the GitHub module in agithub.

Though it sounds like what should really be done is not what these examples do (parsing the URLs to extract the page number and passing that as an argument), but rather to follow the hypermedia link relations directly.

hwine commented 6 years ago

This quick hack [1] works nicely for me. But it is a hack! Usage:

```python
for repo in ag_get_all(gh.orgs.github.repos.get):
    print(repo["full_name"])
```

[1] https://gist.github.com/e4384145127a915b96849537bef3b93d
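For context, a helper like that might look roughly as follows. This is a hypothetical sketch, not hwine's actual gist: it assumes a get-callable accepting page/per_page keyword arguments and returning (status, body), and it pages by page number (exactly the approach gene1wood notes is less ideal than following link relations):

```python
def ag_get_all(get, per_page=100):
    """Yield items from every page of a paginated (status, body) getter."""
    page = 1
    while True:
        status, body = get(page=page, per_page=per_page)
        if status != 200 or not body:
            return
        for item in body:
            yield item
        if len(body) < per_page:
            return  # short page: this was the last one
        page += 1

# Stub standing in for something like gh.orgs.github.repos.get:
def fake_get(page=1, per_page=100):
    data = [{"full_name": "org/repo%d" % i} for i in range(250)]
    return 200, data[(page - 1) * per_page : page * per_page]

names = [repo["full_name"] for repo in ag_get_all(fake_get)]
```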

gene1wood commented 6 years ago

I've got a branch now with this functionality and I'm just finishing the unit tests. As soon as it's done I'll put a PR up for it. I've been using the functionality for a few months now.

toolforger commented 5 years ago

I have moved on to other tasks so this doesn't help me anymore, but thanks anyway :-)