Add PEP 503 repository parsing

pfmoore commented 5 years ago

I'm looking at adding a new module, packaging.pep503 that will add utilities for parsing PEP 503 compliant repositories.

My biggest concern with doing this is that it requires a HTML5 parser. I'd plan on doing this by adding a dependency on html5lib, as there's no point in trying to write our own parser, but it's a pretty significant sized dependency (at least, compared to the packaging library itself). Is there any problem with adding html5lib as a dependency? Should we make it a soft dependency, only needed if the PEP 503 module is used, or is that too much complexity to be worth worrying about?

dstufft commented 5 years ago

I would just add it as a dependency. I don't think we need to worry about a conditional dependency and I certainly don't think that we should reinvent the wheel. If it becomes an issue at some point we can look at splitting things up into sub packages then.

One nitpick, I wouldn't name it packaging.pep503, I've avoided naming things after PEPs here largely because the idea is that if we update how we do task X with a new PEP, ideally we just extend the existing API. It would be weird if something named pep503 was actually PEP 503 + PEP XYZ + PEP ABC etc.

Are you planning on just implementing parsing a str in this API, but not doing any sort of I/O or anything? If that's the case, I think I would call it packaging.repositories.parsing or something like that. That we can add in the I/O portions later to get a more full featured repository API. It might also be fine to just call it packaging.repositories and just glom everything into a single module if we ever add I/O.

Then give the "PEP 503 style" API a name, probably "Simple" to go with the fact that (A) it is simple and (B) on PyPI it is under the /simple/ URL (plus that's basically it's defacto name already), but if you have a better idea that's fine too.

Thus you'd end up with something like packaging.repositories:SimpleRepositoryParser.

Or something like that?

pfmoore commented 5 years ago

Thanks, that all sounds reasonable.

To be honest, I have no intention of ever expanding this to do the I/O - quite the opposite, not doing so is a deliberate design choice. It's sort of a "sans IO" idea - I didn't want to get sucked into making assumptions about how tools would want to handle getting the data - apart from the choice of I/O library (requests, or an async library, or something else) there's also questions of caching, error handling, logging, etc. By making the client provide a function that encapsulates "get me a webpage", all of that complexity can be left with the client, where it belongs.

So I think I'd go with packaging.repositories. And a class SimpleRepository, maybe. I'm not sure I like "parser" in the name, I see it more as a repository object than a parser. I'd planned on having an API something like:

repo = SimpleRepository(base_url, getter)

for project_name in repo.projects():
    proj = repo.project(project_name)

    # Not that useful, but it's a placeholder - future PEPs may add extra attributes
    print(proj.name)

    for file in proj.files():
        print(file.url, file.version, file.hash, file.hash_type,
              file.is_signed, file.requires_python)

dstufft commented 5 years ago

Ah sorry, I thought you were working on a smaller scoped API than that. That's actually a good idea, and something I started working on some proof of concepts back in the day.

Here's the problem, your proposed API can't work asynchronously because your getter needs to do I/O, but there's no place for an await or a callback or something to happen. The general goal is great though. I actually did a number of proof of concepts and had a bit of discussion around this idea previously. You can find those at https://github.com/pypa/packaging/pull/87 (discussion here), https://github.com/pypa/packaging/pull/88, and https://github.com/dstufft/packaging/pull/1.

Feel free to use any of those proof of concepts, or to start from scratch if you think they're going down a wrong path! I do recommend at least reading through the discussion there and reading the three implementations to see how they work. One thing that didn't exist when I wrote those is trio, so it would be really good to think about how you might use the API within trio. I don't know how trio + Deferred work together (maybe they work out of the box?) or Effectfully, but the Sans I/O version should work fine with trio I think.

pradyunsg commented 3 years ago

@pfmoore @brettcannon Is this something that you have feelings on?

IIUC, @brettcannon has worked on https://github.com/brettcannon/mousebender -- specifically mousebender.simple -- since the last update in this issue, and maybe we'd want those bits to be brought into packaging?

pfmoore commented 3 years ago

I don't have many feelings on this, TBH. It's trivial to do with a sufficiently powerful HTML parsing library, and reasonably easy even just with the stdlib (see mousebender). So a library is not that important to me. I'm also now of the view that a sans-IO solution is better so it works with asyncio or traditional io.

mousebender.simple is mostly fine in my view - although there is a practical issue for my particular use case in that it assumes valid data, and some PyPI pages fail because of that (data-requires-python not being a valid specifier IIRC), That's a design question, should we accept invalid data at the cost of not being able to provide "higher level" objects in the API? (And no, I haven't raised an issue against mousebender, mostly because I'm not sure the choice they make is wrong).

Would I use a mousebender-style packaging.index.simple library if it existed? Probably, if I needed to parse an index. Am I interested in writing one? Not so much any more. Would I use something higher-level, of the form I described above? Maybe, but to be honest for my uses the JSON API (and the mirroring part of the XMLRPC API) is far more useful in practice.

brettcannon commented 3 years ago

Both @derek-keeler and I are happy to contribute the code in mousebender for parsing Simple Repository pages to packaging. At this point we are just waiting for packaging to drop Python 2.7 support before we start a conversation about any potential API changes.

pypa / packaging

Add PEP 503 repository parsing #155