Open pfmoore opened 5 years ago
I would just add it as a dependency. I don't think we need to worry about a conditional dependency and I certainly don't think that we should reinvent the wheel. If it becomes an issue at some point we can look at splitting things up into sub packages then.
One nitpick, I wouldn't name it packaging.pep503
, I've avoided naming things after PEPs here largely because the idea is that if we update how we do task X with a new PEP, ideally we just extend the existing API. It would be weird if something named pep503
was actually PEP 503 + PEP XYZ + PEP ABC etc.
Are you planning on just implementing parsing a str in this API, but not doing any sort of I/O or anything? If that's the case, I think I would call it packaging.repositories.parsing
or something like that. That we can add in the I/O portions later to get a more full featured repository API. It might also be fine to just call it packaging.repositories
and just glom everything into a single module if we ever add I/O.
Then give the "PEP 503 style" API a name, probably "Simple" to go with the fact that (A) it is simple and (B) on PyPI it is under the /simple/
URL (plus that's basically it's defacto name already), but if you have a better idea that's fine too.
Thus you'd end up with something like packaging.repositories:SimpleRepositoryParser
.
Or something like that?
Thanks, that all sounds reasonable.
To be honest, I have no intention of ever expanding this to do the I/O - quite the opposite, not doing so is a deliberate design choice. It's sort of a "sans IO" idea - I didn't want to get sucked into making assumptions about how tools would want to handle getting the data - apart from the choice of I/O library (requests, or an async library, or something else) there's also questions of caching, error handling, logging, etc. By making the client provide a function that encapsulates "get me a webpage", all of that complexity can be left with the client, where it belongs.
So I think I'd go with packaging.repositories
. And a class SimpleRepository
, maybe. I'm not sure I like "parser" in the name, I see it more as a repository object than a parser. I'd planned on having an API something like:
repo = SimpleRepository(base_url, getter)
for project_name in repo.projects():
proj = repo.project(project_name)
# Not that useful, but it's a placeholder - future PEPs may add extra attributes
print(proj.name)
for file in proj.files():
print(file.url, file.version, file.hash, file.hash_type,
file.is_signed, file.requires_python)
Ah sorry, I thought you were working on a smaller scoped API than that. That's actually a good idea, and something I started working on some proof of concepts back in the day.
Here's the problem, your proposed API can't work asynchronously because your getter needs to do I/O, but there's no place for an await
or a callback or something to happen. The general goal is great though. I actually did a number of proof of concepts and had a bit of discussion around this idea previously. You can find those at https://github.com/pypa/packaging/pull/87 (discussion here), https://github.com/pypa/packaging/pull/88, and https://github.com/dstufft/packaging/pull/1.
Feel free to use any of those proof of concepts, or to start from scratch if you think they're going down a wrong path! I do recommend at least reading through the discussion there and reading the three implementations to see how they work. One thing that didn't exist when I wrote those is trio, so it would be really good to think about how you might use the API within trio. I don't know how trio + Deferred work together (maybe they work out of the box?) or Effectfully, but the Sans I/O version should work fine with trio I think.
@pfmoore @brettcannon Is this something that you have feelings on?
IIUC, @brettcannon has worked on https://github.com/brettcannon/mousebender -- specifically mousebender.simple
-- since the last update in this issue, and maybe we'd want those bits to be brought into packaging
?
I don't have many feelings on this, TBH. It's trivial to do with a sufficiently powerful HTML parsing library, and reasonably easy even just with the stdlib (see mousebender). So a library is not that important to me. I'm also now of the view that a sans-IO solution is better so it works with asyncio or traditional io.
mousebender.simple
is mostly fine in my view - although there is a practical issue for my particular use case in that it assumes valid data, and some PyPI pages fail because of that (data-requires-python
not being a valid specifier IIRC), That's a design question, should we accept invalid data at the cost of not being able to provide "higher level" objects in the API? (And no, I haven't raised an issue against mousebender, mostly because I'm not sure the choice they make is wrong).
Would I use a mousebender-style packaging.index.simple
library if it existed? Probably, if I needed to parse an index. Am I interested in writing one? Not so much any more. Would I use something higher-level, of the form I described above? Maybe, but to be honest for my uses the JSON API (and the mirroring part of the XMLRPC API) is far more useful in practice.
Both @derek-keeler and I are happy to contribute the code in mousebender for parsing Simple Repository pages to packaging
. At this point we are just waiting for packaging
to drop Python 2.7 support before we start a conversation about any potential API changes.
I'm looking at adding a new module,
packaging.pep503
that will add utilities for parsing PEP 503 compliant repositories.My biggest concern with doing this is that it requires a HTML5 parser. I'd plan on doing this by adding a dependency on
html5lib
, as there's no point in trying to write our own parser, but it's a pretty significant sized dependency (at least, compared to the packaging library itself). Is there any problem with addinghtml5lib
as a dependency? Should we make it a soft dependency, only needed if the PEP 503 module is used, or is that too much complexity to be worth worrying about?