Closed encukou closed 8 months ago
Wouldn't this be a bit late for that? We already went through the deprecation period, and removed the feature in the alpha releases, bringing them back now would be a bit confusing.
The
importlib.resources
functions{open,read}_{text,binary}
,path
,is_resource
andcontents
, deprecated in 3.11 and removed in 3.13 alphas, are, anecdotally, missed by quite a few users.
Can you actually show a couple examples of this affecting users downstream? I think that's the most viable argument to bring that API back.
taking multiple path components as positional arguments
Why not just take multiple components with separators in a single argument? It's easy enough to require forward slash, disallow ..
and even to normalise backslashes on Windows if you feel like it.
If they didn't allow subdirectories before (I never noticed, tbh), then presumably using a slash here would have either failed completely or worked. Either way, we can enable them in a new release.
(And add me to the anecdotal list of people who missed them. It's easy enough to add a few lines of code to bring them back, which is how I have been handling it so far, but I'd be happier to have those few lines in the stdlib.)
At least one audience that would like to keep the legacy APIs is in https://github.com/mesonbuild/meson/issues/12401.
I admit, I prefer this approach over keeping the legacy APIs with the cruft that it still had lying around. It adds a mostly-compatible layer and restores these wrappers in a supported way.
On one hand, this approach violates the "preferably one way" to do things; users will need to decide which way works best for them, creating a variety of supported approaches. On the other hand, I do appreciate that it offers a friendlier interface for certain operations (esp. path(...)
).
FFY00 and I put a lot of work into this deprecation process, so it'll be disappointing to now see this reversed at the last minute, but it does feel like the right thing to do, especially since someone else is willing to own the implementation (thanks encukuo!). We will have to backport the change to importlib_resources, but that should be fairly straightforward.
Overall, I'm +0 on the change. I'd really like to see more vocal support from other core devs before committing to this approach.
I've made the encoding argument mandatory for _text
functions when multiple path names are given.
Wouldn't this be a bit late for that?
Yes, sorry. Previously I couldn't commit to supporting this API.
Why not just take multiple components with separators in a single argument?
I'd rather not derail discussion on this issue. Support for separators can be added later if necessary. If they will, allowing multiple arguments will still be useful.
FFY00 and I put a lot of work into this deprecation process
Sorry to hear that. Sunk costs suck :( This makes it seem that implementing the deprecation process was similarly (or more) time-consuming as keeping the API working. That's not a good situation to be in, especially considering all the work users need to put in to update their code.
(And add me to the anecdotal list of people who missed them. It's easy enough to add a few lines of code to bring them back, which is how I have been handling it so far, but I'd be happier to have those few lines in the stdlib.)
I’ll add a “me too” here as well. Being able to do simple things simply is an advantage.
+1 from me on the (updated) proposed API as well as un-deprecating these -- for reasons that have been discussed on the d.p.o thread as well as mentioned here by others.
Their main drawback -- not allowing subdirectories -- can be solved by taking multiple path components as positional arguments, for example:
importlib.resources.read_text('modulename', 'subdirectory', 'subsubdir', 'resource.txt')
I've always sort of wondered why this is a drawback at all, compared to simply doing this:
with importlib.resources.path('modulename.subdirectory.subsubdir', 'resources.txt') as f:
...
I'm not objecting to the new API! It's more ergonomic than pretending everything is a namespace module. But for backwards compatibility with python < 3.13 it seems practical to use the two-argument form, and the lack of a new API doesn't seem like it should have been a killer problem before now.
I don't understand the reason we can't reimplement it as:
def read_text(module, filename, *args, **kwargs): #use proper args if you want here, I just don't know them all off the top of my head
with (path(module) / filename).open("r", *args, **kwargs) as f:
return f.read()
Why do we need the module and filename as multiple args instead of just two?
I'd rather not derail discussion on this issue.
How is it derailing this issue? You're bringing back an API, which I like, and changing the design in a potentially backwards-incompatible way in the process, which I don't. Why is it derailing to ask why it has to have a different design now?
Why do we need the module and filename as multiple args instead of just two?
I'm catching myself up on this. I think the answer is (somewhere) in this thread: https://gitlab.com/python-devs/importlib_resources/-/issues/58
Why do we need the module and filename as multiple args instead of just two?
I'm catching myself up on this. I think the answer is (somewhere) in this thread: https://gitlab.com/python-devs/importlib_resources/-/issues/58
Which was migrated to https://github.com/python/importlib_resources/issues/58.
IIUC:
Traversible.iterdir()
.So there's perhaps four levels of support we could offer for the functional APIs:
traversable.iterdir()
)Personally I'd lean towards option 2. If folks need subdirectory support they can use the OOP API - that's it's whole reason to exist!
Early in the design, I'd deemed it infeasible implement path
for anything that's expecting subdirectories, because what does it mean to get a path
for a directory? And indeed, the replacement, as_file(files(...))
would only allow access to a single file on disk, even though the Traversable API provided access to traverse through a tree of files. Later (and only fairly recently in https://github.com/python/importlib_resources/pull/255), as_file
was expanded to add support for directories (by manifesting a directory and all of its contents in a temporary directory when needed). Until that support was added, the meaning of path()
for a directory would have been broken. I believe that explains why the functional API was intended to be deprecated but only later seems viable to be revived with subdirectory support.
I'm going to backtrack slightly on my support for this. I do find the Traversable
API conceptually complex when all I want to do is "read a file from my package" - the functional API was simpler, and as I said above, I believe that "simple things should be simple to do". So I remain +1 on un-deprecating that API.
The complexities around "add subdirectory support" are where I'm less certain. I like the idea of supporting subdirectories without needing the full Traversable
API, but I'm not entirely happy with the proposal here (specifically the somewhat-weird compromise of making the encoding
parameter mandatory). As an alternative, what's wrong with allowing /
-separated resource names, and translating them into the equivalent Traversable
invocation? This has, by definition, no more security issues than the joinpath
approach, so the problems discussed in the linked issue are addressed at least as well as the Traversable
API addresses them. And it matches what people were reported as expecting to be able to do with the functional API.
So there's perhaps four levels of support we could offer for the functional APIs:
- Deprecate + remove (@jaraco's original plan, already landed)
- Restore old APIs + functionality as it was.
- Restore old APIs, and add support for subdirectories to existing functions (this issue and @encukou's PR)
- Restore old APIs, add add support for subdirectories, including adding new functions where needed (e.g. functional equivalent of traversable.iterdir())
Personally I'd lean towards option 2. If folks need subdirectory support they can use the OOP API - that's it's whole reason to exist!
I would be happy with that.
I will note that option 2 was essentially what a lot of people asked for in the Discourse thread at the time of the original deprecation, and it felt very much to me that positions got entrenched because the comments were possibly too heated. Maybe with the passage of time, and experience of the impact, all we're doing here is confirming that it turned out that the wrong choice was made? Luckily, we still have time to reverse that choice before the functional API actually gets removed.
One issue that was discussed in comments leading up to this comment was that, by adding support for subdirectories, functions like path
and read_text
could be providing access to content outside of the package. The files()
API avoided this issue by leaving it up to the caller to do any traversal. If there are plans for path()
and other calls to accept path parameters, it will re-open those (possibly security-related) concerns.
I don't see how - you shouldn't be accepting untrusted input to the Traversable
API in just the same way as for the functional API. joinpath("..")
is no more secure than a ..
component in a /
-separated path, surely?
I don't see how - you shouldn't be accepting untrusted input to the
Traversable
API in just the same way as for the functional API.joinpath("..")
is no more secure than a..
component in a/
-separated path, surely?
I'm not convinced it's a concern meriting intervention, but it was something that @warsaw raised.
From the linked comment:
I'll assume that the implementation prevents navigating to non-subdirectories; i.e. you can't escape the enclosing resource package directory, or find your way to
/etc
, etc. :)
If I'm right in thinking that files()
returns a pathlib.Path
or zipfile.Path
, then in the former case, nothing prevents users from "escaping" the resource root by calling joinpath('..')
or joinpath('/etc/passwd')
, right?
Maybe, but in cases where that would work, there are easier ways for someone to do it.
Much easier to say that resources shouldn't be looked up using untrusted input. The vast majority of uses will be using static strings here anyway, because arbitrary files do not exist in your own package.
But I believe it uses a different type anyway and only converts to an actual path at the very end, so it may not even be possible to traverse outside of the package. Which makes this entirely a moot point (and would be a great design if so).
I'm pretty sure it produces a traversable that also inherits from Path and can be used as either one (modulo isinstance checks to see whether it's from pathlib or zipfile for cases where the difference matters).
I'd assume that the standard path joining operations use the traversable impl which then checks for attempts to escape the resource root.
Maybe, but in cases where that would work, there are easier ways for someone to do it.
Much easier to say that resources shouldn't be looked up using untrusted input. The vast majority of uses will be using static strings here anyway, because arbitrary files do not exist in your own package.
Right, there's no reason someone couldn't be passing the same inputs to os.path.join and then to builtin open() directly, instead of importlib.resources, so it's basically splitting hairs to say that the resources API needs super special protections for security.
... Also a distraction from the main topic I guess.
@jaraco, do you think allowing path separators is OK? I'm happy to delegate the decision to you.
My opinion, for what it's worth: it is OK.
I don't think we need to normalize slashes or disallow backslashes. I've often wished for a more opinionated resource system that would enforce names that work the same on all platforms (e.g. [a-z0-9]+
segments separated by [-/._]
), as that would eliminate common beginner packaging mistakes. But, importlib.resources
isn't that, and I don't think it should try to be that.
I can see the case for avoiding path traversal attacks (disallowing ..
), though. Tests will catch a platform-specific FileNotFoundError
, but they won't catch a vulnerability.
I prefer allowing multiple *path_parts
, as in pathlib
's joinpath()
, but I'm OK deferring that to a follow-up PR.
But I believe it uses a different type anyway and only converts to an actual path at the very end
In actuality, all importlib.resources.files()
returns in most cases is a pathlib.Path
or a zipfile.Path
. There's no wrapper or additional protections, such as to enforce that only the Traversable
interface is supplied. In fact in python/importilb_resources#291, I explored wrapping the result in something that extended the interface but found it is difficult or impossible to do in general. The Traversable
interface exists mainly for providers to know the minimum interface they must supply, but for stdlib loaders, the Path
classes provide that interface. If we wanted to provide traversal limitations for objects returned from files()
, we'd probably want to augment both pathlib.Path
and zipfile.Path
to supply those limitations and then enable that behavior in files()
.
If I'm right in thinking that
files()
returns apathlib.Path
orzipfile.Path
, then in the former case, nothing prevents users from "escaping" the resource root by callingjoinpath('..')
orjoinpath('/etc/passwd')
, right?
Correct.
@jaraco, do you think allowing path separators is OK?
Yes, I believe it's okay. I believe there's existing precedence for supporting it, in part because 'pathlib.Path' supports it, but also because it's explicitly documented that Traversables should support paths separated by posixpath.sep
(forward slash, platform independent).
I've often wished for a more opinionated resource system that would enforce names that work the same on all platforms (e.g.
[a-z0-9]+
segments separated by[-/._]
), as that would eliminate common beginner packaging mistakes.
importlib.resources
has been somewhat opinionated, as seen above, by supporting one path separator (posixpath.sep
, which often works universally), but unopinionated in that it allows any valid path names or separators. That is, importlib.resources
doesn't want to get into the business of regulating what is a valid resource. If a use-case demands to have names that contain spaces or unicode characters as they're meaningful to the application, they should be allowed to do so.
I prefer allowing multiple
*path_parts
, as inpathlib
'sjoinpath()
, but I'm OK deferring that to a follow-up PR.
Since Traversable.join_path
accepts (*descendants)
, it should be readily supportable. That interface only evolved that behavior somewhat recently, so it may not be available in all Pythons (and thus custom providers may not be reasonably expected to fully support it yet).
One more comment - you may consider contributing this change to importlib_resources simultaneous to or prior to submitting it with CPython as importlib_resources provides a fuller test suite, providing coverage on older Pythons and checks for other aspects like code coverage, type checking, style consistency, and more. Totally your call, though.
Since Traversable.join_path accepts (*descendants), it should be readily supportable.
Oh, that's great! I'll add it to the docs.
The semantics are clear then: I'll delegate to joinpath :)
you may consider contributing this change to importlib_resources
Roger, will send the patch there too.
I've updated the PR to allow path separators.
A PR for importlib_resources
is here: https://github.com/python/importlib_resources/pull/303
Thanks for the discussion everyone, and sorry I couldn't make everyone happy.
I merged the PR. If you disagree with that, please preferably continue the discussion here: https://discuss.python.org/t/11386/47
Apparently the tests fail on big-endian machines; will fix
Feature or enhancement
Proposal:
The
importlib.resources
functions{open,read}_{text,binary}
,path
,is_resource
andcontents
, deprecated in 3.11 and removed in 3.13 alphas, are, anecdotally, missed by quite a few users. They provide a simple API for simple tasks, while the full-featuredTraversable
API is better suited for complex ones -- especially for implementing new resources-aware loaders.I'm now in a position where I can add these functions back and support them.
Their main drawback -- not allowing subdirectories -- can be solved by taking multiple path components as positional arguments, for example:
The additional arguments (encoding and errors) would become keyword-only.
There is a wrinkle in this: in Python 3.9-3.11, the above would mean:
I believe that this is acceptable, since:pragmatically: typical file names do not match typical encoding/errorhandler nameslawyerly: the functions have already been deprecated for 2 releases; no one is using them now, right?However, if this is a problem, I can[edit: This is solved by:]
encoding
argument required if a text-reading function more than one path component is given.Has this already been discussed elsewhere?
I have already discussed this feature proposal on Discourse
Links to previous discussion of this feature:
https://discuss.python.org/t/deprecating-importlib-resources-legacy-api/11386/29
Linked PRs