pypi / stdlib-list

A list of Python Standard Libraries (2.6-7, 3.2-13).
https://pypi.org/project/stdlib-list/
MIT License

`typing.{io,re}` missing in `3.10` #117

Open gmesch opened 6 months ago

gmesch commented 6 months ago

The modules typing.io and typing.re are missing from the 3.10 list. They are listed only for 3.8 and for no other version, even though they exist at least in 3.10, albeit created in an awkward way.

woodruffw commented 6 months ago

Thanks for the report @gmesch! Could you say a bit more about the "awkward way"? This may be something we are able to patch in the module collection script.

gmesch commented 6 months ago

> Could you say a bit more about the "awkward way"?

I only read the 3.10 implementation, in the file python-3.10.14/lib/python3.10/typing.py in the python install tree. There, the submodules re and io are created as classes inside the typing.py file, but are kept out of the __all__ collection of the typing module. This is referred to as "pseudo-submodules" in the comments:

# The pseudo-submodules 're' and 'io' are part of the public
# namespace, but excluded from __all__ because they might stomp on
# legitimate imports of those modules.

Here is the code that creates the io module, and inserts it into sys.modules:

class io:
    """Wrapper namespace for IO generic classes."""

    __all__ = ['IO', 'TextIO', 'BinaryIO']
    IO = IO
    TextIO = TextIO
    BinaryIO = BinaryIO

io.__name__ = __name__ + '.io'
sys.modules[io.__name__] = io

woodruffw commented 6 months ago

Thanks for digging into it!

Hmm, this is indeed pretty awkward -- our generation tooling uses inspect.ismodule to walk through all stdlib packages, and in this case typing.io is really a weird class namespace thing, not a module.

>>> import typing.io
>>> import inspect
>>> inspect.ismodule(typing.io)
False

I'm not 100% sure what to do about that -- io is "behaving" like a module here, but it empirically is not one. This may just be something we have to document as an explicit limitation.
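For illustration, the pattern can be reproduced without relying on typing itself, which is useful because the pseudo-submodules are not present in every Python version. This is only a minimal sketch: demo_pkg and its io class are hypothetical names invented here, not part of the standard library or of stdlib-list.

```python
import importlib
import inspect
import sys
import types

# Hypothetical parent module, standing in for typing.
demo_pkg = types.ModuleType("demo_pkg")
sys.modules["demo_pkg"] = demo_pkg


class io:
    """Wrapper namespace, mimicking the pattern used in typing.py."""

    __all__ = ["IO"]
    IO = object  # placeholder attribute for the sketch


io.__name__ = "demo_pkg.io"
demo_pkg.io = io
sys.modules[io.__name__] = io

# The import system hands back whatever object sits in sys.modules.
pseudo = importlib.import_module("demo_pkg.io")

print(pseudo is io)              # True
print(inspect.ismodule(pseudo))  # False
print(inspect.isclass(pseudo))   # True
```

The import succeeds because the lookup in sys.modules happens before any type check, so a walker that filters on inspect.ismodule will skip an entry like this one.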

miketheman commented 6 months ago

I also poked at this a bit, read some docs, and the best idea I could come up with was to snapshot sys.modules before and after an import, and collect any additions to sys.modules after, but didn't actually write any code to do it - so no clue if that would work either. 😀

Missing typing.io was originally noted in https://github.com/pypi/stdlib-list/issues/7#issue-266050946 and the documentation of policy is still outstanding, per #80

woodruffw commented 6 months ago

Yeah, I should really write that policy 😅. I've got some time today, so I'll send a PR for it in a bit.

> I also poked at this a bit, read some docs, and the best idea I could come up with was to snapshot sys.modules before and after an import, and collect any additions to sys.modules after, but didn't actually write any code to do it - so no clue if that would work either. 😀

I think this would work, although it still wouldn't pass the inspect.ismodule test -- we'd need to loosen the check to "anything that might appear in sys.modules regardless of type", and I don't know enough about Python's module system to know whether this is sensible to do 🙂
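A rough sketch of the snapshot idea, assuming nothing about how the stdlib-list generation script is actually structured. The helper names sys_modules_delta and split_by_kind are hypothetical. The delta keeps every new sys.modules entry regardless of type, and the second helper shows which entries an inspect.ismodule filter would drop.

```python
import importlib
import inspect
import sys


def sys_modules_delta(module_name):
    """Import module_name and return the sys.modules entries it added.

    The result maps each new name to the object registered under it,
    whatever its type. It is empty if nothing new was registered, for
    example because the module had already been imported.
    """
    before = set(sys.modules)
    importlib.import_module(module_name)
    return {name: sys.modules[name] for name in sys.modules.keys() - before}


def split_by_kind(delta):
    """Separate real modules from other objects found in the delta."""
    real = sorted(n for n, obj in delta.items() if inspect.ismodule(obj))
    pseudo = sorted(n for n, obj in delta.items() if not inspect.ismodule(obj))
    return real, pseudo
```

Because the delta is empty for anything already imported, this would probably need a fresh interpreter per module (or per run) to give meaningful results. On an interpreter whose typing.py still defines the pseudo-submodules, the delta for typing would be expected to contain typing.io and typing.re in the non-module group.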

woodruffw commented 6 months ago

@gmesch I'm thinking about ways to address this. One possibility is us adding a new API, something like in_stdlib_namespace, that would essentially boil down to a string prefix check on the input against the list of known stdlib modules. In other words:

>>> in_stdlib("typing.io")
False
>>> in_stdlib_namespace("typing.io")
True

This would make things like typing.io detectable, but with a number of caveats (no guarantee that it's actually a module, no guarantee that it actually exists, etc.). Would this kind of new API satisfy your use cases, or is it too generic?
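As a sketch of what such a check might look like: in_stdlib_namespace is only proposed in this thread, so the signature below (taking the known module names as a parameter) is an assumption, and the known set is stand-in data rather than the real list. Matching on dotted components instead of a raw string prefix avoids treating a name like typingfoo as part of typing.

```python
def in_stdlib_namespace(name, known_modules):
    """Return True if name or any of its dotted ancestors is in known_modules."""
    parts = name.split(".")
    return any(
        ".".join(parts[:i]) in known_modules
        for i in range(1, len(parts) + 1)
    )


# Stand-in for a real module list; assumed data for this sketch.
known = {"typing", "os", "os.path"}

print(in_stdlib_namespace("typing.io", known))   # True
print(in_stdlib_namespace("typing.foo", known))  # True (no existence check)
print(in_stdlib_namespace("typingfoo", known))   # False
```

The second call shows the caveat mentioned above: the check says nothing about whether the name actually exists or who provides it.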

gmesch commented 6 months ago

> Would this kind of new API satisfy your use cases

We use this in a tool that computes the dependencies of python programs on pip packages from the import statements in the python source files. Since every python executable depends on the python interpreter and with it the standard library, an import of anything in the standard library does not imply a dependency on a pip package. (And we in turn use this to keep the deps declarations in bazel BUILD files up to date.)

So if just matching the prefix would be correct, we don't need a new API for that, we can just check all prefixes of an imported module name, in addition to the full module name, using the current API.

However, I think that would be wrong, because I think that a pip package can supply its modules in a namespace package that shares a path prefix with modules in the standard library. I.e. I think it would be legitimate if e.g. a pip package typing-foo supplies code in the namespace typing.foo. If that's true indeed, then the prefix check of a module to determine inclusion in the standard library would be wrong.

gmesch commented 6 months ago

FWIW, the approach to import each file found in the standard library and capture the delta in the sys.modules map before and after import seems promising to me, and is closest to the semantics I am looking for in the use case described above.

I.e. I just want to know whether an import statement can be satisfied against only the python interpreter install tree as it comes from the python distribution, or whether additional pip packages are necessary for such import to work.

gmesch commented 6 months ago

Btw. I also took note of the hint in the documentation,

> you probably don't need this library. See sys.stdlib_module_names and sys.builtin_module_names for similar functionality.

But I could not quite decide which of the two would be right, and I already detected this discrepancy:

Python 3.10.14 (main, Apr 25 2024, 01:03:18) [GCC 9.4.0] on linux
>>> from urllib.parse import urlparse
>>> from stdlib_list import stdlib_list
>>> libs = stdlib_list("3.10")
>>> import sys
>>> 'urllib.parse' in libs
True
>>> 'urllib.parse' in sys.stdlib_module_names
False
>>> 'urllib.parse' in sys.builtin_module_names
False

I double-checked that the file python-3.10.14/lib/python3.10/urllib/parse.py does indeed exist in my python interpreter install tree.

So it did not seem to be straightforward to just use sys.

woodruffw commented 6 months ago

Thanks for the responses @gmesch!

> However, I think that would be wrong, because I think that a pip package can supply its modules in a namespace package that shares a path prefix with modules in the standard library. I.e. I think it would be legitimate if e.g. a pip package typing-foo supplies code in the namespace typing.foo. If that's true indeed, then the prefix check of a module to determine inclusion in the standard library would be wrong.

Yeah, this is unfortunately true:

>>> sys.modules['os.lol'] = object()
>>> from os import lol
>>> lol
<object object at 0x1023f9e90>

This is arguably something that packages should never do, but Python is too dynamic to prevent it. Notably, this also means that any amount of import analysis is always imperfect, since a package can do this:

>>> import sys
>>> sys.modules['os'] = object()
>>> import os
>>> os
<object object at 0x1049305e0>

In other words, you can't (perfectly) infer that a package isn't loaded just because it wasn't imported by a non-stdlib import statement. Ideally Python would forbid this and it should never appear in real code anyways, but I have no evidence to substantiate an assertion that real code doesn't do this 😅

> So it did not seem to be straightforward to just use sys.

Yeah, stdlib_module_names in particular is restricted to top-level names:

> For packages, only the main package is listed: sub-packages and sub-modules are not listed. For example, the email package is listed, but the email.mime sub-package and the email.message sub-module are not listed.

So unfortunately you can't use it for specific sub-packages/modules 🙁 -- it's really just meant for a top-level check.

TL;DR: I'm not aware of a sound way to guarantee that import foo s.t. foo in stdlib ensures that only CPython source code is required. In practice however, I think the namespace inclusion check is correct > 99.999% of the time. But this may not be sufficient for your use case 🙂