python / cpython

The Python programming language
https://www.python.org
Other
62.76k stars 30.08k forks source link

submodule of c-extension module is quirky #87533

Open mattip opened 3 years ago

mattip commented 3 years ago
BPO 43367
Nosy @mattip, @skoslowski, @YannickJadoul
Files
  • test.c: Complete C module to reproduce the issue
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = [] title = 'submodule of c-extension module is quirky' updated_at = user = 'https://github.com/mattip' ``` bugs.python.org fields: ```python activity = actor = 'skoslowski' assignee = 'none' closed = False closed_date = None closer = None components = [] creation = creator = 'mattip' dependencies = [] files = ['49844'] hgrepos = [] issue_num = 43367 keywords = [] message_count = 2.0 messages = ['387917', '387935'] nosy_count = 3.0 nosy_names = ['mattip', 'skoslowski', 'YannickJadoul'] pr_nums = [] priority = 'normal' resolution = None stage = None status = 'open' superseder = None type = None url = 'https://bugs.python.org/issue43367' versions = [] ```

    Linked PRs

    mattip commented 3 years ago

    If I have a module "parent", and I add another module "child" with a method "f" to it:

    child = PyModule_Create(...);
    PyModule_AddObject(parent, "child", child);

    then I can call

    import parent
    parent.child.f()

    but importing like this

    from parent.child import f

    raises a ModuleNotFoundError: ... 'parent' is not a package

    This came up in PyTorch https://github.com/pytorch/pytorch/issues/38137 and in pybind11 https://github.com/pybind/pybind11/issues/2639, and in various other places like stackoverflow https://stackoverflow.com/questions/38454852/importerror-with-error-is-not-a-package

    A complete example is attached

    If this is intentional, it might be nice to emit a warning when calling PyModule_AddObject with a module.

    7ea7bd46-8b17-41e9-b3fb-92551bb0d289 commented 3 years ago

    >> import parent.child

    first imports "parent" (successfully) but then fails, because the import code has no knowledge of were to find ".child". This is because a) the module "parent" is not marked as a package (hence the error message) b) even if it were a package, there is no (ext) module file to locate and load.

    If you instead run

    > from parent import child

    only "parent" is imported, and "child" is retrieved as an attribute - which it actually is. The import code itself will add such an attribute, too [1]. However, that is after the submodule was located and loaded. Attribute lookup on the parent is not part of the submodule import itself.

    A (hacky) work-around would be to add an entry for "parent.child" in sys.modules. This prevents the import machinery from running.

    A (more) proper solution would be to mark "parent" as a package and install some importlib hooks. See [2] for an example from Cython-land.

    Finally there is PyImport_AppendInittab() [3], which could possibly be used to register "parent.child". I have never tried that. Even if this worked, it would be unsupported and probably not without side-effects.

    [1] https://docs.python.org/3/reference/import.html#submodules [2] https://stackoverflow.com/questions/30157363/collapse-multiple-submodules-to-one-cython-extension [3] https://docs.python.org/3/c-api/import.html?highlight=inittab#c.PyImport_AppendInittab

    LostTime76 commented 11 months ago

    Bump

    Can somebody please explain the proper usage of the C API such that we can create actual packages and submodules from C code that are lazy loaded (Py_AppendInittab)?

    Edit:

    After some spelunking in the cpython code to find where the "is not a package" error was being thrown, it appears it was not happy with the fact that the parent package did not have the path dunder attribute set. As a test on the parent, I put a module exec slot on the parent and set the path attribute in the exec function to None. Then I created another module definition for a few sub modules of the parent package and appended them using the PyImport_AppendInittab function and dot notation.

    PyImport_AppendInittab(parent) PyImport_AppendInittab(parent.sub) PyImport_AppendInittab(parent.sub.sub)

    The imports now work in a test python script -> import parent.sub.sub. No python, pure C code embedding python. Neat. Now I can lazy load with sub module support. Not sure if this is intended.. but unless somebody comes in here and says otherwise, I will run with this.

    I guess the recommendation here would be to set an exec slot on every module and set the path dunder to None. Additionally, set other attributes like @mattip's OP to make sure the python accesses work as expected. This can all be done lazily as a python script requests the modules instead of brute force creating all the modules up front. After all, lazy loading (PyImport_AppendInittab) was my goal here.

    ringohoffman commented 4 months ago

    @mattip if I understand the original source of this correctly:

    The issue is really about pickle not being able to serialize attributes of these C-modules, not so much that you can't import these C-modules directly. I am experiencing exactly the same situation, again coming from an issue serializing attributes of PyTorch C-modules.

    Consider a roughly analogous situation:

    import pickle
    import sys
    
    assert pickle.sys.settrace is sys.settrace
    
    import pickle.sys  # ModuleNotFoundError: No module named 'pickle.sys'; 'pickle' is not a package

    I think it is perfectly reasonable for import pickle.sys to raise an error (even if the error message is wrong).

    I think a pure-Python example of what we are seeing with the C-modules is this:

    my_module.py:

    import pickle
    import types
    
    eval_frame = types.ModuleType("eval_frame")
    
    def reset_code():
        ...
    
    reset_code.__module__ = "__main__.eval_frame"
    
    eval_frame.reset_code = reset_code
    
    with open("path.pt", "wb") as f:
        pickle._Pickler(f).dump(pickle.sys.settrace)  # OK
    
    with open("reset_code.pt", "wb") as f:
        pickle._Pickler(f).dump(eval_frame.reset_code)  # _pickle.PicklingError: Can't pickle <function reset_code at 0x7f2383e651f0>: import of module '__main__.eval_frame' failed

    Like pickle.sys, eval_frame is a module but not a package, and so is not importable with import my_module.eval_frame. The main difference here is that pickle doesn't try to import pickle.sys when serializing pickle.sys.settrace because pickle.sys.settrace.__module__ == "sys", and pickle has no trouble importing sys from the global scope.

    For eval_frame (and our C-modules), this is not the case. It is a module with no corresponding globally importable package. my_module.eval_frame.reset_code.__module__ == "my_module.eval_frame", and you can only access eval_frame thru my_module. This would equate to from my_module import eval_frame.

    By tweaking pickle._Pickler.save_global, and pickle._Unpickler.find_class we can achieve this from ... import ... behavior, maintaining backward compatibility while also supporting the above example and serializing C-module attributes:

    class _Pickler:
        ...
    
        def save_global(self, obj, name=None):
            ...
    
            module_name = whichmodule(obj, name)
            try:
                _module_name, *fromlist = module_name.rsplit(".", 1)
                module = __import__(_module_name, fromlist=fromlist, level=0)
                if fromlist:
                    module = getattr(module, *fromlist)
                obj2, parent = _getattribute(module, name)
            except (ImportError, KeyError, AttributeError):
                ...
            ...
    
    class _Unpickler:
        ...
    
        def find_class(self, module, name):
            ...
            module_name, *fromlist = module.rsplit(".", 1)
            module = __import__(module_name, fromlist=fromlist, level=0)
            if fromlist:
                module = getattr(module, fromlist[0])
            if self.proto >= 4:
                return _getattribute(module, name)[0]
            else:
                return getattr(module, name)

    I am eager to solve this problem, and so I welcome feedback on my understanding and proposed solution. Please let me know.

    serhiy-storchaka commented 1 week ago

    Try to set sys.modules['parent.child'] = child after creating the child module. Would it help?

    BTW, it is not recommended to use PyModule_AddObject(), which is broken by design. Use PyModule_AddObjectRef() or PyModule_Add(), depending whether you want to keep a reference to the added object.

    serhiy-storchaka commented 1 week ago

    I guess the PR was closed due to conflicts with recent changes in the pickle module (there was some refactoring), and due to a lack of response from my side.

    Thoughts about this idea:

    On other hand, if setting sys.modules works, we can leave this. There may be the same issue in other serializers -- yaml, xmlrpc.

    I have no strong preference. We may try different approaches and see what will happen.