pytorch / data

A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
BSD 3-Clause "New" or "Revised" License
1.13k stars 152 forks source link

[BE] Add lazy_import #95

Open ejguan opened 3 years ago

ejguan commented 3 years ago

🚀 Feature

Currently we used import in each DataPipe class to make lazy importing happens like https://github.com/pytorch/data/blob/4802a350d1bf954d0785e0f22fd1fcde2deded76/torchdata/datapipes/iter/load/iopath.py#L21-L29

As more potential libraries used in TorchData to support different functionalities, we could add a methods to support lazy import module to global namespace. Then, we don't need to duplicate the import inside each class used the same third-party module.

Features needed:

erip commented 2 years ago

I'm happy to take this up.

It seems like if you return the lazily loaded module from some function, you don't actually need to support as .... For example, you could do something like np = lazy_import("numpy") where you'd normally do this check.

One question about the lazy loading is how lazy we want to get. I understand the intent of this issue to effectively be a way to wrap up the try-except import and cache the module (or parent). If so, I guess the logic would basically be:

def lazy_import(module, error_msg, submodule=None):
    if not module_installed(module, submodule):
        raise ModuleNotFoundError(error_msg)
    # check if module (and submodule if provided) are in the "cache"
    if not module_imported(module, submodule):
         import(module, submodule)
     # return the module (or submodule) to be used in code
     return load_module(module, submodule)

Is this what you had in mind, @ejguan? If so, there are probably some other logistics to address like what to do with parent modules which might be polluting the namespace, etc. If we're only worrying about modules and submodules, I think we could do this all with importlib, though submodules can be a bit hacky at times...

ejguan commented 2 years ago

@erip

Thanks for taking interest in this issue.

I understand the intent of this issue to effectively be a way to wrap up the try-except import and cache the module (or parent)

This would cover partial requirement.

I think the ideal situation would be import module/submodule to global namespace lazily. And, when the methods from such module or submodule are invoked, the actual module is imported. By this mean, we don't need to assign the module to each DataPipe instance self to expose such modules into __iter__ function.

Current:

class XXXDataPipe:
    def __init__(self, ...):
        try:
            import abc
            self._abc = abc
        except:
             raise Error

    def __iter__(self):
        self._abc

Ideally:


class XXXDataPipe:
    def __init__(self, ...):
        abc = lazy_import("abc")  # Inject into global namespace

    def __iter__(self):
        abc.xxx  # I am not sure if mypy is going to allow such behavior
erip commented 2 years ago

I see. The failure to import would result in an immediate error, but the actual import is the lazy part. Is that right? In other words, attribute access on not-installed modules should never happen?

ejguan commented 2 years ago

In other words, attribute access on not-installed modules should never happen?

Yeah, you are right. The Error should be raised at the first place. Otherwise, it should be a bug.

erip commented 2 years ago

Excellent, thanks for the clarification. One idea that's kind of messy is using globals() to populate the module with as_ or fully qualified name. The issue is that it will then be a global everywhere and not just within the file. For instance, in this minimal example

def lazy_import(name, as_=None):
    # Doesn't handle error_msg well yet
    import importlib
    mod = importlib.import_module(name)
    if as_ is not None:
        name = as_
    globals()[name] = mod

class Foo:
    def __init__(self):
        lazy_import("numpy", as_="np")

    def foo(self):
        return np.array([1,2,])

f = Foo()
print(f.foo()) # prints [1, 2] as expected

Using this, I think it might introduce some collisions or a heavily polluted global namespace.

ejguan commented 2 years ago

You are right. Then, let's keep it as the minimum as possible. Not using global for now. If more users requested, we can easily extend it.

erip commented 2 years ago

So you want to stick with the current, but wrap up the try/except logic?

class XXXDataPipe:
    def __init__(self, ...):
        self._abc = lazy_import("abc")

    def __iter__(self):
        self._abc
erip commented 2 years ago

I asked here and it seems like pandas has something similar to what we want. They still need to import where used so it's not terribly different, but it prevents us from needing to stuff modules into self or have globals floating around.

ejguan commented 2 years ago

I asked here and it seems like pandas has something similar to what we want. They still need to import where used so it's not terribly different, but it prevents us from needing to stuff modules into self or have globals floating around.

Sounds great.