Importing a single object from a module results in duplicate imports

fabiosangregorio commented 4 years ago

Hello, I love the project. It is the only one that works well out of the box and with minimal configuration.

Currently, importing a single object from a module results in a duplicate import in the dependency graph:

the parent module
the child module

E.g.:

# in test_services.py
from telereddit.services.services_wrapper import ServicesWrapper

results in the following dependency graph:

I would expect the graph to not present the services node, because I'm not importing it.

In general, all modules (even if not imported directly) are present and generate duplicate imports arrows. Am I using it wrong, is this intended behavior or is this a bug?

thebjorn commented 4 years ago

Hi Fabio and thank you for you interest in pydeps.

You are actually importing services.. let me explain:

Given the following files:

(dev) go|c:\srv\tmp\issue15> tree
.
|-- services
|   |-- __init__.py
|   `-- services_wrapper.py
`-- test_services.py

where __init__.py contains:

(dev) go|c:\srv\tmp\issue15> cat services\__init__.py
print("services/__init__.py")

services_wrapper.py contains:

(dev) go|c:\srv\tmp\issue15> cat services\services_wrapper.py
print("services/services_wrapper.py")

class ServicesWrapper:
    pass

and test_services.py contains:

(dev) go|c:\srv\tmp\issue15> cat test_services.py
from services.services_wrapper import ServicesWrapper

Running the test file, produces:

(dev) go|c:\srv\tmp\issue15> python test_services.py
services/__init__.py
services/services_wrapper.py

showing that the __init__.py file is indeed imported (which is the services module dependency that you're seeing).

fabiosangregorio commented 4 years ago

Ok, it is intended behavior then, thanks for taking the time to explain this.

Is it the best practice to show these dependencies even if every __init__.py is empty?

Also, is there a way to prevent showing these "main" modules (e.g. services)? I tried running it with exclude = *__init__* but the node is still shown.

thebjorn commented 4 years ago

Those are good questions, and I could see a use-case for ignoring __init__.py files that do not import anything...

There is no direct way to exclude __init__.py files (although I suppose you could do some post-processing of the .dot output).

gsakkis commented 4 years ago

I stumbled on this just today and I was about to open a similar issue. :+1: for adding an option (e.g. --source-imports) to produce edges only between the modules/packages that are explicitly specified as imports in the source, not all the packages that are actually imported at runtime. This should happen regardless of the contents of __init__.py. I'd even go a step further and suggest this to be the default behavior as it generates more compact graphs, and perhaps have the current behaviour be enabled with an option (e.g. --runtime-imports).

Great project btw!

thebjorn commented 4 years ago

@gsakkis thanks for the input.

pydeps is internally using a (modified) version of the standard-library module modulefinder which looks for import opcodes in they Python byte code - ie. it neither looks at runtime (e.g. by trying to import the module and look at sys.modules or install an import-hook) nor the source code. This brings with it some disadvantages (we don't catch especially dynamic imports) and some advantages (we catch both branches of a python-version switch).

Skipping __init__.py files can have unintended consequences, and I believe the current flow comes from fixing a problem with the collections module being excluded (cf. https://github.com/thebjorn/pydeps/commit/1ab3837be41c87f6af055c56188b5a528c5e587a).

I'm pro both configurability and sensible defaults, but I'm not convinced that this behavior (at least not in a simplistic version) is a sensible default.

Removing modules that do not import anything - and have submodules (perhaps provided the removal doesn't disconnect the graph) would potentially be an interesting flag...

gsakkis commented 4 years ago

@thebjorn an optional flag for the proposed behavior (either on or off by default) would be great!

andreapesare commented 1 year ago

Hi! Any update on this? @thebjorn

thebjorn commented 1 year ago

@andreapesare The problem is as follows... given the code above, the following graph is produced:

and the output of --show-deps is:

(dev37) go|c:\srv\tmp\issue57> pydeps test_services.py --show-deps
{
    "__main__": {
        "bacon": 0,
        "imports": [
            "test_services"
        ],
        "name": "__main__",
        "path": null
    },
    "services": {
        "bacon": 2,
        "imported_by": [
            "test_services"
        ],
        "name": "services",
        "path": "c:\\srv\\tmp\\issue57\\services\\__init__.py"
    },
    "services.services_wrapper": {
        "bacon": 2,
        "imported_by": [
            "test_services"
        ],
        "name": "services.services_wrapper",
        "path": "c:\\srv\\tmp\\issue57\\services\\services_wrapper.py"
    },
    "test_services": {
        "bacon": 1,
        "imported_by": [
            "__main__"
        ],
        "imports": [
            "services",
            "services.services_wrapper"
        ],
        "name": "test_services",
        "path": "c:\\srv\\tmp\\issue57\\test_services.py"
    }
}

and the simple solution would be to remove any nodes that (i) have a path ending in __init__.py and (ii) no `imports" key. __init__.py files can have code in them however, so it's not immediately obvious to me that it would be correct to exclude them. Can we just check if the __init__.py file is empty? Certainly, but they can also contain docstrings, comments, constants, etc., so a simple open(fname).read() == "" isn't enough...

Maybe a --prune=emptyish-init-files or similar..?

PRs are always welcome ;-)

gsakkis commented 1 year ago

As I mentioned before, the empty(ish)ness of __init__.py is not the issue and an option that addresses just that would be a hack at best. The issue is that

import services.services_wrapper

and

import services
import services.services_wrapper

create the same graph. My proposed solution would be a (say) --source-imports flag that

distinguishes these two pieces of code by generating a different dependency graph for each one and
produces the same output for each code regardless of whether services/__init__.py is empty(ish) or not.

Basically this option would answer the question "what module(s) does test_services imports directly".

thebjorn commented 1 year ago

Well... In both cases services is the only name that is directly imported:

def tst():
    import services
    import services.services_wrapper

import dis
dis.dis(tst)

outputs:

  2           0 LOAD_CONST               1 (0)
              2 LOAD_CONST               0 (None)
              4 IMPORT_NAME              0 (services)
              6 STORE_FAST               0 (services)

  3           8 LOAD_CONST               1 (0)
             10 LOAD_CONST               0 (None)
             12 IMPORT_NAME              1 (services.services_wrapper)
             14 STORE_FAST               0 (services)
             16 LOAD_CONST               0 (None)
             18 RETURN_VALUE

the second import just overwrites the name services in your namespace.

thebjorn commented 1 year ago

@gsakkis I suppose an ast-based approach could work, based on very limited data:

node = ast.parse("""
import services
import services.services_wrapper
""")

imports = [n for n in node.body if isinstance(n, ast.Import)]
names = set()
for imp in imports:
    for alias in imp.names:
        names.add(alias.name)

print(names)

output: {'services.services_wrapper', 'services'}

node = ast.parse("""
import services.services_wrapper
""")

output: {'services.services_wrapper'}

and

node = ast.parse("""
import services
""")

output: {'services'}

Not sure what a sensible approach would be to incorporate this in pydeps though, without parsing the code in the entire dependency graph...

thebjorn commented 1 year ago

A more robust finder might be:

def find_imports(fname):
    names = set()

    with open(fname) as fp:
        txt = fp.read()

    body = ast.parse(txt).body

    for node in body:
        if isinstance(node, ast.Import):
            for alias in node.names:
                names.add(alias.name)
        if isinstance(node, ast.ImportFrom):
            module = node.module
            for alias in node.names:
                names.add(f'{module}.{alias.name}')

    return names

but there seems to be a lot of cases to cover (besides the fact that this only looks at the top level of the source file), e.g.:

print('-'*20)
print("""import services
import services.services_wrapper""")
print(ast.dump(ast.parse("""
import services
import services.services_wrapper
""")))

print('-'*20)
print("from services import services_wrapper")
print(ast.dump(ast.parse("""
from services import services_wrapper
""")))

print('-'*20)
print("""from services.services_wrapper import ServicesWrapper""")
print(ast.dump(ast.parse("""
from services.services_wrapper import ServicesWrapper
""")))

print('-'*20)
print("""from .services.services_wrapper import ServicesWrapper""")
print(ast.dump(ast.parse("""
from .services.services_wrapper import ServicesWrapper
""")))
print('-'*20)

with output:

--------------------
import services
import services.services_wrapper
Module(body=[Import(names=[alias(name='services', asname=None)]), Import(names=[alias(name='services.services_wrapper', asname=None)])])
--------------------
from services import services_wrapper
Module(body=[ImportFrom(module='services', names=[alias(name='services_wrapper', asname=None)], level=0)])
--------------------
from services.services_wrapper import ServicesWrapper
Module(body=[ImportFrom(module='services.services_wrapper', names=[alias(name='ServicesWrapper', asname=None)], level=0)])
--------------------
from .services.services_wrapper import ServicesWrapper
Module(body=[ImportFrom(module='services.services_wrapper', names=[alias(name='ServicesWrapper', asname=None)], level=1)])
--------------------

gsakkis commented 1 year ago

@thebjorn indeed an ast-based approach is closer in what I (and I'm guessing the OP) would expect from a module dependency analyzer. Unfortunately, a pure ast-based analyzer would not be sufficient for all cases as you show because it cannot distinguish from some_package import some_submodule_or_subpackage from from some_package_or_module import some_class_or_function In the first case the dependency is after the import (some_submodule_or_subpackage), in the second it's before (some_package_or_module). I personally think it's an unfortunate design decision that Python has two (almost) equivalent ways to import modules (import package.module as module and from package import module) but it does, so that's what we have to work with.

thebjorn commented 1 year ago

@gsakkis pydeps already knows which imports are "files", so that particular issue is solvable. pydeps is pretty fast(*) because it uses modulefinder. An ast-based approach would have to build tree-structures for all python sources which would be much more work.

Also, for larger projects, the main goal is to reduce the number of nodes since it is impossible to work with the resulting graphs as-is. E.g. this is about 2% of the django graph (pydeps django --cluster):

IOW, adding/removing nodes for this use-case, or spending the time analyzing for it, would be pretty much wasted.

Perhaps if the graph was small enough there could be a second ast-pass that removed nodes that were artifacts of the Python module import system... just thinking out loud here. I don't think there is any way to filter "interesting" nodes, since we're no longer just talking about __init__ files.

FWIW, the case that causes the __init__ files to be listed is:

from collections import OrderedDict

which is implemented entirely in the __init__ file and was (in an earlier version) not included in the graph.

(*) graphviz is slow though, e.g. on django, pydeps analyzes the code in less than 10 seconds on my machine, but it takes several minutes to generate the svg...

exterm commented 1 year ago

@thebjorn I'm trying to use pydeps for network analysis of software architecture. Thank you for building and maintaining it. It's a great tool. I parse the output and load it into networkx for further analysis.

adding/removing nodes for this use-case, or spending the time analyzing for it, would be pretty much wasted.

I'd like to argue that adding or removing nodes matters even in large graphs, since it changes the structure on the micro level, which is relevant to metrics such as clustering coefficients and modularity. And swaths of other standard network metrics.

Specifically, the behavior described in this issue increases clustering considerably by adding more nodes that have common neighbors with existing nodes.

I might end up writing a script that takes pydeps output and checks it against the AST. But of course it would be preferable if pydeps' output reflected the graph expressed by the author of the code, which, as far as I can tell, is not currently the case (but I'm not a python expert).

exterm commented 1 year ago

OK - did some research, and apparently pydeps correctly mirrrors python's documented behavior here. However, most python devs don't seem to know this and inadvertently create lots of circular imports by putting submodule imports into init files.

Not a problem with pydeps then.

thebjorn / pydeps

Importing a single object from a module results in duplicate imports #57