glato / emerge

Emerge is a browser-based interactive codebase and dependency visualization tool for many different programming languages. It supports some basic code quality and graph metrics and provides a simple and intuitive way to explore and analyze a codebase by using graph structures.
MIT License
783 stars 46 forks source link

Possibly fixes #29 #30

Closed glato closed 1 year ago

glato commented 1 year ago

@and3rson This possibly fixes #29 in checking if the dependency can be found in sys.modules and if so - setting global_import to True. This seems to solve the reproducible example you've described above (see my screenshot). Also tried it on a bigger codebase (Django), as one would assume it reduces nodes and edges.

CleanShot 2022-10-19 at 20 23 29@2x
and3rson commented 1 year ago

@glato thanks for an update!

I did some testing and it fixes some imports for built-in modules. However there's still some difference in how modules' "global" status is identified - for example, yaml is considered global but pkg_resources is not. I believe it's due to emerge using sys.modules from its own environment, which is different than the environment of the analyzed project. So I assume emerge itself imported yaml at some point, which added it to sys.modules.

Detecting whether something is a global module is definitely a really tricky task. But I have one more idea: what if instead of sys.modules we used a different way to check which modules are available? Quick googling gave me two suggestions:

Those commands should ideally return modules that are installed in current environment, but the downside is that they seem to return package names (like python-louvain or aws-lambda-typing), not actual importable names (like python_louvain or aws_lambda_typing). A cheap trick might be to simply replace dashes with underscores... But I'm still not sure how accurate the output of these commands would be. The list itself should be accurate though, especially when emerge is installed in the same virtualenv as the analyzed project, since pkg_resources & pip would analyze installed modules from this virtualenv. Finally, due to Python's dynamic nature we'll likely never get 100% accuracy here anyway (because people like to do weird things like sys.path.append('../../../far/away') or sys.modules['woot'] = 1337), so it seems like the only way to nail this is to make it overridable (see next "Update" section). Let me know what you think!

Update: we could also add some sort of local_modules / global_modules parameter to emerge's config file to explicitly define fully-qualified names of what modules are global or local instead of automating this, or this could be an addition to our automatic algorithm which will allow users to fine-tune the false-positives.

glato commented 1 year ago

@and3rson Thanks for the testing, the research and your valuable approaches 👍. I could imagine that a good solution would be to combine automatic detection and the option of overwriting a subset of the detection or extending it by configuration. just as you described in your Update section, this would allow users to fine-tune and provide greater flexibility. Will try to come up with an updated PR in the following days 🚀.

glato commented 1 year ago

@and3rson Sorry for the small delay, after some additional research and testing I did the following: pyparser initially starts to autodetect and build a set of "global" imports (self.create_autodetect_set()) that should not be resolved further (but treated as global). This is implemented as a combined approach that includes collecting metadata (if found) by using pkg_resources, adding processed pip/freeze dependencies as a second part and as a last step adding some filtered modules from sys.modules to also include basic built-in modules e.g. os or sys. After generating this set, the auto detection tries to setup a final assumption if the dependency should be considered to be resolved or not by first checking the auto detection set from the previous steps and then considering if one of the override options from the configuration is in place (all happens in dependency_is_global(...)). So as proposed above - the user has the power of auto detection but still can configure everything in a fine-granular way by using the following config option:

...
override_resolve_dependencies:
- some_dependency_that_must_be_resolved
- another_dependency
override_do_not_resolve_dependencies:
- os
- sys
- another_built_in_dep_that_was_missed_by_the_auto_detection
...

I also merged your last changes from dev.

I tested this by running some scans on the Django source, would love additional feedback if you can find some time for testing 👍.

and3rson commented 1 year ago

Hi, thank you so much for your work!

I've did some testing and found another caveat: seems like imports like from lib import xxx and from lib.xxx import yyy are treated slightly differently. However I was able to fix that by using the override_do_not_resolve_dependencies that you've provided, which is awesome! Here's an example project where this reproduces:

# sample/services/__init__.py
from starlette.requests import Request
from sample.utils import add_numbers

# sample/utils/__init__.py
from starlette.requests import Request

The graph is as follows: image

However once I've added this to the config, emerge was able to properly determine this as a global import:

analyses:
- analysis_name: py check
  source_directory: ./sample/
  override_do_not_resolve_dependencies:
  - starlette.requests

...and the graph now properly shows starlette as a global package, yay! image

I think this might be related to dependency_is_global checking for an entire string (starlette.requests) which is absent in global_depenency_autodetect_set (because it only contains starlette). Same happens for things like from google.cloud.logging_v2.handlers import .... I think it should be possible to simply do dependency = dependency.partition('.')[0] at the start of dependency_is_global. But good news is that even without this fix, this issue still can be remediated by using override_do_not_resolve_dependencies!

Oh, and one more very minor issue: there's a typo in global_depenency_autodetect_set (depenDency is missing "D"). :)

UPDATE 1: I've tried adding dependency = dependency.partition('.')[0] myself, and dependency_is_global now returns True for starlette.requests, ~however this is not reflected on the output - the renderer graph still shows it as "samples/starlette/requests"~ this fix works and the graph is properly showing the import as starlette.requests (without sample/ prefix), so I can now remove starlette.requests from override_do_not_resolve_dependencies!

UPDATE 2: I've just ran emerge with my small fix (.partition('.')[0]) on a large codebase and every single global dependency resolved properly!

UPDATE 3: I was sometimes getting duplicate prefixes for the analyzed application package itself (e. g. sample/sample/xxx/yyy/zzz), an easy fix for this was to add the application package name to override_do_not_resolve_dependencies:

override_do_not_resolve_dependencies:
- sample

UPDATE 4: Now that I added most of third-party libraries to ignore_dependencies_matching / ignore_dependencies_containing and the list started growing (and basically becoming a mirror of requirements.txt), I've started wondering if it would be possible to add some ignore_global_dependencies bool flag to automatically exclude all global dependencies from the graph. What do you think? (Still, I think that's outside of the scope of this PR, so I'm sorry if I'm bringing too much noise!)

glato commented 1 year ago

@and3rson Big thanks for your extensive testing, your clever solution proposals and ideas 🙂. Just fixed the small typo in global_depenency_autodetect_set. As this PR already has some nice improvements, I will merge that directly to add value.

I really like your idea about ignoring all global dependencies (if autodetected of configured by override). That would declutter the graph from 3rd party dependencies and increase the focus on the inherent complexity of a project. Very interesting. I will see if I can implement and add this feature for an upcoming release 👍.

Could you maybe create a small PR with your .partition('.')[0] solution? It sounds like this would be another great contribution we shouldn't miss 🚀.