pypa / setuptools

Official project repository for the Setuptools build system
https://pypi.org/project/setuptools/
MIT License

find_packages() doesn't find PEP 420 packages #97

Closed ghost closed 5 years ago

ghost commented 10 years ago

Originally reported by: gwideman (Bitbucket: gwideman, GitHub: gwideman)


#!python

setup(...,
    packages=find_packages(...),
)

on the developer machine will fail to find packages that lack an __init__.py file, as is allowed in Python 3.3. However, such packages, when listed explicitly (packages=['mypkg']), do appear to get included and later installed.

Note: When testing this, be sure to delete all generated metadata before each test, including any metadata that setup may have placed in the original source directory on a previous run, since setup appears to reuse previously generated metadata when deciding which files to include.

This is part of a general problem reported in issue #83, but I've logged it separately as it's specifically about setuptools.
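The failure mode is easy to reproduce with a self-contained sketch of the classic rule that find_packages() applies (find_regular_packages below is a hypothetical stand-in for illustration, not the setuptools implementation):

```python
import os
import tempfile

def find_regular_packages(where):
    """Mimic the classic find_packages() rule: a directory counts as a
    package only if it contains __init__.py."""
    packages = []
    for root, dirs, files in os.walk(where):
        for d in list(dirs):
            path = os.path.join(root, d)
            if os.path.isfile(os.path.join(path, '__init__.py')):
                rel = os.path.relpath(path, where)
                packages.append(rel.replace(os.sep, '.'))
            else:
                dirs.remove(d)  # don't descend into non-packages
    return sorted(packages)

tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'regular'))
open(os.path.join(tmp, 'regular', '__init__.py'), 'w').close()
os.makedirs(os.path.join(tmp, 'nspkg'))                   # PEP 420 style:
open(os.path.join(tmp, 'nspkg', 'mod.py'), 'w').close()   # no __init__.py
found = find_regular_packages(tmp)
print(found)  # ['regular'] -- the namespace package is missed
```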


ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Removed notice about PEP 420 package finder. Discussion in the ticket (Ref #97) indicates that the implementation is currently inadequate.

ghost commented 10 years ago

Original comment by gwideman (Bitbucket: gwideman, GitHub: gwideman):


@pje

"the net effect of having a namespace package inside a non-namespace package was to make the subpackage a non-namespace package"

OK, but then there seems to be no way to specify, either to Python, or to package-user programmers, that a subdir is not intended to be a package.

"an update to the PEP is advisable"

I have been leaning that way for some time, but as a non-luminary I don't carry much weight :-).

If this comes up for more deliberation, I do advocate that the overall package behavior be not just deterministic and implementable, but also comprehensible and explicable as a whole. That is, the __init__.py and non-__init__.py behaviors should be described as a coherent whole, and not as two distinct things whose interactions and side effects are left to the imagination.

Nick Coghlan proposed an "explanation requirement" with respect to directly executing a module from a package. I think it applies to the whole package concept. https://mail.python.org/pipermail/import-sig/2012-March/000423.html

ghost commented 10 years ago

Original comment by pje (Bitbucket: pje, GitHub: pje):


It was once briefly discussed in relation to PEP 402 (420's predecessor) that the net effect of having a namespace package inside a non-namespace package was to make the subpackage a non-namespace package, since its path is limited to one item (the subdirectory of the parent). But neither PEP specs this explicitly, it's just implied. There didn't seem to be any reason to restrict this.

Another issue that's come up recently on distutils-sig is that nspkg.pth support is broken for PEP 420 currently -- the magic really shouldn't be running if PEP 420 support is available.

I'm beginning to think that an update to the PEP is advisable, to both specify how to detect support, and spelling out package detection heuristics in a better way. At any rate, the discussion needs to expand beyond the ticket tracker, probably to import-sig and distutils-sig.

ghost commented 10 years ago

Original comment by gwideman (Bitbucket: gwideman, GitHub: gwideman):


Could someone clarify whether there was explicit thinking/spec regarding whether PEP 420 package rules apply within a package that has an __init__.py file? From Wyatt's earlier comment and blog post, I believe he thinks that PEP 420 applies even within an __init__.py-marked directory tree, at least as far as import's behavior goes. He finds that surprising (as do I), and proposed heuristics that disable that behavior. Was there some deliberation on that we can read about? I'd rather be persuaded that it's a good thing than whinge about it in ignorance :-).

ghost commented 10 years ago

Original comment by pje (Bitbucket: pje, GitHub: pje):


Heh. If you think this is hard in Python, you should check out JavaScript's module systems, or lacks thereof. ;-)

But no, essentially, there is no way to be unambiguous without adding __init__.py, and it's still recommended that you do, unless you're using namespace packages.

And at the moment, setup.py and distribution metadata are supposed to catalog the existence of same. So that is the only other way to know: use .egg-info or .dist-info. Any packages that don't use __init__.py and aren't declared in distribution info can't be picked up unambiguously.
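As a concrete illustration of "use .egg-info" above: top_level.txt is the file setuptools writes into .egg-info directories to record a distribution's top-level names, so a tool can consult it instead of guessing from the file system. A minimal sketch (the demo.egg-info layout here is fabricated for the example):

```python
import os
import tempfile

def declared_top_level(egg_info_dir):
    """Read the top-level package names recorded in distribution
    metadata (top_level.txt, as written by setuptools)."""
    path = os.path.join(egg_info_dir, 'top_level.txt')
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Simulate a project that declared an __init__-less package at build time.
tmp = tempfile.mkdtemp()
egg_info = os.path.join(tmp, 'demo.egg-info')
os.makedirs(egg_info)
with open(os.path.join(egg_info, 'top_level.txt'), 'w') as f:
    f.write('mypkg\n')
names = declared_top_level(egg_info)
print(names)  # ['mypkg']
```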

ghost commented 10 years ago

Original comment by gwideman (Bitbucket: gwideman, GitHub: gwideman):


"that there is no a priori way to tell that an implicit namespace package is a namespace package from the file system alone"

That's the conclusion I've come to as well.

I know a lot of smart people put a lot of thought into this. I've read some of the reams of discussion on import-sig (notably including Nick Coghlan's https://mail.python.org/pipermail/import-sig/2012-March/000423.html and subsequent).

But I am puzzled how the package concept is viable if it lacks a way for packages to specify (explicitly or implicitly) their own structure unambiguously. I didn't reach any explanation of how everything that interacts with packages (programmers, IDEs, tools) is intended to work with packages that lack a way to unambiguously specify their structure. Is there one?

ghost commented 10 years ago

Original comment by pje (Bitbucket: pje, GitHub: pje):


A PEP 420 package is a namespace package - that's the whole concept behind PEP 420.

One thing that was discussed in relation to PEP 420 and its predecessors on python-dev and import-sig, is that there is no a priori way to tell that an implicit namespace package is a namespace package from the file system alone. If you have a directory tree ./foo/bar/baz/spam.py, where ./foo and ./foo/bar don't have any importable files, then it's possible that foo.bar.baz and its parents are all namespace packages. So you'd have to scan all possible subdirectories to unlimited depth (while detecting symlink loops) to correctly "detect" all possible namespace packages... and then you can have false positives.

In the presence of this ambiguity, the tools should refuse the temptation to guess.

(Also, if you do make find_packages() guess, you'll have a counter-problem, where Distribution will have to find out on its own that those packages are namespace packages -- my guess is that they won't be handled properly in all installation cases if you don't. In other words, guessing will be required on both sides of the fence!)
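The unlimited-depth scan pje describes can be sketched to show the false-positive problem (a hypothetical helper for illustration, not proposed setuptools code):

```python
import os
import tempfile

def possible_namespace_packages(where):
    """Without __init__.py markers, *every* subdirectory is a potential
    namespace package, so a scan must recurse to unlimited depth and
    still yields false positives."""
    candidates = []
    for root, dirs, files in os.walk(where):
        for d in dirs:
            rel = os.path.relpath(os.path.join(root, d), where)
            candidates.append(rel.replace(os.sep, '.'))
    return sorted(candidates)

tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'foo', 'bar', 'baz'))
open(os.path.join(tmp, 'foo', 'bar', 'baz', 'spam.py'), 'w').close()
os.makedirs(os.path.join(tmp, 'docs'))  # not a package, but indistinguishable
found = possible_namespace_packages(tmp)
print(found)  # 'docs' appears alongside 'foo.bar.baz': a false positive
```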

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


@pje excellent suggestion to use namespace_packages directly.

I do think, however, that PEP420 doesn't just apply to namespace packages but to any package, in which case the use of namespace_packages only serves as another improvement in the heuristic.

ghost commented 10 years ago

Original comment by pje (Bitbucket: pje, GitHub: pje):


I forgot to mention/may not have made this clear: currently, namespace packages must be explicitly listed in the namespace_packages argument to setup(), which means there is no real reason from an end-user perspective to have find_packages() detect them. That's why it would be simplest to change setuptools.dist:check_nsp to automatically add the namespace packages to the list of packages, instead of giving an error -- at that point, nobody should even notice that find_packages() isn't finding the packages in question. (Though the docs should still be updated to clarify that any __init__-less packages must still be manually listed in namespace_packages.)

ghost commented 10 years ago

Original comment by pje (Bitbucket: pje, GitHub: pje):


The easiest way to fix this problem is to simply auto-add namespace_packages to the packages list in the Distribution object. find_packages should not be used for this, because of the possibility of false positives.

In other words, if you want to include namespace packages, list them in namespace_packages, and don't list them in the regular packages. This means that find_packages should only find traditional packages, not namespace packages.

The explicit listing of namespace_packages needs to happen for backward compatibility anyway, and setuptools already installs those without __init__.py on older Pythons.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Case in point is the recent refactorings I made to the find_packages function.

I wanted to be rid of the 'stack' concept (which mutated a list for the purposes of iteration) and the 'out list' (which initialized an empty list and populated it incrementally for iteration). I wanted instead something that leveraged clean iteration over the essential objects.

Once I did that, though, it became obvious why an include_ns_packages parameter in find_packages would significantly complicate the implementation.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Especially when I see this:

#!python
def do_something(switch=False):
    if switch:
        do_one_thing()
    else:
        do_another()

Of course, it's usually more complicated than that, but that function could be re-written to just have the caller call the function he cares about.

More importantly, a lesson I've taken from cleanroom development, functional programming, and test-driven design is that functions should be small, isolated, and with as few branches and side-effects as possible. Adding a boolean flag that creates an internal branch ties the external interface to a branch in the internal behavior. This means that to fully comprehend the meaning of the flag, one must comprehend the entire function and all of the states it might be in.

Proper variable naming can help indicate intent, but it won't overcome the increasing cognitive burden of the behavior.

This becomes increasingly apparent when one attempts to refactor such a function only to find that the single boolean parameter must be passed unused to reach its inner function.

I realize my explanation probably sounds esoteric, but it stems from my attempt to explain what seems intuitive to me after reading books like Refactoring.
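The alternative jaraco describes elsewhere in this thread (accepting/passing actual functions rather than a boolean) might look like this; the function names are illustrative only:

```python
def do_one_thing():
    return 'one'

def do_another():
    return 'another'

# Instead of do_something(switch=True), let the caller pass the
# behavior directly as a callable; the default preserves the old
# switch=False behavior.
def do_something(action=do_another):
    return action()

result_default = do_something()            # 'another'
result_selected = do_something(do_one_thing)  # 'one'
```

The call site now names the behavior it wants, so there is no internal branch for a reader to trace.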

ghost commented 10 years ago

Original comment by gwideman (Bitbucket: gwideman, GitHub: gwideman):


@"Jason R. Coombs":

In general, I avoid parameters, especially simple booleans, that select behavior

I don't have a preference, but I'm curious what is the rationale behind your avoidance?

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


In general, I avoid parameters, especially simple booleans, that select behavior. I'd prefer accepting/passing strings or actual functions to select varying behaviors.

But even if one assumes a keyword argument as Wyatt suggests, I'd be inclined to have an alias where the default is True.

ghost commented 10 years ago

Original comment by wyatt (Bitbucket: wyatt, GitHub: wyatt):


I was just thinking about the idea of adding a keyword arg, too. It would only be relevant/active on Python 3.3 and up. Perhaps it would be disabled by default on 3.3 too so that people who aren't using namespace packages--which is probably the common case--wouldn't need to think about it.

Something like find_packages(include=['mypkg'], include_ns_packages=True) seems like a decent API. And for projects that aren't using namespace packages in 3.3 and up, find_packages() with no args would work just like it does today.

ghost commented 10 years ago

Original comment by gwideman (Bitbucket: gwideman, GitHub: gwideman):


@Jason R. Coombs: "defer some extra weight [to me]" Hahaha -- that's a lot of responsibility!

First, I thought I posted this additional comment, but perhaps not:

Is adding an optional named argument to the existing function to select the 420-inclusive behavior a way forward that avoids the naming issue? Or is this beyond the pale for some reason?

If a new function is the way to go, then I certainly see the reasonableness of all your points.

If long names are out, then personally I lean strongly toward the find_420_packages name. Like I mentioned, it precisely captures the intent of the function, and its "technical" appearance correctly conveys that there are some nuances to understand.

I strongly dislike the names that change just "find" to "locate" or "discover".

I would also be quite happy with find_packages_ex or find_packages2. These aren't beautiful, but at least they convey that this function is newer in some way than the old one.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


@gwideman I do agree with your sentiment. I am concerned about the counter-sentiment, which is that this function is meant to be a convenience function, providing a shortcut for specifying the project names explicitly (and sometimes redundantly). A secondary advantage is that the function adds to the simplicity and elegance of the setup function. A name like 'find_distribution_and_namespace_packages' might auto-complete in some environments, but in many (most?) it won't. And furthermore, for a function like this, the aesthetic is relevant. A clumsy name, while functional, would detract from use (and thus from the appeal of setuptools).

Let me elaborate on my thoughts about the alternative names. Yes, they do mirror find_packages semantically without reflecting the disparity. This aspect was by design, with the intention that (a) it would be documented clearly (probably with a show-by-default deprecation warning), and (b) the new function would entirely supersede the old name. So, I'm somewhat reluctant to have a new function carry the baggage of a legacy function that's to be deprecated.

That said, you make some good arguments, and you've pushed me closer to the 'find_420_packages' name.

I should mention, I've considered 'find_all_packages' but rejected it because 'all' is vague yet still clumsy. I've also considered the Microsoft technique, find_packages_ex, but that's clumsy too and even more vague.

Given this ticket/PR is your initiative, I'm inclined to defer some extra weight to your judgment, so please consider these aspects and give me your favorite (or two).

Thanks again for your help on this.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Did you mean 3.3?

Yes

ghost commented 10 years ago

Original comment by gwideman (Bitbucket: gwideman, GitHub: gwideman):


@Jason: Where function names suggest an overlap of functionality, I strongly favor that the distinctions in the names actually indicate what the functional distinctions are.

Cases where functions are distinguished only by arbitrary differences in their names tend to present endless confusion, searches of docs, and so on. The difference between "find", "locate" and "discover" does not pertain to scope, which is the salient difference in functionality.

Also, having two overlapping functions with names that are not alphabetically adjacent (in docs and autocomplete) misses the opportunity to alert the unwary programmer that there's a choice to be made.

If the distinctness of meaning is: "find_distribution_and_namespace_packages", then I, for one, am in favor of calling it that. It's long, but we have copy/paste and IDE autocomplete.

If that's considered too long, then "find_420_packages", while somewhat cryptic, is at least precise. If it suggests cruft, that's perhaps a good thing -- that here some complexity lies, please be fully informed of the nuances of this function.

Side question: "once Python 3.2 is the baseline" <-- I'm puzzled about the salience of 3.2 to this topic; did you mean 3.3?

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


After further consideration on this issue, particularly around backward-compatibility, I'm inclined to back out the changes as published and instead create a new function, designed to supersede find_packages, with PEP-420 support.

This approach would give opt-in capability for projects that use PEP-420 package structures, but would leave the existing functionality unchanged.

I've been brainstorming over what this new function could be called. I'm leaning toward "discover_packages", "locate_packages", or "find_420_packages". The latter provides better clarity of its original purpose, but will appear like cruft in the future (once Python 3.2 is the baseline).

ghost commented 10 years ago

Original comment by arfrever (Bitbucket: arfrever, GitHub: arfrever):


But the include argument cannot be used if a project wants to be compatible with older Setuptools...

ghost commented 10 years ago

Original comment by wyatt (Bitbucket: wyatt, GitHub: wyatt):


The heuristics are tricky. Here's what I came up with (my original patch included similar heuristics):

As far as excluding certain hard coded directories, it seems like that shouldn't be necessary, since that's basically the purpose of the new include arg. I.e., you say you want to include 'mypkg*' and then the sibling directories of mypkg (like build and docs) will automatically be excluded.

In addition, you can use a little trick to exclude non-package directories that are inside a package: add an extension to them (e.g., mypkg/scaffolds.data). These will be excluded automatically by find_packages(), and they're also not importable.

I also wrote a blog post about this: http://wyattbaldwin.com/2014/02/07/pep-420-namespace-packages-somewhat-surprising/.
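The trick works because a dotted directory name can never be an importable package name. A sketch of such a filter (note: the patch in this thread uses the looser check `'.' not in name`; `str.isidentifier()` shown here is a stricter variant and is my own substitution):

```python
def is_importable_name(name):
    # A package name must be a valid Python identifier; 'scaffolds.data'
    # and 'build-tmp' both fail this, so a find_packages-style scanner
    # can skip such directories up front.
    return name.isidentifier()

candidates = ['scaffolds', 'scaffolds.data', 'my_pkg', 'build-tmp']
kept = [c for c in candidates if is_importable_name(c)]
print(kept)  # ['scaffolds', 'my_pkg']
```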

ghost commented 10 years ago

Original comment by arfrever (Bitbucket: arfrever, GitHub: arfrever):


I suggest skipping directories matching some hardcoded patterns by default (unless such directories are matched by an element of the include parameter):

ghost commented 10 years ago

Original comment by arfrever (Bitbucket: arfrever, GitHub: arfrever):


+1 for not including directories without any modules.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


By the way, I don't think there's a "big plan" regarding packages. PEP-420 was the plan and where it's deficient, more discussion should probably happen in python-dev.

I was there for the implementation of PEP-420. I feel like it's a good thing to simply and directly allow folders to represent packages. In the past, I was always bothered by the fact that it took two entities (a directory and an __init__ file) to represent a package.

Regardless, the implementation for Python 3.3 and 3.4 is set, so setuptools should support those models to the best of its ability. I think for find_packages, it might be worthwhile to do more heuristics, namely -- only inferring a directory is a package if it has modules somewhere within it.
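That proposed heuristic (only infer a directory is a package if it has modules somewhere within it) might be sketched as follows; this is an illustration, not setuptools code:

```python
import os
import tempfile

def contains_modules(path):
    """Treat a directory as a candidate (namespace) package only if
    some *.py file exists anywhere beneath it."""
    for root, dirs, files in os.walk(path):
        if any(f.endswith('.py') for f in files):
            return True
    return False

tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, 'ns', 'sub'))
open(os.path.join(tmp, 'ns', 'sub', 'mod.py'), 'w').close()
os.makedirs(os.path.join(tmp, 'empty'))  # e.g. a docs/build directory
has_ns = contains_modules(os.path.join(tmp, 'ns'))       # True
has_empty = contains_modules(os.path.join(tmp, 'empty')) # False
```

This would exclude module-free directories like build output, at the cost of an unbounded recursive scan per candidate.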

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


By more explicit - I mean that other package maintainers will have to do something like I've done in 78c8cfbe3e10.

ghost commented 10 years ago

Original comment by gwideman (Bitbucket: gwideman, GitHub: gwideman):


Jason:

"If we stick with the generous (and probably more correct) implementation to accept any directory as a package, package maintainers are going to have to be a lot more explicit about which directories are packages. It almost renders find_packages useless."

I think you are reaching a level of head scratching similar to my 2013-11-05 comment. It seems to me that PEP 420's implications weren't fully thought through, as it undermines the ideas about what the package concept facilitates and what it protects against (not just causing problems for find_packages() ).

Perhaps someone at a higher pay grade could distill a little better what the big-picture goal is for packages, and figure out whether PEP 420 gets there. To that end, it might be worthwhile recording in more detail the implications that you've encountered.

For example, your discussion of package maintainers needing "to be a lot more explicit" -- what form do you envision that taking?

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Add support for PEP 420 namespace packages to find_packages()

On Python 3.3+, find_packages() now considers any subdirectory of the start directory that's not a regular package (i.e., that doesn't have an __init__.py) to be a namespace package. Because this will often include non-package directories, a new include argument has been added to find_packages().

include can make it easier to narrow which directories are considered packages instead of having to specify numerous excludes. In particular, it's an easy way to keep top-level, non-source directories from being considered packages.

The other way this supports PEP 420 is by making sure __pycache__ directories are never added to the list of packages.

Refs issue #97

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


I've merged the pull request, but I find that the output includes quite a few directories not previously covered. Consider the comparison on setuptools itself:

> python
Python 3.4.0rc3 (v3.4.0rc3:8a81cdab3e9d, Mar  9 2014, 20:24:50) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import setuptools
>>> setuptools.find_packages()
['build', 'command', 'dist', 'docs', 'foo', 'setuptools', 'temp', 'tests', '_markerlib', 'build.lib', 'docs.build', 'docs._templates', 'docs._theme', 'setuptools.command', 'setuptools.tests', 'tests.shlib_test', 'build.lib.setuptools', 'build.lib._markerlib', 'docs.build.doctrees', 'docs.build.html', 'docs._theme.nature', 'setuptools.tests.indexes', 'setuptools.tests.svn_data', 'build.lib.setuptools.command', 'build.lib.setuptools.tests', 'docs.build.html._sources', 'docs.build.html._static', 'docs._theme.nature.static', 'setuptools.tests.indexes.test_links_priority', 'setuptools.tests.indexes.test_links_priority.simple', 'setuptools.tests.indexes.test_links_priority.simple.foobar']

> py -2.7
Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import setuptools
>>> setuptools.find_packages()
['setuptools', '_markerlib', 'setuptools.command', 'setuptools.tests']

I think the heuristic used in the example above (testing for the presence of *.py files) may be better. On the other hand, Python will import any directory.

> py
Python 3.4.0rc3 (v3.4.0rc3:8a81cdab3e9d, Mar  9 2014, 20:24:50) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.listdir('foo')
[]
>>> import foo
>>>

If we stick with the generous (and probably more correct) implementation to accept any directory as a package, package maintainers are going to have to be a lot more explicit about which directories are packages. It almost renders find_packages useless.

Some more consideration is going to have to go into this idea.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Add support for PEP 420 namespace packages to find_packages()

On Python 3.3+, find_packages() now considers any subdirectory of the start directory that's not a regular package (i.e., that doesn't have an __init__.py) to be a namespace package.

The other way this supports PEP 420 is by making sure __pycache__ directories are never added to the list of packages.

Fixes issue #97

ghost commented 10 years ago

Original comment by wyatt (Bitbucket: wyatt, GitHub: wyatt):


I just pushed two more alternative approaches. The first uses __import__ to determine which directories are packages, and the second works similarly to the existing version but also considers all directories under where to be namespace packages when running under Python 3.3+.

ghost commented 10 years ago

Original comment by wyatt (Bitbucket: wyatt, GitHub: wyatt):


I'll see if I can find some time tomorrow to write some tests and submit a PR.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


Nice. The patch looks quite good.

Would you consider putting together a test as well?

ghost commented 10 years ago

Original comment by wyatt (Bitbucket: wyatt, GitHub: wyatt):


I ran into this issue yesterday and took a quick stab at a patch.

It works pretty much the same as before, but it's necessary to specify what to include when there are top-level, non-package directories that contain .py files.

#!python
setup(
    # ...
    packages=find_packages(include=['my_pep420_ns_package']),
    # ...
)

It's slightly more complicated than I'd prefer because of the need to specify include in some cases. I was thinking about putting the where directory on sys.path and then trying to import subdirectories, but I wanted to try a simple(r) approach first.

On the other hand, maybe allowing includes to be specified is generally useful. It also makes find_packages() work in a way that's more similar to how the basic packages list works (i.e., you only specify what to include). In fact, if distutils.core.setup() would allow packages to be specified using wildcards, find_packages() probably wouldn't be necessary in most cases.

Anyway, here's the patch:

#!diff

diff -r f19978009ac8 setuptools/__init__.py
--- a/setuptools/__init__.py    Thu Feb 06 09:31:48 2014 -0500
+++ b/setuptools/__init__.py    Thu Feb 06 13:09:25 2014 -0800
@@ -1,5 +1,6 @@
 """Extensions to the 'distutils' for large or complex distributions"""

+import glob
 import os
 import sys
 import distutils.core
@@ -27,31 +28,58 @@
 # Standard package names for fixer packages
 lib2to3_fixer_packages = ['lib2to3.fixes']

-def find_packages(where='.', exclude=()):
+def find_packages(where='.', exclude=(), include=()):
     """Return a list all Python packages found within directory 'where'

     'where' should be supplied as a "cross-platform" (i.e. URL-style) path; it
-    will be converted to the appropriate local path syntax.  'exclude' is a
-    sequence of package names to exclude; '*' can be used as a wildcard in the
-    names, such that 'foo.*' will exclude all subpackages of 'foo' (but not
-    'foo' itself).
+    will be converted to the appropriate local path syntax.
+
+    'exclude' is a sequence of package names to exclude; '*' can be used
+    as a wildcard in the names, such that 'foo.*' will exclude all
+    subpackages of 'foo' (but not 'foo' itself).
+
+    'include' is a sequence of package names to include. If it's
+    specified, only the named packages will be included. If it's not
+    specified, all found packages will be included (this is the same as
+    specifying include='*'). 'include' can include wild card patterns
+    just like 'exclude'.
+
+    The list of includes is built up first and then any excludes are
+    removed from it.
+
     """
+    from fnmatch import fnmatchcase
+    include = list(include or '*')
     out = []
     stack=[(convert_path(where), '')]
+    pep420 = sys.version_info[:2] >= (3, 3)
+    in_regular_pkg = False
     while stack:
         where,prefix = stack.pop(0)
         for name in os.listdir(where):
             fn = os.path.join(where,name)
+            candidate = prefix + name
+            for pat in include:
+                if fnmatchcase(candidate, pat):
+                    break
+            else:
+                continue
+            in_regular_pkg = os.path.isfile(os.path.join(fn, '__init__.py'))
+            # Namespace packages can't be nested inside regular packages
+            in_pep420_ns_pkg = (
+                not in_regular_pkg
+                and pep420
+                and glob.glob(os.path.join(fn, '*.py'))
+            )
             looks_like_package = (
                 '.' not in name
                 and os.path.isdir(fn)
-                and os.path.isfile(os.path.join(fn, '__init__.py'))
+                and (in_regular_pkg or in_pep420_ns_pkg)
             )
             if looks_like_package:
-                out.append(prefix+name)
-                stack.append((fn, prefix+name+'.'))
+                out.append(candidate)
+                stack.append((fn, candidate + '.'))
     for pat in list(exclude)+['ez_setup']:
-        from fnmatch import fnmatchcase
         out = [item for item in out if not fnmatchcase(item,pat)]
     return out

ghost commented 10 years ago

Original comment by gwideman (Bitbucket: gwideman, GitHub: gwideman):


Yes, the more I look at PEP 420-style packages, the less I am able to see what distinction there is between a package, and simply a directory tree containing python files at any level, a tree that import+sys.path can navigate arbitrarily.

Packages become whatever directories can be reached by import combined with sys.path. And neither the imports nor sys.path is readily available to packaging software.

It may be that "packages" as a special level of modularity were an unnecessary idea in the first place, and it's as well to discard them.

But if packages are thought to be an important level of modularity, then it should be up to the package developer to positively ordain them (as with __init__.py), not left up to the caller and whatever the caller's sys.path happens to be. At least, so I'm thinking at the moment.

So I'm a bit mystified what the big plan is regarding packages.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


I believe this issue is more than a minor one and should be addressed sooner rather than later.

ghost commented 10 years ago

Original comment by jaraco (Bitbucket: jaraco, GitHub: jaraco):


I'd like to say that Setuptools should support PEP 420 packages. The harder question is how should find_packages discover such packages? PEP 420 allows importing of any importable filename ending in .py in any directory in sys.path. However, sys.path isn't relevant when building distributions and discovering packages.

I imagine setuptools could include all suitable directories, possibly omitting explicitly excluded paths. This change would be backwards-incompatible, but could provide packagers with options to produce compatible dists.

I considered using Python's import machinery to find packages, but I don't believe the import machinery provides any discovery tools.

The other thing to consider - if setuptools supports this mode, it will not work on other versions of Python, so will require Python 3.3+.

This gives me an idea - perhaps setuptools can use the package metadata to determine when PEP 420 package discovery is enabled. It could inspect the Trove classifiers, and if Python versions are specified and they only include versions >= 3.3, the feature would be enabled.

That technique would be somewhat implicit, but properly-documented, it could provide an elegant way to allow a packager to both include PEP 420 packages and declare the Python requirement at the same time.
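A sketch of that classifier-inspection idea (purely hypothetical; no such feature exists in setuptools, and the function name is invented):

```python
import re

def pep420_discovery_enabled(classifiers):
    """Enable PEP 420 discovery only when the Trove classifiers declare
    Python versions and all declared versions are >= 3.3."""
    versions = []
    for c in classifiers:
        m = re.match(r'Programming Language :: Python :: (\d+)\.(\d+)$', c)
        if m:
            versions.append((int(m.group(1)), int(m.group(2))))
    return bool(versions) and all(v >= (3, 3) for v in versions)

only_new = ['Programming Language :: Python :: 3.3',
            'Programming Language :: Python :: 3.4']
mixed = ['Programming Language :: Python :: 2.7',
         'Programming Language :: Python :: 3.3']
enabled_new = pep420_discovery_enabled(only_new)   # True
enabled_mixed = pep420_discovery_enabled(mixed)    # False
```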

jaraco commented 7 years ago

Another issue that's come up recently on distutils-sig is that nspkg.pth support is broken for PEP 420 currently -- the magic really shouldn't be running if PEP 420 support is available.

This issue has since been addressed in #805.

jaraco commented 7 years ago

That's why it would be simplest to change setuptools.dist:check_nsp to automatically add the namespace packages to the list of packages, instead of giving an error -- at that point, nobody should even notice that find_packages() isn't finding the packages in question. (Though the docs should still be updated to clarify that any __init__-less packages must still be manually listed in namespace_packages.)

This sounds like a fine recommendation, which I will attempt to implement.

jaraco commented 7 years ago

In the issue-97 branch, I've attempted to implement the recommendation, but I'm pretty sure this doesn't solve the problem. In particular, it still doesn't address the issue of a very basic use case like zope.interface. If zope/__init__.py does not exist, setuptools.find_packages() will still not find zope/interface/__init__.py.

estan commented 7 years ago

Anyone know if there has been any more work on this? I'm currently in the situation of wanting to switch from an "old school" namespace package to a PEP 420 one. In my case I have only a single (company-wide) namespace package and various distributions that install packages into it, so the solution @jaraco was working on in the issue-97 branch would be great for me, if only find_packages were also updated to consult namespace_packages when deciding whether to enter a directory in its search.

What I mean is, if find_packages encounters:

some_project/setup.py
some_project/my_company/foo/__init__.py
some_project/my_company/foo/stuff.py
some_project/my_company/bar/__init__.py
some_project/my_company/bar/morestuff.py

Then find_packages would normally not enter into my_company (the PEP-420 ns package). But if it was changed to also look at namespace_packages, it would enter, because I have my_company in there (I already do that). @jaraco: Would something like that work or am I babbling? Did you have any plans like that for your branch?

I would be perfectly fine with having to explicitly specify all my namespace packages in namespace_packages (in fact, I already do that with this single ns package that I have). Having find_packages work would still be a great boon compared to explicitly having to list all packages.

jaraco commented 7 years ago

My recommendation for now is that package maintainers continue to use __init__.py files in their namespace packages to mark the folder as a package for find_packages. Either that or enumerate the packages explicitly and not use find_packages. I haven't made any advancement on the branch, but I do agree, what you propose would be preferable.

I've flagged this issue as help-wanted because I simply don't have time to address all of these issues on my own. Please feel free to grab the work I've done and advance it, or start again from the current master.

estan commented 7 years ago

@jaraco Alright, I realized that what I said earlier doesn't really make sense, since find_packages has no way of getting hold of what was passed in to namespace_packages (right?).

We're moving to PEP 420, so what I've done for now is to keep a setup_helper.py next to my setup.py with:

from distutils.util import convert_path
from fnmatch import fnmatchcase

import os

def find_packages(where='.', exclude=(), include=('*',), pep420_dirs=()):
    """Our own version of the setuptools find_package helper

    The additional 'pep420_dirs' parameter specifies directories which are
    PEP 420 packages. These will be considered packages despite lacking a
    __init__.py. We use this until setuptools issue #97 is fixed.
    """
    where = convert_path(where)
    exclude += ('*__pycache__',)
    packages = []
    for root, dirs, files in os.walk(where, followlinks=True):
        all_dirs = dirs[:]
        dirs[:] = []
        for dir in all_dirs:
            full_path = os.path.join(root, dir)
            rel_path = os.path.relpath(full_path, where)
            package = rel_path.replace(os.path.sep, '.')
            included = any(fnmatchcase(package, pat=pat) for pat in include)
            excluded = any(fnmatchcase(package, pat=pat) for pat in exclude)
            has_init = os.path.isfile(os.path.join(full_path, '__init__.py'))
            if ('.' in dir) or not (rel_path in pep420_dirs or has_init):
                continue
            if included and not excluded:
                packages.append(package)
            dirs.append(dir)
    return packages

and then I do

    packages=find_packages(pep420_dirs=['my_company']),
    namespace_packages=['my_company'],

in the setup call. It's not too pretty, but works well enough until there's a better solution.

estan commented 7 years ago

@jaraco I haven't hacked on setuptools before, but I could give it a shot.

Though, the only way I can think of to make the find_packages search be guided by both the presence of __init__.py and namespace_packages is to somehow delay the entire search until namespace_packages is known.

So find_packages would do nothing but record the fact that it has been called, along with its parameters, leaving the actual search as a TODO for some later stage. But this would break the use of find_packages in any context other than the call to setup(...) :(

Do you see some other way?

Perhaps find_packages should be left completely alone, and some sentinel value should instead be stored, e.g. packages=RequestFindPackages(...); then at a later stage, if isinstance(packages, RequestFindPackages), the search is performed (obviously with some better name...).
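
Something along the lines of that RequestFindPackages idea might look like this (both names are hypothetical, taken from the comment above; nothing in setuptools provides them):

```python
class RequestFindPackages:
    """Hypothetical sentinel recording find_packages() arguments so the
    actual search can be deferred until namespace_packages is known."""
    def __init__(self, **kwargs):
        self.kwargs = kwargs

def resolve_packages(packages, namespace_packages, finder):
    # Later, inside setup(), expand the sentinel using the real finder
    # plus the now-known namespace packages; hand-written package lists
    # pass through untouched.
    if isinstance(packages, RequestFindPackages):
        return finder(namespace_packages=namespace_packages, **packages.kwargs)
    return packages
```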

estan commented 7 years ago

@jaraco One simple solution that I kind of like, to avoid having to defer the search until namespace_packages is known, would be to change the behavior of find_packages such that a directory in include is considered a package despite lacking __init__.py, and then instruct users to add their PEP 420 packages to include.

So it would be something like

implicit_namespace_packages=[...]

setup(
    ...
    packages=find_packages(include=[...] + implicit_namespace_packages, ...),
    namespace_packages=[...] + implicit_namespace_packages
)

I know it requires the user to introduce a variable to avoid having to type the list of implicit namespace packages twice, but I could live with that. For many use cases (e.g. a "suite" of distributions) I think the list of PEP 420 packages should be short (often just a single one), and the user might then even skip the variable.

I think it makes sense that if we've now discovered that find_packages needs more info to do its job in the presence of PEP 420 packages, that the info must be passed in as a parameter.

What do you think?

estan commented 7 years ago

Regarding the first idea (deferring the search until namespace_packages is known), we could of course just do the search over again, but this time guided by namespace_packages, and during the deferred search only add packages which are not already added. A little more inefficient, but would preserve the behavior of find_packages for those who use it outside of setup(...).

But the problem with this, of course, is that if find_packages records the fact that it's been called (and its parameters), there's no way to know at the later stage whether find_packages was called to construct packages, or whether it just happened to be called for some other purpose before setup was called, which could lead to unwanted behavior. Put another way, there's no way to know whether packages was constructed manually or by a call to find_packages.

estan commented 7 years ago

Yet another crazy idea: We could allow dict entries in packages, to signify "search requests":

setup(
    packages=[
        'some_hardcoded_package',
        { 'exclude': '*.tests' }
    ],
    namespace_packages=['my_company']
)

The dicts would support the normal find_packages parameters as keys, and would be expanded to the result of the search at a later time when namespace_packages is known. It's a little funky, but I think I like it.

Or maybe it's clearer to make it a named tuple FoundPackages à la

setup(
    packages=[
        'some_hardcoded_package',
        FoundPackages(exclude='*.tests')
    ],
    namespace_packages=['my_company']
)

No matter what, the downside of this approach is that find_packages then still won't have the capability, for the cases where someone is using it in another context. But perhaps we could also give it an implicit_namespace_packages parameter, for those use cases.

estan commented 7 years ago

Of my ideas so far, the one that requires the least code is to change the meaning of include. This is what it would look like:

diff --git a/setuptools/__init__.py b/setuptools/__init__.py
index 54577ce..ac7f524 100644
--- a/setuptools/__init__.py
+++ b/setuptools/__init__.py
@@ -52,18 +52,24 @@ class PackageFinder(object):
         specified, only the named packages will be included.  If it's not
         specified, all found packages will be included.  'include' can contain
         shell style wildcard patterns just like 'exclude'.
+
+        Note: PEP 420 packages will not be included unless they are explicitly
+              listed with their full names in 'include', so if you use PEP 420
+              you would use something like find_packages(include=('*', 'foo'))
+              if foo is a PEP 420 package.
         """

         return list(cls._find_packages_iter(
             convert_path(where),
             cls._build_filter('ez_setup', '*__pycache__', *exclude),
-            cls._build_filter(*include)))
+            cls._build_filter(*include), include))

     @classmethod
-    def _find_packages_iter(cls, where, exclude, include):
+    def _find_packages_iter(cls, where, excluded, included, include):
         """
-        All the packages found in 'where' that pass the 'include' filter, but
-        not the 'exclude' filter.
+        All the packages found in 'where' that pass the 'included' filter, but
+        not the 'excluded' filter. If a package is in 'include', it's considered
+        a package even if it lacks an __init__.py.
         """
         for root, dirs, files in os.walk(where, followlinks=True):
             # Copy dirs to iterate over it, then empty dirs.
@@ -76,11 +82,11 @@ class PackageFinder(object):
                 package = rel_path.replace(os.path.sep, '.')

                 # Skip directory trees that are not valid packages
-                if ('.' in dir or not cls._looks_like_package(full_path)):
+                if ('.' in dir or (not cls._looks_like_package(full_path) and package not in include)):
                     continue

                 # Should this package be included?
-                if include(package) and not exclude(package):
+                if included(package) and not excluded(package):
                     yield package

                 # Keep searching subdirectories, as there may be more packages

And then e.g.

setup(
    packages=find_packages(include=['*', 'my_company']),
    namespace_packages=['my_company']
)

estan commented 7 years ago

I realize I've had a misconception about one thing in my thinking: I thought that you still had to list PEP 420 packages in the namespace_packages parameter to setup. But I realize now that you only need to do this for old style namespace packages (those with __init__.py). So ignore all my talk about trying to avoid repetition.

In light of that, I think the absolutely best/easiest solution is to simply add a namespace_packages parameter to find_packages, where the user must list any PEP 420 packages explicitly. I'll prepare a PR to do so.

illume commented 7 years ago

Hello.

I made a small patch here with something that works for me, but needs a bit more work after discussion: https://github.com/pypa/setuptools/pull/943

This is a solution so that the simple case works without requiring any extra parameters be used.

I modified sampleproject here: https://github.com/illume/sampleproject Note how there is only a file sample/sample.py and no __init__.py.

This is the simplest possible package. I think this is a very important use case for many people, and I don't think find_packages() should require doing anything special with any extra parameters for this case. I've seen many people trip over this issue, since it's pretty much the simplest type of package you can make.

It works for me on the sampleproject above, and some other projects I tested. I handled the case where it ignores directories with site-packages in them (probably a venv, definitely not a package).
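
That venv-skipping check might be sketched like this (the function name is made up, and the check in the actual PR may differ):

```python
import os

def under_site_packages(rel_path):
    """Skip anything under a site-packages directory: it's probably a
    virtualenv that happens to live inside the project tree, definitely
    not a package to distribute. Sketch only."""
    return 'site-packages' in rel_path.split(os.sep)
```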

There are a few failing tests still. Also, I need to make it only look for namespace packages on Python 3.3+.

What do you think?