pypa / setuptools

Official project repository for the Setuptools build system
https://pypi.org/project/setuptools/
MIT License
2.5k stars 1.19k forks source link

[BUG] Selection of package data in sdist and wheels with `include-package-data` keyword #4638

Closed DanielYang59 closed 1 month ago

DanielYang59 commented 1 month ago

setuptools version

setuptools==74.1.2

Python version

Python 3.12.5 (MacOS) Python 3.10.12 (Ubuntu-WSL2)

OS

MacOS Sonoma 14.6.1 and Ubuntu-WSL2 22.04 LTS

Additional environment information

Well I haven't installed setuptools on my fridge (yet), maybe later

Description

I was trying to replace explicit declare of package data with include-package-data = true for pymatgen, but noticed python3 -m build could generate source distribution with package data missing (even after I removed dist and src/pymatgen.egg-info). I believe no other files were controlling this process (no MANIFEST.in, no related args in setup.py).

Here's the related section in pyproject.toml file (this is the complete file):

[build-system]
build-backend = "setuptools.build_meta"

[tool.setuptools]
include-package-data = true

[tool.setuptools.packages.find]
where = ["src"]
include = ["pymatgen", "pymatgen.*"]

Expected behavior

I would expect python3 -m build to behave consistently and generate wheels and sdist exactly as specified in pyproject.toml file.

The generated wheel (the wheel on pypi for example) and sdist should include multiple package data files (json and more):

wget https://files.pythonhosted.org/packages/7f/9e/b8b1014e4a81ffb048a905e71efde31a4512ea4ae00421356ea62c52ad2d/pymatgen-2024.8.9-cp312-cp312-macosx_11_0_arm64.whl

>>> unzip -l pymatgen-2024.8.9-cp312-cp312-macosx_11_0_arm64.whl  | grep json
    16284  2024-08-09 17:16   pymatgen/entries/calc_compounds.json.gz
    34835  2024-08-09 17:16   pymatgen/entries/exp_compounds.json.gz
    17909  2024-08-09 17:16   pymatgen/entries/data/nist_gas_gf.json
    40327  2024-08-09 17:16   pymatgen/entries/data/g_els.json
...  more 

How to Reproduce

git clone --branch package-data https://github.com/DanielYang59/pymatgen.git --depth 1
cd pymatgen
git checkout c8a3a50e3e92a3a66bd782d75e3f53e0f7252e76  # in case I pushed more commits
python3 -m venv venv
source venv/bin/activate
pip install -U build setuptools  # Successfully installed build-1.2.2 packaging-24.1 pyproject_hooks-1.1.0 setuptools-74.1.2 tomli-2.0.1
python3 -m build

Inspect the source distribution:

tar -tvf dist/pymatgen-2024.8.9.tar.gz | wc -l
>>> 350  # should be around 500 files

tar -tvf dist/pymatgen-2024.8.9.tar.gz  | grep json
>>> nothing

Inspect the wheel:

unzip -l dist/pymatgen-2024.8.9-cp310-cp310-linux_x86_64.whl
>>> 294 files  # should be around 500

unzip -l pymatgen-2024.8.9-cp310-cp310-linux_x86_64.whl | grep json
>>> nothing

Output

>>> python3 -m build

Output as file (too many lines): build.log

Generated sdist (all .json, .csv, *.yaml...missing): pymatgen-2024.8.9.tar.gz

The same to the wheel (cannot upload as .whl is not supported).

abravalheri commented 1 month ago

Hi @DanielYang59 please note that if you want json and other files included in the sdist via include-package-data you do need to add them either via a plugin like setuptools-scm or MANIFEST.in.

This condition is documented in https://setuptools.pypa.io/en/latest/userguide/datafiles.html#include-package-data:

First, you can use the include_package_data keyword ... ... files will be automatically installed with your package, provided:

  1. These files are included via the MANIFEST.in file, ...
  2. OR, they are being tracked by a revision control system such as Git, Mercurial or SVN, and you have configured an appropriate plugin ...

What include-package-data = true does is to copy all data files in the sdist to the wheel, but these files do need to be included in the sdist first.

The likely reason why the sdist on PyPI have the files is because you are using tool.setuptools.package-data. This behaviour is also documented in https://setuptools.pypa.io/en/latest/userguide/datafiles.html#package-data:

Note that the data files specified using the package_data option neither require to be included within a MANIFEST.in file, nor require to be added by a revision control system plugin.

I believe include-package-data behaviour to be consistent in setuptools since v58.5.3.

DanielYang59 commented 1 month ago

Hi @abravalheri thanks a loooooooooot for your quick response and the information (I should have read the documentation more carefully).

I could confirm using the following includes everything (seemingly, and I would need a closer inspection later) inside the src into the package data:

[tool.setuptools]
include-package-data = true

[tool.setuptools.package-data]
"*" = ["*", ]

[tool.setuptools.packages.find]
where = ["src"]
include = ["pymatgen", "pymatgen.*"]

I believe include-package-data behaviour to be consistent in setuptools since v58.5.3.

For some reason sometimes changes in pyproject.toml don't alter the built sdist/wheels, I guess I would need to completely clear the cache (I guess it would be a reasonable feature request to submit?).

Thanks again!

DanielYang59 commented 1 month ago

Hi @abravalheri thanks again for the comprehensive reply. I went through the Data Files Support page again and there're several points that are still not fully clear to me (bear with me for the newbie questions).

Usage of package-data section and include-package-data keyword in pyproject.toml

  1. If I understand correctly, when package-data section is not declared, all non-.py files would be considered package data if they're either listed in "MANIFEST.in" or tracked by version control system (with correct plugin).

By default, include_package_data considers all non .py files found inside the package directory (src/mypkg in this case) as data files, and includes those that satisfy (at least) one of the above two conditions into the source distribution, and consequently in the installation of your package.

Or, we could explicitly declare "package-data" through the package-data section regardless of they are inside the MANIFEST.in file or tracked by VCS.

  1. What include-package-data = true does is to copy all data files in the sdist to the wheel, but these files do need to be included in the sdist first.

From the above docs, it seems to me the [tool.setuptools.package-data] section is used to declare non-default files to be counted as "package data", and with include-package-data = true, these "package data" are then included into the source distribution?

But wait, why should we declare "package data" if we don't want to pack them into the package (include-package-data = false), perhaps there are other usage that I'm not aware of?

What about "package data" in "source distribution" vs "wheels"?

  1. If I understand correctly, the sdist would contain all files needed to build the wheels (plus above declared non-.py "package data"). And during wheels building, if include_package_data = true, those "package data" would then be copied to the wheels? Therefore, the include-package-data controls the "package data" in both sdist and wheels (either both have "package data" if true, or neither)?

The definition of "package", "source distribution" is a bit unclear (perhaps just to me)

  1. It's a bit unclear in the documentation whether "package" is refering to the "Python package" (a collection of Python modules), versus the collection of files include non-py files ("source distribution"?) we ship, for example:

    the most common use case for data files distributed with a package is for use by the package, usually by including the data files inside the package directory.

Or:

then all the .txt and .rst files will be automatically installed with your package, provided:

  1. Is "package" (collection of files we ship) the same as "source distribution"?

In a word, is the following diagram correct?

As Python is an interpreted language, .py files would remain as is, but Cython/C codes would be compiled as binary extensions (.so/.pyd). And additional package data files would also remain as is in the wheels?

image
DanielYang59 commented 1 month ago

Update: I think I understand now.

Just created a demo package: demo_project.zip

.
├── pyproject.toml
├── setup.py
└── src
    └── demo
        ├── pkg_1
        │   ├── __init__.py
        │   ├── data_1.txt
        │   └── module_1.py
        └── pkg_2
            ├── __init__.py
            ├── cython_from_pmg.pyx
            ├── data_2.csv
            └── module_2.py

5 directories, 9 files

And looks like include-package-data is not doing anything, see the following table:

include-package-data
TRUE FALSE
package-data set wheel and sdist have txt/csv wheel and sdist have txt/csv
not set wheel and sdist have no txt/csv wheel and sdist have no txt/csv

In other words, if the [tool.setuptools.package-data] section is set, both sdist and wheel would contain data_*.csv/txt, regardless how include-package-data is set, and vice versa.

If I understand correctly, include-package-data is not doing anything because: 1. I don't have a MANIFEST.in file; 2. No VCS plugin is used, and therefore it's effectively as false no matter how it's set.

By default, include_package_data considers all non .py files found inside the package directory (src/mypkg in this case) as data files, and includes those that satisfy (at least) one of the above two conditions into the source distribution, and consequently in the installation of your package.

And I don't think the following is quite right? i.e. include-package-data doesn't really control whether data files from sdist would be copied to wheels (it always copies).

What include-package-data = true does is to copy all data files in the sdist to the wheel, but these files do need to be included in the sdist first.

abravalheri commented 1 month ago

And I don't think the following is quite right? i.e. include-package-data doesn't really control whether data files from sdist would be copied to wheels (it always copies).

What include-package-data = true does is to copy all data files in the sdist to the wheel, but these files do need to be included in the sdist first.

Assuming that you put a MANIFEST.in or use something like setuptools-scm and not use package-data, than you can add include-package-data = false and that will have an effect.

abravalheri commented 1 month ago

This is how these conditions work togheter (in terms of Boolean expressions):

Notation:

i - include-package-data = true e - exclude-package-data is set to remove data files p - packag-data is set to include data files m - MANIFEST.in exists and includes data files w - data file included in the wheel distribution s - data file included in the sdist distribution

x': the negated form of x, in other words x' = not(x) x.y: logical and between x and y x+y: logical or between x and y


s = m + e'.p
w = i.e'.m + e'.p  -- Which can be simplified => e' . (i.m + p)
                   -- and is also equivalent to => e' . (i.s + p)

This behaviour is stable since v58.5.3, there is an experiment here showing those results https://github.com/abravalheri/experiment-setuptools-package-data?tab=readme-ov-file#results-for-setuptools6060.

DanielYang59 commented 1 month ago

And I don't think the following is quite right? i.e. include-package-data doesn't really control whether data files from sdist would be copied to wheels (it always copies). What include-package-data = true does is to copy all data files in the sdist to the wheel, but these files do need to be included in the sdist first.

Assuming that you put a MANIFEST.in or use something like setuptools-scm and not use package-data, than you can add include-package-data = false and that will have an effect.

I could confirm it's right! i.e. with no [tool.setuptools.package-data] section and package data declared in MANIFEST.in:

But I have a feeling that this is not clear from the documentation? Perhaps it's best to update that?

This is how these conditions work together (in terms of Boolean expressions):

Thanks for sharing! I must have seen this while searching in setuptools issues, this is very complicated so great to have this cheat sheet!

abravalheri commented 1 month ago

But I have a feeling that this is not clear from the documentation? Perhaps it's best to update that?

Definitely. Would you like to propose a PR for that? I have been through this docs passage many times and we had a couple of contributors changing it back and forth, so it is a bit tricky to get it right (at this stage I am also very "desensitised" to the text). Probably a fresh pair or eyes would help.

DanielYang59 commented 1 month ago

With pleasure, I would open a PR for that soon.

Probably a fresh pair or eyes would help.

Fully understandable, perhaps it good for a newcomer (me) to play a part such that other new users could easily understand too, because as a newcomer it's clearer to me what might be missing from the docs.

Any thanks a ton for the detailed explanation and examples again!

DanielYang59 commented 4 weeks ago

This is how these conditions work togheter (in terms of Boolean expressions):

Notation:

i - include-package-data = true e - exclude-package-data is set to remove data files p - packag-data is set to include data files m - MANIFEST.in exists and includes data files w - data file included in the wheel distribution s - data file included in the sdist distribution

x': the negated form of x, in other words x' = not(x) x.y: logical and between x and y x+y: logical or between x and y

s = m + e'.p
w = i.e'.m + e'.p  -- Which can be simplified => e' . (i.m + p)
                   -- and is also equivalent to => e' . (i.s + p)

I really appreciate this, which makes everything very clear and should be included into the docs (I believe). One minor comment though (I could be missing some background here), as "include or not" is a binary logic, can we use the standard logical connectives instead of creating definition ourselves? It was a bit hard to read, and defining + as OR is a bit counter-intuitive to me (I would read + as AND in my brain ).

¬ for negation (NOT)
∧ for conjunction (AND)
∨ for disjunction (OR)
abravalheri commented 4 weeks ago

There are many popular notations for Boolean Algebra. I found this article that seems to summarise the main ones: https://www.allaboutcircuits.com/technical-articles/boolean-basics/. The notion of a "standard" notation in this scenario may be biased by personal experience or field of study.

DanielYang59 commented 3 weeks ago

Thanks a lot for the input, yes I thought there could be background that I'm not aware of :)

I could be missing some background here