pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.56k stars 17.56k forks source link

Suggestion: remove tests from the distribution #30741

Open vfilimonov opened 4 years ago

vfilimonov commented 4 years ago

Would it make sense to remove tests folder from the pandas distribution? It takes roughly 33% of the whole package weight.

It is especially important when using pandas inside the AWS Lambdas, where the deployment package size is limited to 50 MB zipped and 5 MB might really make a difference.

# Uncompressed
du -h -s pandas*
 46.5M  pandas
 30.9M  pandas_no_tests

# Compressed
du -h -s pandas*
 14.7M  pandas.zip
 10.1M  pandas_no_tests.zip
TomAugspurger commented 4 years ago

I think we've talked about that in the past. We do have pandas.test() as part of the public API however, so we'd need to consider that.

A couple options:

  1. Provide a separate distribution like pandas-slim or something that excludes these files (and docs)
  2. Have pandas.test() fetch the source files on demand. That seems a bit messy though.

Just as a note: we do exclude the test data files that are present in the git repository. So we're only talking about source files.

vfilimonov commented 4 years ago

Hello @TomAugspurger

pandas-slim sounds like an good workaround.

It looks like docs are not a part of the distribution. And right it's tests code only without the data files - in terms of size they are second to _libs and almost equal to the rest of the code.

8.0K    ./arrays
 16K    ./errors
 24K    ./api
 40K    ./__pycache__
 76K    ./compat
 88K    ./_config
248K    ./tseries
308K    ./util
440K    ./plotting
2.3M    ./io
6.7M    ./core
 17M    ./tests
 20M    ./_libs

And what is the reason of having tests as a part of an API? I see that numpy, scipy, matplotlib etc are doing the same (while many other libs, especially web-oriented like flask, requests, jinja don't)?

TomAugspurger commented 4 years ago

Thanks for correcting me on the docs stuff.

I'm not sure why all these projects tend to include tests. As a guess, I suspect that we have more issues related to how the packages are built (compilers, flags, libraries we link to) than they do.

On Mon, Jan 6, 2020 at 1:56 PM Vladimir Filimonov notifications@github.com wrote:

Hello @TomAugspurger https://github.com/TomAugspurger

pandas-slim sounds like an good workaround.

It looks like docs are not a part of the distribution. And right it's tests code only without the data files - in terms of size they are second to _libs and almost equal to the rest of the code.

8.0K ./arrays 16K ./errors 24K ./api 40K ./pycache 76K ./compat 88K ./_config 248K ./tseries 308K ./util 440K ./plotting 2.3M ./io 6.7M ./core 17M ./tests 20M ./_libs

And what is the reason of having tests as a part of an API? I see that numpy, scipy, matplotlib etc are doing the same (while many other libs, especially web-oriented like flask, requests, jinja don't)?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/pandas-dev/pandas/issues/30741?email_source=notifications&email_token=AAKAOIUZLK4U3RLNROXYKADQ4OEHPA5CNFSM4KDF2MLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIGS7KI#issuecomment-571289513, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAKAOIR5SCPXJ5N7HFOBDRDQ4OEHPANCNFSM4KDF2MLA .

stonecharioteer commented 4 years ago

I particularly used to run the tests for numpy, scikitlearn and matplotlib after installing, since at times I'd have them fail on Windows. However this was quite some time ago, 4 years ago perhaps. Perhaps other users were doing the same?

jonringer commented 4 years ago

I'm a package manager for nixpkgs, and I'm against removing tests from the sdist package, however, removing from the wheel would make sense from a packaging standpoint. It's considered best practice in FOSS that if you distribute source, you also distribute tests along side it.

EDIT: tests are a nice guarantee that the package is working as intended.

We could also checkout the github repo for tests. However, quickly looking at the setup.py, pandas was meant to have the CI set correct metadata such as version. So the version-controlled source can't be directly be used to package pandas.

joaoe commented 4 years ago

Hi. i reported this elsewhere, so I'm pasting my comment here.

My use case After a pip install pandas the lib/site-packages/pandas/tests/ includes a lot of testing code which is definitely not relevant for me and many other end users of pandas. This bloats the installation and makes installation slower. I'm working on packaging a python environment to distribute with a preinstalled set of modules and application and there are too many popular 3rd-party modules which include unneeded test code, like numpy, IPython, jupyterlab, etc, which needs to be striped to keep the package size down. I'll be reporting issues to these projects as well.

Suggestion Therefore, my suggestion is to keep the pandas module streamlined, and move the tests out. Perhaps create a pandas-unittests module if people are interested in it, or just expect users to checkout the code. Another possibility would be to skip packaging the tests folder and conftest.py when creating packages to upload to pypi.org.

Regarding pandas-slim, everyone and their mothers have a dependency on pandas which would pull the whole code again with tests.

It's considered best practice in FOSS that if you distribute source, you also distribute tests along side it.

That perfectly fine. The discussion is whether tests are bundled with the pandas module or not.

Since you are now almost releasing 1.0 it might be a bit short notice to include this is such a big release. But for the next major release, it could work.

Thank you very much for your attention.

vfilimonov commented 4 years ago

A small remark: as a part of recent commit to pyarrow @wesm removed pyarrow.tests from the wheel which to my understanding contributed 2.3 MB of ~60 MB installed size.

In case of pandas tests folder contributed (as of version 1.0.3) tests folder contributed 17.9 MB out of 49 MB installed size.

So I'd like to bring the question back to the discussion and perhaps, @wesm could comment on that?

wesm commented 4 years ago

I think it would be a good idea to not ship the tests in wheels. If you want users to be able to run the tests against their production installs perhaps the tests can be packaged as a separate source wheel. Install size is becoming a problem because of size constraints in things like AWS Lambda.

TomAugspurger commented 3 years ago

https://uwekorn.com/2020/10/28/trimming-down-pyarrow-conda-2-of-x.html has some information.

I think I've come around to the idea that we can just not ship the test files in the main pandas distributions. We can have a separate pandas-tests so that pip install pandas-tests package that's just

  1. The test files
  2. A small __init__.py file ties things together

We could even update pandas.test() to check for the presence of the pandas-tests package.

dazza-codes commented 3 years ago

In https://github.com/rasterio/rasterio-wheels/issues/59, I'm requesting some kind of shared-libs solution that might help the python package ecosystem for binary wheels and a similar request could be made for shared testing libs. In the context of testing, it might/could help to have separate numpy.testing and pandas.testing projects that might be integrated into a parent sci-testing or similar uber-testing package with shared dependencies (it likely impacts other projects that are also using a pattern of including some kind of {package}.test utilities by default (vs. optional extra package). It's likely that whatever pandas needs for testing might also depend on numpy.testing and if that package is split out of the upstream numpy (similar to https://github.com/numpy/numpy/issues/17620 ??), then it would have implications for work on this and similar issues. (Might be useful to check a git grep 'numpy.testing' or similar grep on the pandas repo.)

jreback commented 3 years ago

pandas does not use numpy.testing a all

virtuald commented 3 years ago

I don't use pandas at all, but I saw the link from the numpy issue.

It's worth noting that with implicit namespace packages you could technically ship pandas.tests in a separate wheel and wouldn't break backwards compatibility (other than the new requirement). I've found the primary downside of namespace packages is that editable installations have annoying caveats, but in an installed system it seems to be supported just fine.

jorisvandenbossche commented 3 years ago

For the conda side, @xhochy did this for pyarrow here https://github.com/conda-forge/arrow-cpp-feedstock/pull/186, so that could be inspiration for doing the same for the pandas conda package.

jameslamb commented 3 years ago

It sounds like maintainers here are generally supportive of cutting tests and docs out of the package. I recently went through this exercise in LightGBM (https://github.com/microsoft/LightGBM/pull/3639, https://github.com/microsoft/LightGBM/pull/3685), and I'd be happy to do the work here to reduce the package size. Would the maintainers here be open to PRs if I started working on this?

Details & Background

I think Wes's comment about AWS Lambda is an important one: https://github.com/pandas-dev/pandas/issues/30741#issuecomment-638158062. A few months ago, I worked on trying to use pandas + scikit-learn + lightgbm together in Lambda Layers, and had to do some weird stuff to make it work while staying under the 250 MB limit for layers, like this:

TEMP_LAYER_DIR=$(pwd)/temp-layer-dir
mkdir -p ${TEMP_LAYER_DIR}

echo "Creating 'pandas-layer'"
docker run \
    --rm \
    -v ${TEMP_LAYER_DIR}:/var/task \
    lambci/lambda:build-python3.7 \
        pip install --no-deps -t python/lib/python3.7/site-packages pandas==0.24.1 pytz

pushd ${TEMP_LAYER_DIR}
zip \
    --exclude \*/tests/\* \*dist-info\* \*/__pycache__\* \
    -r pandas-layer.zip \
    python
mv *.zip ${UPLOAD_DIR}/
rm -r python

I think a first step that wouldn't need to involve changes to pandas.test() could be removing documentation from the source distribution. I can see most of https://github.com/pandas-dev/pandas/tree/master/doc in the sdist package, even the cheatsheet powerpoints / PDFs.

wget https://files.pythonhosted.org/packages/75/bc/abf2e8cc6a9d918008774b958613cfdbd3a8c135cffb0757f78fabd8268f/pandas-1.2.0.tar.gz -O pandas.tar.gz
tar -xvf pandas.tar.gz
cat pandas-1.2.0/pandas.egg-info/SOURCES.txt | grep -E "^doc"

Thanks for your time and consideration!

jreback commented 3 years ago

yup would take PRs here

jameslamb commented 3 years ago

Thanks! I've opened #38846 with a proposal to remove docs. It reduces the size of the sdist package by about 1MB.

I think that removing tests needs to be done by maintainers, though, if it will involve distributing a new pandas-tests package (https://github.com/pandas-dev/pandas/issues/30741#issuecomment-719609290).

I really hope that is picked up, as it would have a significant impact on the size of the package.

master #38846 #38846 + removing tests
sdist (compressed) 5.7M 4.5M 2.3M
sdist (uncompressed) 25M 19M 7.7M
python setup.py sdist > /dev/null
du -a -h dist/
Dr-Irv commented 3 years ago

Thanks! I've opened #38846 with a proposal to remove docs. It reduces the size of the sdist package by about 1MB.

I think that removing tests needs to be done by maintainers, though, if it will involve distributing a new pandas-tests package (#30741 (comment)).

I really hope that is picked up, as it would have a significant impact on the size of the package.

master #38846 #38846 + removing tests sdist (compressed) 5.7M 4.5M 2.3M sdist (uncompressed) 25M 19M 7.7M

python setup.py sdist > /dev/null
du -a -h dist/

@jameslamb Can you check your calcs on the difference if tests are removed? I think you included the folder tests/io/sas/data which takes up 12MB, but it is not in the distribution - just in the source repo.

jameslamb commented 3 years ago

I just tried removing that directory completely before running the size checks. The results are the same

# check size
rm -rf pandas.egg-info
rm -rf dist
python setup.py sdist
du -a -h dist/

# check size without tests/io/sas/data/
rm -rf pandas/tests/io/sas/data/
rm -rf pandas.egg-info
rm -rf dist
python setup.py sdist
du -a -h dist/

I also tried this with prune pandas/tests added to MANIFEST.in (which is how I originally calculated the "without tests" numbers).

with pandas/tests/io/sas/data in the repo

with pandas/tests/io/sas/data removed from the repo

When I check pandas.egg-info/SOURCES.txt in all cases, I don't see tests/io/sas/data anywhere.

This is the expected behavior. I think all the data files you're referring to will be excluded by these rules.

It is interesting that the new compressed size I'm seeing is different from when I ran these checks a week ago. But this is a very active project so I'm assuming that that's related to some new PRs that have been merged.

Dr-Irv commented 3 years ago

@jameslamb I was surprised that in your original numbers, with the uncompressed version, the result lowered from 19M to 7.7M, so that's why I thought the SAS data was the culprit.

If I look at my installation of pandas 1.1.3, the uncompressed installed tests directory contains just over 9MB. Still justifies figuring out if we can leave it out of the dist.

jameslamb commented 3 years ago

Providing an update here, it seems that distributions of pandas have grown by maybe ~15MB (uncompressed) since I last calculated it in December (https://github.com/pandas-dev/pandas/issues/30741#issuecomment-752838856).

On my laptop (running Python 3.7 and Ubuntu 18.04), running the following

# install pandas 1.2.3
pip install --upgrade pandas

# check sizes
du -a ${HOME}/miniconda3/lib/python3.7/site-packages/pandas \
   | sort -n -r \
   | head -n 10

results in

68548   pandas
35944   pandas/tests
17188   pandas/_libs
10000   pandas/core
5028    pandas/tests/indexes
4996    pandas/_libs/tslibs
4700    pandas/tests/frame
4620    pandas/tests/io
3376    pandas/tests/series
3244    pandas/io

The tests directory now looks to be about 35MB uncompressed.

I would help with this, but I think that removing tests from pandas distributions is a fairly invasive change (see @TomAugspurger 's comment from October: https://github.com/pandas-dev/pandas/issues/30741#issuecomment-719609290) that has to be done by maintainers. I'm personally +1 on doing that, for the reasons mentioned in this thread and in https://github.com/pandas-dev/pandas/pull/38846#issuecomment-786956192.

TomAugspurger commented 3 years ago

I'm also +1 on removing tests from the distribution (and optionally pulling them down on demand for pandas.test()), but won't have time to work on it.

jreback commented 3 years ago

agree we should try to package separately but this would take some dedicated work

jayqi commented 3 years ago

Hi, I was following along with PR #38846 which was recently merged, and I wanted to discuss the idea that I repeatedly saw @WillAyd bring up: specifically the idea of implementing a pandas[no-doc] PEP-508-compliant dependency specification as a way to give users an option to opt out of documentation in their packages. Unfortunately, I think this idea is technically infeasible given the mechanics of PEP 508 dependencies. (Would be very happy to be proven wrong because this has been an annoying limitation in other projects.)

Basically, the problem is that dependency specifications only allow adding dependencies, not subtracting them. This is reflected in the actual PEP 508 language (emphasis mine):

A dependency specification always specifies a distribution name. It may include extras, which expand the dependencies of the named distribution to enable optional features.

An extra is an optional part of a distribution. Distributions can specify as many extras as they wish, and each extra results in the declaration of additional dependencies of the distribution when the extra is used in a dependency specification. For instance:

There an ongoing discussion on the Python Software Foundation discourse about how to implement dependency specifications that can be opted out of.

In the context of solutions for pandas, this means that the only way to give users an option of not having a dependency is to leave them out of the base distribution, and then having them be opt-in.

I'd also like to note that my understanding is that PEP 508 dependency specifications are for dependencies on other Python packages, and usually not about including additional package data. So it's not even clear to me how one would implement a pandas[docs] opt-in solution, though perhaps there is a creative way to do so.

mroeschke commented 1 year ago

One potential difficulty before implementing this:

We note for EA authors that there is a set of test that can be run to check EA compatibility: https://pandas.pydata.org/docs/development/extending.html#testing-extension-arrays

So if tests were removed from the distribution, a subset of pandas/tests/extensions/ probably should not be removed

viccsjain commented 9 months ago

Splitting the pandas library and tests would be really useful. We are using this library in our serverless deployment. and there is size restriction to upload the package into AWS lambda of 250 MB. Removing tests file will reduce the size of our package.

thesamesam commented 9 months ago

See also the discussion in https://github.com/pandas-dev/pandas/issues/54907.

jbsilva commented 9 months ago

Making docs and tests optional would be great. In my cloud deployments I repackage it without the tests; 15 MB do make a difference for me. I've seem many other packages including tests, but never that big.

dolfinus commented 1 month ago

I was checking the size of one of my docker images, and found that tests are about 50% of the size of installed package: изображение

Completely waste of space for me.