keithrozario / Klayers

Python Packages as AWS Lambda Layers

Scipy lambda layer for 3.9 and 3.10 #360

Open explomind1 opened 1 year ago

explomind1 commented 1 year ago

Since the current AWS Lambda layers don't support SciPy on Python 3.9 and above, it would be great if we could create an ARN for SciPy as well. Does anyone know when there will be an AWS layer for SciPy for Python 3.9 and 3.10?

I have tried creating a custom layer for SciPy that supports 3.9 or 3.10; however, it always gives a C-extension error or says the SciPy module is broken when I try to create it from the Cloud9 IDE without NumPy and then upload it back to Lambda. Moreover, it is not possible to add SciPy from Cloud9 either, because it is above the MB limit that Lambda can handle (the only way is to delete the NumPy directories, after which SciPy can be successfully installed to Lambda without any errors).

I would really appreciate it if anyone knows when AWS will provide an official layer, just like it did for 3.7 and 3.8.

keithrozario commented 1 year ago

If you're asking about the official AWS layer, I don't really know.

We can try to add SciPy for 3.10 here, but we may run into the size limit, which is a hard limit that can't be worked around.

dschmitz89 commented 10 months ago

SciPy wheels are roughly 30-40 MB in size lately: https://github.com/scipy/scipy/releases/tag/v1.11.4. Does that seem too much?

I would like to see if I can help out with this issue. As a regular SciPy contributor I am familiar with the SciPy tooling, and I use Lambda at my day job, but I am pretty new to Lambda layer creation. Do you still have old scripts for SciPy lying around?

keithrozario commented 10 months ago

If someone would make a pull request to add these packages, then I'll merge them and they'll be built automatically :)

keithrozario commented 10 months ago

I tried building SciPy for Lambda, but currently its size exceeds what Lambda accepts.

Lambda has a limit of 50 MB, and SciPy's size is above that (~57 MB). Note this is the result of a pip install scipy ... which includes not just SciPy but NumPy as well.

I will see if we can remove the cache files to reduce the size, but at the moment this is the size :(

keithrozario commented 10 months ago

Currently the output looks like this: [screenshot of the build output]

keithrozario commented 10 months ago

I will experiment with removing the __pycache__ directories, and separately with keeping them, to see what happens.
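As a rough first check (a sketch; it assumes the layer contents are unpacked under a python/ directory), the total footprint of the bytecode caches can be measured before deciding:

# sum the sizes of all __pycache__ directories under the layer (python/ is an assumed path)
du -ch $(find python -type d -name "__pycache__") | tail -n 1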

rgommers commented 10 months ago

IIRC the size limit is 250 MB unzipped, rather than 50 MB on upload.

You can significantly cut down the size by deleting all the tests/ directories. Also, you probably don't need the 3 scipy/misc/*.dat test images and they are large. Deleting all that may cut the package size by ~25% or so.

It used to be possible to get numpy/scipy/pandas in a single layer. I'd be curious what the status is now.
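For reference, a quick way to check a build against both limits (a sketch; it assumes the packages were installed into a python/ directory, as Lambda layers expect):

# unzipped size, which counts against the 250 MB limit
du -sh python/
# zipped size, which matters for the 50 MB direct-upload limit
zip -r -9 -q layer.zip python
ls -lh layer.zip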

keithrozario commented 10 months ago

Thanks, I'll check and see if it's possible. But that's a lot of bespoke effort, which may be unsustainable.

The Lambda limit is 50 MB zipped, and currently the total zipped size is bigger than that :(

gpap-gpap commented 10 months ago

I am also interested in a SciPy layer for 3.10+, and can't find a workaround for the size limit. I am not sure if you already do this, but running something like find . | grep -E "(/tests$|__pycache__$|\.pyc$|\.pyo$)" | xargs rm -rf before zipping gets rid of files that are not needed in the layer. If that fails, then all you can do is install submodules of scipy separately as needed, which is not ideal.

dschmitz89 commented 9 months ago

Friendly ping: has there been any progress here? For the custom removal of code, is it possible to automatically inject such package-specific steps into the overall Terraform build script?

keithrozario commented 9 months ago

If someone could modify the build function, that'd be much appreciated :). I think for now we can remove all __pycache__ files to save space; that may help.

alexiskat commented 9 months ago

Not sure if this would help at all, but this saved a lot of space when building the layer.

docker run -v "$PWD":/var/task "public.ecr.aws/sam/build-python3.9" /bin/sh -c "pip install -r requirements.txt --platform manylinux2014_x86_64 --implementation cp --python-version 3.9 --only-binary=:all: --upgrade --trusted-host pypi.org --trusted-host files.pythonhosted.org -t python/lib/python3.9/site-packages/; exit"
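In case it helps, the resulting python/ directory can then be zipped and published as a layer; a sketch (the layer name below is a placeholder, not an official Klayers ARN):

# zip the python/ directory produced by the container build above
zip -r -9 -q scipy-layer.zip python
# publish it as a new layer version
aws lambda publish-layer-version \
    --layer-name my-scipy-layer \
    --zip-file fileb://scipy-layer.zip \
    --compatible-runtimes python3.9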

aperture147 commented 7 months ago

IIRC the size limit is 250 MB unzipped, rather than 50 MB on upload.

You can significantly cut down the size by deleting all the tests/ directories. Also, you probably don't need the 3 scipy/misc/*.dat test images and they are large. Deleting all that may cut the package size by ~25% or so.

It used to be possible to get numpy/scipy/pandas in a single layer. I'd be curious what the status is now.

Tested and it works. I also added --no-compile and deleted all the dist-info directories, and now NumPy, SciPy and Pandas can be placed in a single layer. All of them take approx. 195 MB, so I have an extra 50 MB left over for all of my imagination.

This is some of the most hilarious black magic I've ever seen.

keithrozario commented 7 months ago

Wow. I need to find some way to automate this. What does --no-compile do?

rgommers commented 7 months ago

@keithrozario we have just implemented proper support for this in NumPy, via "install tags" in the build system. Here is how to use it: https://github.com/numpy/numpy/issues/26289#issuecomment-2068943795. I'm planning to do the same for SciPy. It would come down to adding -Cinstall-args="--tags=runtime,devel,python-runtime" to your pip install (or pip wheel or python -m build) invocation in order to drop the test suite.

--no-compile is a pip flag: https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-no-compile

That together should make all this a one-liner. It should work for NumPy now.
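If it helps, a rough sketch of what that one-liner could look like for NumPy (untested here; it assumes a recent NumPy sdist with the install-tags support, pip >= 23.1 for the -C shorthand, and a working build toolchain in the build container):

# build NumPy from source, dropping the test suite via install tags, and skip bytecode
pip install numpy --no-binary numpy --no-compile \
    -Cinstall-args="--tags=runtime,python-runtime,devel" \
    -t python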

aperture147 commented 7 months ago

Wow. I need to find some way to automate this. What does --no-compile do?

It will not precompile the Python code into bytecode during the install process. But the test suites are what consume most of the megabytes; the bytecode takes just a few megabytes at most.

My approach is summed up in this script:

# install CPython wheels only, without precompiling bytecode
pip install numpy pandas scipy --no-compile --implementation cp --only-binary=:all: -t python

# remove all dist-info directories
rm -r python/*.dist-info

# delete all tests directories
find . | grep -E "/tests$" | xargs rm -rf

# clean up python byte code if any
find . | grep -E "(/__pycache__$|\.pyc$|\.pyo$)" | xargs rm -rf

# delete pyproject.toml too, since it is not needed
find . | grep -E "pyproject\.toml$" | xargs rm -rf

# delete unused .dat files from scipy.misc, which has been deprecated since SciPy 1.10
find . | grep -E "scipy/misc/.*\.dat$" | xargs rm -rf

Btw, I think modifying the bundled source code is not good practice, though.

aperture147 commented 7 months ago

@keithrozario we have just implemented proper support for this in NumPy, via "install tags" in the build system. Here is how to use it: numpy/numpy#26289 (comment). I'm planning to do the same for SciPy. It would come down to adding -Cinstall-args="--tags=runtime,devel,python-runtime" to your pip install (or pip wheel or python -m build) invocation in order to drop the test suite.

--no-compile is a pip flag: https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-no-compile

That together should make all this a one-liner. It should work for NumPy now.

I don't get why NumPy and SciPy have their test suites in the wheel when they don't contribute anything at runtime. I thought it was a sanity check on every import, but it's just the test package from the Meson build phase. It's a bummer to have to rebuild NumPy just to get rid of the test suite.

rgommers commented 7 months ago

@aperture147 that's historical. Once upon a time, many more users built from source, and back then it was critical to be able to run tests with numpy.test() in order to diagnose all sorts of weird issues. Having tests in tests/ subfolders of the package used to be very common, maybe even the standard place to store them.

For new projects started today, the test suite usually goes outside of the importable tree. Moving everything in numpy now though would be very disruptive, as it would (among other things) make all open PRs unmerge-able.

keithrozario commented 7 months ago

Thank you so much all. I'll look into this next week or so, and hopefully we can get a layer of scipy out!!!

I'm not sure how much of this is generic (can be applied to all packages) and how much is specific to scipy though. Will have to think a bit more.

aperture147 commented 7 months ago

AFAIK, it is safe to remove the tests directories from both SciPy and NumPy. NumPy's tests directories are even larger than SciPy's.

keithrozario commented 6 months ago

Test layer is here:

arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1

We apply the --no-compile flag to avoid the .pyc and __pycache__ files, and also delete all directories named 'tests', as recommended by the experts on this thread :)

Feel free to run some tests on the layer. If all goes well, I'll push this into production before the end of this week, and we'll have 'optimized' builds going forward.
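One quick way to smoke-test it (a sketch; the function name is a placeholder for any existing Python 3.12 function in ap-southeast-1 whose handler imports scipy):

# attach the test layer to an existing function and invoke it once
aws lambda update-function-configuration \
    --function-name my-scipy-test \
    --layers arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1
aws lambda invoke --function-name my-scipy-test /dev/stdout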

keithrozario commented 6 months ago

I forgot to remove .dat and dist-info as well. That's up next.

rgommers commented 6 months ago

I'll need to write docs for it, but this command will already remove test data as well as some large-ish _test_xxx.so extension modules that live outside of the tests/ directories:

$ python -m build -wnx -Cinstall-args=--tags=runtime,python-runtime,devel

It's available in SciPy's main branch since a week ago (https://github.com/scipy/scipy/pull/20712).

I forgot to remove .dat and dist-info as well. That's up next.

You probably want to keep .dist-info. It's actually a functional part of a package, e.g. importlib.metadata uses it. And the license file is mandatory to keep when you're redistributing. .dist-info is also small, ~100 kb or so. If you really need to shave things off, I'd only remove RECORD since it's both the largest file and a not very important one within a Lambda layer.
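If you do want to shave off just the RECORD files, something along these lines should do it (a sketch, assuming the packages live under python/):

# drop only RECORD from each .dist-info, keeping METADATA and the license files
find python -type f -path "*.dist-info/RECORD" -delete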

keithrozario commented 6 months ago

Thanks. Unfortunately, I do not build the package from source, I merely pip install.

Will take your comment on keeping the dist-info, but I'll see if I can identify any _test_xxx.so files to be removed as well.

rgommers commented 6 months ago

I think this is the full list:

$ ls -l build/scipy/*/*.so | rg test
-rwxr-xr-x 1 rgommers rgommers    28664 17 mei 18:04 build/scipy/integrate/_test_multivariate.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers   270968 17 mei 18:04 build/scipy/integrate/_test_odeint_banded.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers   151456 17 mei 18:04 build/scipy/io/_test_fortran.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers    52912 17 mei 18:04 build/scipy/_lib/_test_ccallback.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers   158752 21 mei 13:45 build/scipy/_lib/_test_deprecation_call.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers    92272 21 mei 13:45 build/scipy/_lib/_test_deprecation_def.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers    31336 17 mei 18:04 build/scipy/ndimage/_ctest.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers   386480 21 mei 13:45 build/scipy/ndimage/_cytest.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers  1095216 21 mei 13:45 build/scipy/special/_test_internal.cpython-312-x86_64-linux-gnu.so

keithrozario commented 6 months ago

Thanks -- the challenge for Klayers, at least, is that we need to keep the script generic. I'm very hesitant to include package-specific build steps for something like SciPy, because maintaining that going forward would be difficult.

Although it sounds OK, deleting every file that matches _test*.so might cause issues with other packages, but I would say the probability that someone has a runtime-required .so file beginning with _test is very low.
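For what it's worth, the generic version of that rule would look something like this (a sketch; it assumes the same python/ layout as the other cleanup steps):

# remove test-only extension modules; the _test* prefix is a heuristic, not a guarantee
find python -type f -name "_test*.so" -delete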

Still pondering. Wonder what others are thinking.

dschmitz89 commented 6 months ago

Yep, this would be a nightmare to maintain in the long run.

I would be interested to test it out on a fork of this repo though without making a PR to your main repo. Any chance we can make that work?

aperture147 commented 6 months ago

You could try adding a specific script per library, e.g. a file called scipy.sh that customizes the installation (by deleting unwanted files). Then whenever you install scipy, check whether a scipy.sh exists in the repo; if it does, use scipy.sh instead of a plain pip install scipy to install into the layer, as sketched below.
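A rough sketch of that dispatch idea (the file layout and names here are hypothetical, not Klayers' actual build code):

# if an override script exists for this package, use it; otherwise fall back to plain pip
PACKAGE=scipy
if [ -x "overrides/${PACKAGE}.sh" ]; then
    "./overrides/${PACKAGE}.sh"
else
    pip install "$PACKAGE" -t python
fi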

I noticed that SciPy and NumPy both use GFortran and OpenBLAS, but each uses a slightly different version, stored separately as .so files in the numpy.libs and scipy.libs directories. I'm thinking that there could be a way to make SciPy and NumPy use the same GFortran and OpenBLAS libraries; then we could save about 25 MB. Is there any way to achieve this @rgommers? I'm not a guru at building statically linked libraries, especially with the Meson build system. If we build this layer on Amazon Linux 2 and dynamically link against libraries that already exist in the environment, then we can shrink the layer even more.

aperture147 commented 6 months ago

Test layer is here:

arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1

We apply the --no-compile flag to avoid the .pyc and __pycache__ files, and also delete all directories named 'tests', as recommended by the experts on this thread :)

Feel free to run some tests on the layer. If all goes well, I'll push this into production before the end of this week, and we'll have 'optimized' builds going forward.

I notice that stripping the Python bytecode increases the cold start time. Should we keep the bytecode to reduce cold starts, or is it just me fiddling too much with the layer?

rgommers commented 6 months ago

I'm thinking that there could be a way to make SciPy and NumPy use the same GFortran and OpenBLAS libraries; then we could save about 25 MB. Is there any way to achieve this @rgommers? I'm not a guru at building statically linked libraries, especially with the Meson build system. If we build this layer on Amazon Linux 2 and dynamically link against libraries that already exist in the environment, then we can shrink the layer even more.

Not really when building the layer from wheels published to PyPI. NumPy uses 64-bit (ILP64) OpenBLAS, while SciPy uses 32-bit (LP64). We have a long-term plan to unify these two builds, but PyPI/wheels make this very complex. I would not recommend doing manual surgery here.
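For anyone curious, both packages can report which BLAS/LAPACK build they bundle, which makes the mismatch visible (numpy.show_config() and scipy.show_config() are existing helpers):

# print the BLAS/LAPACK configuration each package was built against
python -c "import numpy; numpy.show_config()"
python -c "import scipy; scipy.show_config()"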

keithrozario commented 6 months ago

Test layer is here:

arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1

We apply the --no-compile flag to avoid the .pyc and __pycache__ files, and also delete all directories named 'tests', as recommended by the experts on this thread :)

Feel free to run some tests on the layer. If all goes well, I'll push this into production before the end of this week, and we'll have 'optimized' builds going forward.

I notice that stripping the Python bytecode increases the cold start time. Should we keep the bytecode to reduce cold starts, or is it just me fiddling too much with the layer?

Yes. Do you know how much slower the cold start is? Python will need to compile the .py files into bytecode, and that will incur some latency. For big packages this might be a lot, but I'm not sure.

aperture147 commented 5 months ago

Yes. Do you know how much slower the cold start is? Python will need to compile the .py files into bytecode, and that will incur some latency. For big packages this might be a lot, but I'm not sure.

Normally it only takes about 500 ms to 1 s to warm up the Lambda, but now it takes 2 s+ (sometimes up to 5 s+ if I import all of NumPy, SciPy and pandas) to spin it up (tested on a 1024 MB Python 3.10 Lambda function). Is it a bytecode compilation problem, or is it just me doing too much surgery on the layer?

keithrozario commented 5 months ago

No, it's probably bytecode compilation. Let me think about this a bit more. Bytecode is tied to the Python version, so it should be shareable across functions running the same runtime version.

But bytecode also takes space, so we have to trade off space against speed. Nothing will work for everyone -- so my thinking is to remove bytecode only if the package is large.

keithrozario commented 5 months ago

I love this conversation. I did a test today using just NumPy, comparing a layer that had __pycache__ vs a layer that didn't, on a 128 MB function using Python 3.12.

The findings:

With __pycache__, init times were: 635 ms, 593 ms, 637 ms
Without __pycache__, init times were: 677 ms, 684 ms, 708 ms

Which suggests a ~50 ms penalty for compiling .py into .pyc. Unless the package is huge (and NumPy is already quite big), I don't think you'll see any discernible performance difference. And if you tweak the Lambda settings, like memory size, that difference would likely shrink even further.

Given this, if you're importing something like boto3 or requests, the difference is so small that nobody will notice whether the cache is included or not. For the larger packages like NumPy and SciPy, most (not all) users will want to optimize for space, so that their own code or additional layers can be larger. Defaulting to removing __pycache__ seems like a logical decision.

So right now, we will remove .pyc files from all layers moving forward. Again, this will not meet 100% of everyone's requirements, but it will serve the majority of users the majority of the time. Let me know your thoughts below.
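For anyone who would rather trade space for cold-start time, the bytecode can be regenerated in their own packaging step before zipping; a minimal sketch, assuming the layer contents sit under python/ and the local interpreter matches the target runtime:

# pre-compile .py files back into __pycache__ before zipping the layer
# (run with the same Python minor version as the Lambda runtime)
python -m compileall -q python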

Does that mean I can remove the need for separate packages for different versions of Python??? Interesting....!!