explomind1 opened this issue 1 year ago
If you're asking about the official AWS layer, I don't really know.
We can try to add SciPy for 3.10 here, but we may run into the size limit in MB, which is a hard limit that can't be worked around.
SciPy wheels are roughly 30-40 MB in size lately: https://github.com/scipy/scipy/releases/tag/v1.11.4. Does that seem like too much?
I would like to see if I can help out with this issue. As a regular SciPy contributor I am familiar with the SciPy tooling, and I use Lambda at my day job, but I am pretty new to Lambda layer creation. Do you still have old scripts for SciPy lying around?
If someone would make a pull request to add these packages, then I'll merge it and the build will run automatically :)
I tried building SciPy for Lambda, but currently its size exceeds what Lambda accepts.
Lambda has a limit of 50 MB, and SciPy's size is above that (~57 MB). Note this is the result of a pip install scipy, which pulls in not just SciPy but numpy as well.
I will see if we can remove the cache files to reduce the size, but at the moment this is the size :(
Currently the output looks like this:
I will experiment with removing the __pycache__ directories, and also with keeping only the __pycache__ directories (dropping the .py sources), to see what happens.
IIRC the size limit is 250 MB unzipped, rather than 50 MB on upload.
You can significantly cut down the size by deleting all the tests/ directories. Also, you probably don't need the 3 scipy/misc/*.dat test images, and they are large. Deleting all that may cut the package size by ~25% or so.
It used to be possible to get numpy/scipy/pandas in a single layer. I'd be curious what the status is now.
Thanks, I'll check and see if it's possible. But there's a lot of bespoke effort that may be unsustainable.
The Lambda limit is 50MB zipped, and currently the total zipped size is bigger than that :(.
I am also interested in a scipy layer for 3.10+, and can't find a workaround for the size limit. I am not sure if you already do this, but running something like
find . | grep -E "(/tests$|__pycache__$|\.pyc$|\.pyo$)" | xargs rm -rf
before zipping gets rid of files that are not needed in the layer. If that fails, then all you can do is install submodules of scipy separately as needed, which is not ideal.
Friendly ping: was there any progress here? For the custom removal of code, is it possible to automatically inject such package-specific steps into the whole Terraform build script?
If someone could modify the build function, that'd be much appreciated :). I think for now we can remove all pycache files to save space, that may help.
Not sure if this would help at all but this saved a lot of space when building the layer.
docker run -v "$PWD":/var/task "public.ecr.aws/sam/build-python3.9" /bin/sh -c "pip install -r requirements.txt --platform manylinux2014_x86_64 --implementation cp --python-version 3.9 --only-binary=:all: --upgrade --trusted-host pypi.org --trusted-host files.pythonhosted.org -t python/lib/python3.9/site-packages/; exit"
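For reference, a minimal packaging sketch (an addition of mine, not part of the command above): Lambda unpacks layers into /opt and puts /opt/python/lib/python3.9/site-packages on sys.path, so the tree produced by that docker run can be zipped as-is.
# zip the python/ tree produced by the docker command above
zip -r layer.zip python
# sanity-check against the limits discussed in this thread:
# ~50 MB zipped for direct upload, 250 MB unzipped in total
du -sh layer.zip python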
Tested the tests/ deletion and it works. I also added --no-compile and deleted all the dist-info directories, and now NumPy, SciPy and Pandas can be placed in a single layer. All of them take approx. 195 MB, so I have an extra 50 MB left for all of my imagination.
This is some of the most hilarious black magic I've ever seen.
Wow. I need to find some way to automate this. What does --no-compile do?
@keithrozario we have just implemented proper support for this in NumPy, via "install tags" in the build system. Here is how to use it: https://github.com/numpy/numpy/issues/26289#issuecomment-2068943795. I'm planning to do the same for SciPy. It would come down to adding -Cinstall-args="--tags=runtime,devel,python-runtime" to your pip install (or pip wheel, or python -m build) invocation in order to drop the test suite.
--no-compile is a pip flag: https://pip.pypa.io/en/stable/cli/pip_install/#cmdoption-no-compile
Together, that should make all this a one-liner. It should work for NumPy now.
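For illustration, a hedged sketch of what that one-liner could look like (my assumptions: pip >= 23.1 for the -C shorthand, and --no-binary to force a source build, since install tags only apply while the wheel is being built):
# build numpy from source with install tags, dropping the test suite,
# and install it straight into the layer directory
pip install numpy --no-binary numpy -Cinstall-args="--tags=runtime,python-runtime,devel" -t python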
To answer the --no-compile question: it means pip will not precompile the Python code into bytecode during the install. But it's the test suites that consume a lot of megabytes; the bytecode takes just a few megabytes at most.
My approach is summed up in this script:
# install the CPython implementation only
pip install numpy pandas scipy --no-compile --implementation cp -t python
# remove all dist-info directories
rm -r python/*.dist-info
# delete all tests directories
find . | grep -E "/tests$" | xargs rm -rf
# clean up python bytecode, if any
find . | grep -E "(/__pycache__$|\.pyc$|\.pyo$)" | xargs rm -rf
# also delete pyproject.toml, since it is not needed
find . | grep -E "pyproject\.toml$" | xargs rm -rf
# delete the unused .dat files, which are deprecated since scipy 1.10
find . | grep -E "scipy/misc/.*\.dat$" | xargs rm -rf
Btw, I think modifying the bundled source code is not good practice, though.
I don't get why numpy and scipy have their test suites in the wheel, when they don't contribute anything at runtime. I thought it was a sanity check on every import, but it's just the test package from the meson build phase. It's a bummer to have to rebuild numpy just to get rid of the test suite.
@aperture147 that's historical. Once upon a time, many more users built from source, and then it was critical to be able to run tests with numpy.test() in order to diagnose all sorts of weird issues. Having tests in tests/ subfolders of the package used to be very common, maybe even the standard place to store tests.
For new projects started today, the test suite usually goes outside of the importable tree. Moving everything in numpy now, though, would be very disruptive, as it would (among other things) make all open PRs unmergeable.
Thank you so much, all. I'll look into this next week or so, and hopefully we can get a scipy layer out!!!
I'm not sure how much of this is generic (can be applied to all packages) and how much is specific to scipy though. Will have to think a bit more.
AFAIK, SciPy and NumPy are safe to have their tests directories removed. NumPy's tests directory is even larger than SciPy's.
Test layer is here:
arn:aws:lambda:ap-southeast-1:367660174341:layer:Klayers-p312-scipy:1
We pass the --no-compile flag to drop the .pyc and __pycache__ files, and also delete all directories named tests, as recommended by the experts on this thread :)
Feel free to run some tests on the layer. If all goes well, I'll push this to production before the end of this week, and we'll have 'optimized' builds going forward.
I forgot to remove .dat and dist-info as well. That's up next.
I'll need to write docs for it, but this command will already remove test data as well as some large-ish _test_xxx.so extension modules that live outside of the tests/ directories:
$ python -m build -wnx -Cinstall-args=--tags=runtime,python-runtime,devel
It's been available on SciPy's main branch since a week ago (https://github.com/scipy/scipy/pull/20712).
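For anyone wanting to try it, a rough sketch of the full source-build flow (my assumptions: a checkout of scipy main, a working compiler toolchain, and the build dependencies already installed, since -n disables build isolation):
# clone scipy with its submodules and build a wheel with the install tags
git clone --recurse-submodules https://github.com/scipy/scipy.git
cd scipy
python -m build -wnx -Cinstall-args=--tags=runtime,python-runtime,devel
# install the resulting wheel into the layer directory
pip install dist/scipy-*.whl -t ../python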
On removing dist-info: you probably want to keep .dist-info. It's actually a functional part of a package, e.g. importlib.metadata uses it, and the license file is mandatory to keep when you're redistributing. .dist-info is also small, ~100 kB or so. If you really need to shave things off, I'd only remove RECORD, since it's both the largest file and not a very important one within a Lambda layer.
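If it helps, a one-line sketch of that narrower removal (assuming the layer contents live under python/ as in the scripts above):
# drop only the RECORD files; METADATA and the license files stay,
# so importlib.metadata keeps working
find python -path "*.dist-info/RECORD" -delete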
Thanks. Unfortunately, I do not build the package from source; I merely pip install. I'll take your advice on keeping dist-info, but I'll see if I can identify any _test_xxx.so files to remove as well.
I think this is the full list:
$ ls -l build/scipy/*/*.so | rg test
-rwxr-xr-x 1 rgommers rgommers 28664 17 mei 18:04 build/scipy/integrate/_test_multivariate.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers 270968 17 mei 18:04 build/scipy/integrate/_test_odeint_banded.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers 151456 17 mei 18:04 build/scipy/io/_test_fortran.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers 52912 17 mei 18:04 build/scipy/_lib/_test_ccallback.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers 158752 21 mei 13:45 build/scipy/_lib/_test_deprecation_call.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers 92272 21 mei 13:45 build/scipy/_lib/_test_deprecation_def.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers 31336 17 mei 18:04 build/scipy/ndimage/_ctest.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers 386480 21 mei 13:45 build/scipy/ndimage/_cytest.cpython-312-x86_64-linux-gnu.so
-rwxr-xr-x 1 rgommers rgommers 1095216 21 mei 13:45 build/scipy/special/_test_internal.cpython-312-x86_64-linux-gnu.so
Thanks -- the challenge for Klayers at least is that we need to keep the script generic. I'm very hesitant to include package-specific build steps for something like scipy, because maintaining that going forward would be difficult.
Deleting every file that matches _test*.so sounds OK, but it might cause issues with other packages; then again, I'd say the probability that someone has a runtime-required .so file beginning with _test is very low.
Still pondering. Wondering what others are thinking.
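A cautious sketch of what that generic step could look like (hypothetical; note that scipy's _ctest*.so and _cytest*.so files from the listing above would not match this pattern):
# list the candidates first, then delete only after eyeballing the matches
find python -name "_test_*.so" -print
find python -name "_test_*.so" -delete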
Yep, this would be a nightmare to maintain in the long run.
I would be interested to test it out on a fork of this repo though without making a PR to your main repo. Any chance we can make that work?
You could try adding a specific script for a specific library: for example, a file called scipy.sh that customizes the installation (by deleting unwanted files). Then whenever you install scipy, check whether a scipy.sh exists in the repo; if there is one, use scipy.sh instead of a plain pip install scipy to install to the layer.
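A minimal sketch of that hook, assuming a hypothetical overrides/ directory in the repo:
# if an override script exists for the package, run it instead of a plain pip install
PACKAGE="scipy"
if [ -f "overrides/${PACKAGE}.sh" ]; then
    sh "overrides/${PACKAGE}.sh"
else
    pip install "${PACKAGE}" -t python
fi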
I noticed that scipy and numpy both use GFortran and OpenBLAS, but each bundles a slightly different version, stored separately as .so files in the numpy.libs and scipy.libs directories. I'm thinking that there could be a way to make scipy and numpy use the same GFortran and OpenBLAS libraries; then we could save about 25 MB. Is there any way to achieve this @rgommers? I'm not a guru at building statically linked libraries, especially with the meson build system. If we built this layer on amazonlinux2 and dynamically linked some libraries that already exist in that environment, we could shrink the layer even more.
I notice that stripping the Python bytecode increases the cold start time. Should we keep the bytecode to reduce cold starts, or is it just me fiddling too much with the layer?
On sharing GFortran and OpenBLAS between numpy and scipy: not really, when building the layer from wheels published to PyPI. NumPy uses 64-bit (ILP64) OpenBLAS, while SciPy uses 32-bit (LP64). We have a long-term plan to unify these two builds, but PyPI/wheels make this very complex. I would not recommend doing manual surgery here.
Yes. Do you know how much slower the cold start is? Python will need to compile the .py files into bytecode, and that will incur some latency. For big packages this might be a lot, but I'm not sure.
Normally it takes only about 500 ms to 1 s to warm up the Lambda, but now it takes 2 s+ (sometimes up to 5 s+ if I import all of numpy, scipy and pandas) to spin up (tested on a 1024 MB RAM Python 3.10 Lambda function). Either it's a bytecode compilation problem, or it's just me doing too many surgeries on the layer.
No, it's probably the bytecode compilation. Let me think about this a bit more. Bytecode is specific to the Python minor version, so it should be shareable across functions on the same runtime (though not across a runtime upgrade).
But bytecode also takes space, so we have to trade off space against speed. Nothing will work for everyone -- my thought is to remove bytecode only if the package is large.
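For layers that should keep their bytecode, a sketch of shifting the cost to build time instead of the first import (assuming the build uses the same Python minor version as the target runtime):
# precompile .py files into __pycache__ at build time, so the runtime
# doesn't pay the compile cost on a cold start
python -m compileall -q python/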
I love this conversation. I did a test today using just numpy, comparing a layer that had __pycache__ vs. a layer that didn't, on a 128 MB function using Python 3.12.
The findings:
With __pycache__, init times were: 635 ms, 593 ms, 637 ms
Without __pycache__, init times were: 677 ms, 684 ms, 708 ms
This suggests a ~50 ms penalty for compiling the .py files into .pyc. I think unless the package is huge (and numpy is quite big already) you won't see any discernible performance gain. If you tweak the Lambda settings, like memory size, the difference would shrink even further.
Given this, if you're importing something like boto3 or requests, the difference is so small that nobody will notice whether the cache is included or not. For larger packages like numpy and scipy, most (not all) users will want to optimize for space, so that their own code or additional layers can be larger. Defaulting to removing __pycache__ seems like the logical decision.
So from now on, we will remove .pyc files from all layers. Again, this will not meet 100% of everyone's requirements, but it will serve the majority of users the majority of the time. Let me know your thoughts below.
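As a rough local approximation of that penalty (not a real Lambda cold start, but it shows the compile-on-first-import overhead), one can strip the bytecode and then import twice (paths here are illustrative):
# the first run compiles (and writes) bytecode, the second run reuses it
find python -name "__pycache__" -type d | xargs rm -rf
PYTHONPATH=python python -X importtime -c "import numpy" 2>&1 | tail -n 3
PYTHONPATH=python python -X importtime -c "import numpy" 2>&1 | tail -n 3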
Does that mean I can remove the need for separate packages for different versions of python??? Interesting....!!
Since the current AWS Lambda layers don't support scipy on 3.9 and above, it would be great if we could create an ARN for scipy as well. Does anyone know when there will be an AWS layer for scipy for Python 3.9 and 3.10?
I have tried creating a custom layer for scipy that supports 3.9 or 3.10; however, it always gives a C-extension error, or says that the scipy module is broken, when I try to create it from the Cloud9 IDE without numpy and then upload it back to Lambda. Moreover, it is not possible to add scipy from Cloud9 either, because it is above the MB limit that Lambda can handle (the only way is to delete the numpy directories, and then scipy can be successfully installed to Lambda without any errors).
I would really appreciate it if anyone knows when AWS will provide a layer for these versions, just like for 3.7 and 3.8.