Open menshikh-iv opened 6 years ago
It would be most convenient if we could build a minimal version of the distribution (including minimal scipy and numpy modules). I just managed to squeeze gensim onto AWS Lambda using python3, but it was not easy :-)
@JensMadsen I think we can reduce the size of the gensim distribution from ~15MB to ~3-5MB. Can you describe what you did to compress it?
Yes of course. I plan to write a blog post somewhere soon :-)
A rough outline of the procedure for squeezing gensim into AWS Lambda:
1) virtualenv --no-site-packages
2) strip .so files, but not all of them, since some scipy modules break (https://github.com/pypa/manylinux/issues/119)
3) delete all tests in numpy, scipy, and gensim
4) just to be sure, delete __pycache__
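Steps 3 and 4 of the procedure can be sketched in Python roughly as follows. This is a minimal sketch, not the exact method used in this thread: the function name `prune_site_packages` is hypothetical, and it assumes that every directory named `tests`, `test`, or `__pycache__` is safe to delete, which may not hold for every package.

```python
import shutil
from pathlib import Path

# Directory names removed in steps 3 and 4 (assumption: all are safe to delete).
PRUNE_DIR_NAMES = {"tests", "test", "__pycache__"}


def prune_site_packages(root):
    """Delete test and bytecode-cache directories under root; return what was removed."""
    root = Path(root)
    # Collect matches first so the tree is not modified while it is being walked.
    doomed = [p for p in root.rglob("*") if p.is_dir() and p.name in PRUNE_DIR_NAMES]
    removed = []
    for path in sorted(doomed):
        if path.exists():  # may already be gone if a parent directory was removed
            shutil.rmtree(path)
            removed.append(path.relative_to(root).as_posix())
    return removed
```

Calling `prune_site_packages("d2v_infer")` on a copy of the site-packages tree returns the relative paths it deleted, which makes it easy to review what was dropped before zipping.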
In that way I get a sufficiently small zip file
Actually, what matters most is reducing the size of scipy, which to my understanding has grown significantly lately.
@JensMadsen thanks for the information! We have an old issue related to scipy - https://github.com/RaRe-Technologies/gensim/issues/557 (we want to develop a small tool for manipulating sparse matrices and drop scipy as a dependency).
Can't wait to finally ditch scipy!
@JensMadsen Hi! Sorry to chime in, but did you ever write a blog post on getting gensim on AWS lambda? Trying to do that now, and gensim is quite...large, when creating a deployment package.
Thanks!
@JustinMoser sorry, no updates.
As an "ad-hoc" solution, you can manually extract the wheel, drop the test data (gensim/test/test_data), and use the result on Lambda (a .whl is just an archive).
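Since a wheel is a zip archive, this workaround can be sketched with the standard `zipfile` module. The function name `extract_wheel_without` and its parameters are hypothetical, and the sketch only unpacks files: it does not update the wheel's `RECORD` metadata, and any gensim test that needs the dropped data will fail.

```python
import zipfile


def extract_wheel_without(wheel_path, dest_dir, skip_prefixes=("gensim/test/test_data/",)):
    """Unpack a wheel into dest_dir, skipping archive members under skip_prefixes."""
    skip = tuple(skip_prefixes)
    kept = []
    with zipfile.ZipFile(wheel_path) as wheel:
        for name in wheel.namelist():
            if name.startswith(skip):
                continue  # leave the bulky test data out of the deployment directory
            wheel.extract(name, dest_dir)
            kept.append(name)
    return kept
```

The returned list contains the archive members that were actually written, so the saved space can be checked by comparing it with `namelist()` of the original wheel.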
@menshikh-iv Thank you! Pardon me if I'm being dim, but when I install gensim to my deployment directory (using pip install gensim --target .), with the dependencies, it is near the 300mb mark.
@JustinMoser wow, that sounds impossible. For example, I made a clean installation on python2:
-rw-r--r-- 1 ivan ivan 26575351 Jan 16 20:47 scipy-1.2.0-cp27-cp27mu-manylinux1_x86_64.whl
-rw-r--r-- 1 ivan ivan 23630838 Jan 16 20:47 gensim-3.6.0-cp27-cp27mu-manylinux1_x86_64.whl
-rw-r--r-- 1 ivan ivan 16961961 Jan 16 20:47 numpy-1.16.0-cp27-cp27mu-manylinux1_x86_64.whl
-rw-r--r-- 1 ivan ivan 5213503 Jan 16 20:47 botocore-1.12.79-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 1359202 Jan 16 20:47 boto-2.49.0-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 543728 Jan 16 20:47 docutils-0.14-py2-none-any.whl
-rw-r--r-- 1 ivan ivan 225696 Jan 16 20:47 python_dateutil-2.7.5-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 154154 Jan 16 20:47 certifi-2018.11.29-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 133356 Jan 16 20:47 chardet-3.0.4-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 128504 Jan 16 20:47 boto3-1.9.79-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 118086 Jan 16 20:47 urllib3-1.24.1-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 59642 Jan 16 20:47 s3transfer-0.1.13-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 58594 Jan 16 20:47 idna-2.8-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 57987 Jan 16 20:47 requests-2.21.0-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 23497 Jan 16 20:47 jmespath-0.9.3-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan 15847 Jan 16 20:47 futures-3.2.0-py2-none-any.whl
-rw-r--r-- 1 ivan ivan 10586 Jan 16 20:47 six-1.12.0-py2.py3-none-any.whl
gensim with all deps takes around 72M, so where does the 300MB come from? Can you please check what exactly was downloaded?
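As a quick check of the ~72M figure, the wheel sizes from the listing above can simply be summed. The byte counts below are copied from that listing; only the variable names are made up for this sketch.

```python
# Wheel sizes in bytes, copied from the directory listing above.
wheel_sizes = [
    ("scipy", 26575351), ("gensim", 23630838), ("numpy", 16961961),
    ("botocore", 5213503), ("boto", 1359202), ("docutils", 543728),
    ("python_dateutil", 225696), ("certifi", 154154), ("chardet", 133356),
    ("boto3", 128504), ("urllib3", 118086), ("s3transfer", 59642),
    ("idna", 58594), ("requests", 57987), ("jmespath", 23497),
    ("futures", 15847), ("six", 10586),
]

total_bytes = sum(size for _name, size in wheel_sizes)
# Share of the three largest wheels (scipy, gensim, numpy).
top3_bytes = sum(size for _name, size in wheel_sizes[:3])

print(total_bytes)                             # 75270532
print(round(total_bytes / 2**20, 1))           # 71.8 (MiB)
print(round(100 * top3_bytes / total_bytes))   # 89 (percent)
```

So the downloaded wheels add up to about 72 MiB, and scipy, gensim, and numpy account for roughly 89% of it.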
If you are talking about the installed size, then numpy & scipy are still the top 2 (more than 150MB together):
285756 bbbbbb/lib/python2.7
285368 bbbbbb/lib/python2.7/site-packages
98708 bbbbbb/lib/python2.7/site-packages/scipy
70672 bbbbbb/lib/python2.7/site-packages/numpy
41116 bbbbbb/lib/python2.7/site-packages/gensim
39532 bbbbbb/lib/python2.7/site-packages/scipy/.libs
39516 bbbbbb/lib/python2.7/site-packages/botocore
34944 bbbbbb/lib/python2.7/site-packages/botocore/data
32152 bbbbbb/lib/python2.7/site-packages/gensim/test
30704 bbbbbb/lib/python2.7/site-packages/gensim/test/test_data
30072 bbbbbb/lib/python2.7/site-packages/numpy/.libs
25736 bbbbbb/lib/python2.7/site-packages/numpy/core
11056 bbbbbb/lib/python2.7/site-packages/boto
9136 bbbbbb/lib/python2.7/site-packages/scipy/special
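A listing like the one above can also be produced from Python, which is handy inside a build script. This is a minimal sketch with hypothetical helper names (`dir_size`, `largest_packages`); it sums apparent file sizes in bytes, so the numbers will differ somewhat from `du`, which counts disk blocks.

```python
import os


def dir_size(path):
    """Total size in bytes of all regular files below path (symlinks are skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for fname in filenames:
            fpath = os.path.join(dirpath, fname)
            if not os.path.islink(fpath):
                total += os.path.getsize(fpath)
    return total


def largest_packages(site_packages, top=5):
    """Return (name, bytes) for the largest top-level directories, biggest first."""
    sizes = []
    for entry in os.scandir(site_packages):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, dir_size(entry.path)))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:top]
```

For example, `largest_packages("bbbbbb/lib/python2.7/site-packages")` would be expected to rank scipy, numpy, and gensim first for the installation shown above.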
Unfortunately, I can't help with it.
@JustinMoser I dropped lambdas. Too much hassle. I'm running a service in a Kubernetes cluster instead :-) This is the content of my Dockerfile from back then:
# Use Amazon Linux 1 as the parent image
FROM amazonlinux:1
# install Python 3.6
RUN yum -y install python36 python36-pip python36-setuptools python36-virtualenv
# install requirements for gensim
RUN yum -y install git
RUN yum -y install zip
RUN yum -y install gcc
RUN yum -y install gcc-gfortran
RUN yum -y install gcc-c++
RUN yum -y install blas-devel
RUN yum -y install lapack-devel
RUN yum -y install atlas-devel
# create virtual env for lambda function
RUN python3 -m virtualenv d2v_env --no-site-packages --always-copy
# note: activation does not persist across RUN instructions, so later steps re-activate explicitly
RUN source d2v_env/bin/activate
# copy python files into docker
RUN mkdir d2v_infer
ADD *.py d2v_infer/
ADD requirements.txt .
# install gensim
#RUN source d2v_env/bin/activate && pip install --use-wheel gensim
RUN source d2v_env/bin/activate && pip install -r requirements.txt
# strip to save space. The exclusions are necessary because stripping some numpy and scipy .so files breaks them, see https://github.com/pypa/manylinux/issues/119
RUN cd d2v_env/lib64/python3.6/site-packages/ && find . -name "*.so" | grep -v ufuncs | grep -v fblas | grep -v flapack | grep -v cython_blas | grep -v cython_lapack | grep -v ellip_harm | grep -v odepack | grep -v quadpack | grep -v vode | grep -v lsoda | grep -v iterative | grep -v superlu | grep -v arpack | grep -v trlib | grep -v lbfgs | grep -v qhull | xargs strip
# get lib files
RUN mkdir d2v_infer/lib
RUN find /usr/lib64 -name "libblas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libgfortran.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "liblapack.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libopenblas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libquadmath.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libf77blas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libcblas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libatlas.*" -exec cp -P {} d2v_infer/lib/ \;
# Copy dependencies
RUN cp -r d2v_env/lib/python3.6/site-packages/six* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/bz2file* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/boto* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/idna* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/chardet* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/urllib3* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/certifi* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/requests* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/python_dateutil-2.7.3.dist-info/ /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/docutils* /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/jmespath /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/boto* /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/s3transfer* /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/smart_open* /d2v_infer/.
RUN cp -r /d2v_env/lib/python3.6/site-packages/dateutil/ /d2v_infer/.
RUN cp -r d2v_env/lib64/python3.6/site-packages/* /d2v_infer/.
# delete __pycache__ if exists
RUN cd d2v_infer && find . -type d -name __pycache__ -exec rm -r {} \+
# Delete tests to reduce size
RUN cd d2v_infer && find . -type d -name tests -exec rm -r {} \+
RUN cd d2v_infer && find . -type d -name test -exec rm -r {} \+
# zip it up
RUN cd /d2v_infer && zip -r -q /d2v.zip ./*
# aws s3 cp gensim_dist.zip s3://onlaw-d2v/deployment_packages.zip
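The selective `strip` step in the Dockerfile above is the least obvious part, so here is a rough Python equivalent. It is only a sketch: the function names are hypothetical, the exclusion substrings are copied from the `grep -v` chain, and it assumes a `strip` binary is available on the PATH.

```python
import subprocess
from pathlib import Path

# Substrings copied from the `grep -v` chain: matching files are left unstripped.
KEEP_UNSTRIPPED = (
    "ufuncs", "fblas", "flapack", "cython_blas", "cython_lapack", "ellip_harm",
    "odepack", "quadpack", "vode", "lsoda", "iterative", "superlu", "arpack",
    "trlib", "lbfgs", "qhull",
)


def strip_candidates(site_packages, keep=KEEP_UNSTRIPPED):
    """Return the *.so files whose relative path contains none of the keep substrings."""
    root = Path(site_packages)
    return sorted(
        p for p in root.rglob("*.so")
        if not any(token in p.relative_to(root).as_posix() for token in keep)
    )


def strip_shared_objects(site_packages, keep=KEEP_UNSTRIPPED):
    """Run `strip` on the selected files (assumes binutils is installed)."""
    paths = strip_candidates(site_packages, keep)
    if paths:
        subprocess.run(["strip", *map(str, paths)], check=True)
    return paths
```

Keeping the selection in `strip_candidates` separate from the call to `strip` makes it possible to print and review the file list first, since stripping the wrong file can break scipy at import time.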
Right now, the size of the gensim wheel/tar.gz is ~16MB. This is less than 50MB+, but still huge. We need to "cut" the big files that are used for tests and rewrite the affected tests.
Previous issue: #1698