piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1
15.64k stars 4.37k forks

Reduce gensim distribution size #1783

Open menshikh-iv opened 6 years ago

menshikh-iv commented 6 years ago

Right now, the size of the gensim wheel/tar.gz is ~16MB. This is less than the previous 50MB+, but still huge. We need to "cut" the big files that are used for tests and rewrite the affected tests.

Previous issue #1698

JensMadsen commented 6 years ago

It would be most convenient if we could build a minimal version of the distribution (including minimal scipy and numpy modules). I just managed to squeeze gensim onto an AWS Lambda using python3, but it was not easy :-)

menshikh-iv commented 6 years ago

@JensMadsen I think we can reduce the size of the gensim distribution from ~15MB to ~3-5MB. Can you describe what you did to shrink it?

JensMadsen commented 6 years ago

Yes of course. I plan to write a blog post somewhere soon :-)

A rough outline of the procedure for squeezing gensim into AWS Lambda:

1) virtualenv --no-site-packages
2) strip .so files, but not all of them, since some scipy modules break (https://github.com/pypa/manylinux/issues/119)
3) delete all tests in numpy, scipy, and gensim
4) just to be sure, delete pycache

In that way I get a sufficiently small zip file
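Steps 3 and 4 of the outline above could be scripted roughly as follows. This is a minimal sketch, not the script that was actually used: the function name `prune_site_packages` and the set of directory names are assumptions made for illustration. Some packages import helpers from their test directories at runtime, so the result should be checked before deploying.

```python
import shutil
from pathlib import Path

# Directory names treated as removable. This set is an assumption
# based on steps 3 and 4 (tests and pycache), not a verified list.
PRUNE_NAMES = {"tests", "test", "__pycache__"}


def prune_site_packages(root, names=PRUNE_NAMES):
    """Delete every directory under `root` whose name is in `names`.

    Returns the number of bytes freed (apparent file sizes).
    """
    root = Path(root)
    freed = 0
    # Materialize and sort the listing first so that parents are visited
    # before their children; a nested match is skipped once its parent
    # has been removed, because is_dir() is then False.
    for path in sorted(root.rglob("*")):
        if path.name in names and path.is_dir() and not path.is_symlink():
            freed += sum(f.stat().st_size for f in path.rglob("*") if f.is_file())
            shutil.rmtree(path)
    return freed
```

Calling `prune_site_packages("d2v_env/lib/python3.6/site-packages")` would return the number of bytes removed, which makes it easy to see how much each pass saves.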

Actually what matters the most is to reduce the size of scipy which to my understanding has grown significantly lately

menshikh-iv commented 6 years ago

@JensMadsen thanks for the information! We have an old issue related to scipy - https://github.com/RaRe-Technologies/gensim/issues/557 (we want to develop a small tool for manipulating sparse matrices and drop scipy as a dependency).

piskvorky commented 6 years ago

Can't wait to finally ditch scipy!

JustinMoser commented 5 years ago

@JensMadsen Hi! Sorry to chime in, but did you ever write a blog post on getting gensim on AWS lambda? Trying to do that now, and gensim is quite...large, when creating a deployment package.

Thanks!

menshikh-iv commented 5 years ago

@JustinMoser sorry, no updates.

As an "ad-hoc" solution, you can extract the wheel manually (a .whl is just an archive), drop the test data (gensim/test/test_data), and use the result on Lambda.
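The ad-hoc approach can be sketched with the standard-library `zipfile` module. The helper name `extract_wheel_without` is hypothetical, and this only unpacks the archive into a directory; it does not produce a new installable wheel, and the `RECORD` file inside the unpacked `.dist-info` directory will still list the skipped files.

```python
import zipfile


def extract_wheel_without(wheel_path, target_dir,
                          skip_prefix="gensim/test/test_data/"):
    """Unpack a wheel into `target_dir`, skipping members under `skip_prefix`.

    Returns a (kept, skipped) tuple of member counts.
    """
    kept = skipped = 0
    with zipfile.ZipFile(wheel_path) as wheel:
        for member in wheel.infolist():
            # Member names inside a zip archive always use forward slashes.
            if member.filename.startswith(skip_prefix):
                skipped += 1
                continue
            wheel.extract(member, target_dir)
            kept += 1
    return kept, skipped
```

The unpacked directory can then be added to the deployment package in place of a regular `pip install` of gensim. The gensim tests that rely on those data files will no longer run from that copy.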

JustinMoser commented 5 years ago

@menshikh-iv Thank you! Pardon me if I'm being dim, but when I install gensim to my deployment directory (using pip install gensim --target .), with the dependencies, it is near the 300mb mark.

menshikh-iv commented 5 years ago

@JustinMoser wow, that sounds impossible. For example, I made a clean installation on Python 2:

-rw-r--r-- 1 ivan ivan 26575351 Jan 16 20:47 scipy-1.2.0-cp27-cp27mu-manylinux1_x86_64.whl
-rw-r--r-- 1 ivan ivan 23630838 Jan 16 20:47 gensim-3.6.0-cp27-cp27mu-manylinux1_x86_64.whl
-rw-r--r-- 1 ivan ivan 16961961 Jan 16 20:47 numpy-1.16.0-cp27-cp27mu-manylinux1_x86_64.whl
-rw-r--r-- 1 ivan ivan  5213503 Jan 16 20:47 botocore-1.12.79-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan  1359202 Jan 16 20:47 boto-2.49.0-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   543728 Jan 16 20:47 docutils-0.14-py2-none-any.whl
-rw-r--r-- 1 ivan ivan   225696 Jan 16 20:47 python_dateutil-2.7.5-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   154154 Jan 16 20:47 certifi-2018.11.29-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   133356 Jan 16 20:47 chardet-3.0.4-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   128504 Jan 16 20:47 boto3-1.9.79-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan   118086 Jan 16 20:47 urllib3-1.24.1-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    59642 Jan 16 20:47 s3transfer-0.1.13-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    58594 Jan 16 20:47 idna-2.8-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    57987 Jan 16 20:47 requests-2.21.0-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    23497 Jan 16 20:47 jmespath-0.9.3-py2.py3-none-any.whl
-rw-r--r-- 1 ivan ivan    15847 Jan 16 20:47 futures-3.2.0-py2-none-any.whl
-rw-r--r-- 1 ivan ivan    10586 Jan 16 20:47 six-1.12.0-py2.py3-none-any.whl

gensim with all deps takes around 72MB as downloaded wheels, so where does 300MB come from? Can you please check what exactly was downloaded?

If you are talking about the installed size, then numpy & scipy are still the top 2 (more than 150MB together):

285756  bbbbbb/lib/python2.7
285368  bbbbbb/lib/python2.7/site-packages
98708   bbbbbb/lib/python2.7/site-packages/scipy
70672   bbbbbb/lib/python2.7/site-packages/numpy
41116   bbbbbb/lib/python2.7/site-packages/gensim
39532   bbbbbb/lib/python2.7/site-packages/scipy/.libs
39516   bbbbbb/lib/python2.7/site-packages/botocore
34944   bbbbbb/lib/python2.7/site-packages/botocore/data
32152   bbbbbb/lib/python2.7/site-packages/gensim/test
30704   bbbbbb/lib/python2.7/site-packages/gensim/test/test_data
30072   bbbbbb/lib/python2.7/site-packages/numpy/.libs
25736   bbbbbb/lib/python2.7/site-packages/numpy/core
11056   bbbbbb/lib/python2.7/site-packages/boto
9136    bbbbbb/lib/python2.7/site-packages/scipy/special
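A similar per-package breakdown can be produced without `du`. The sketch below is only an illustration: the function names `dir_sizes_kb` and `largest` are made up here, and it sums apparent file sizes, so the numbers will differ somewhat from `du`, which reports disk blocks.

```python
import os


def dir_sizes_kb(root):
    """Return {name: size in KiB} for each direct subdirectory of `root`."""
    sizes = {}
    for entry in os.scandir(root):
        if not entry.is_dir(follow_symlinks=False):
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(entry.path):
            for name in filenames:
                full = os.path.join(dirpath, name)
                # Skip symlinks so that shared libraries are not counted twice.
                if not os.path.islink(full):
                    total += os.path.getsize(full)
        sizes[entry.name] = total // 1024
    return sizes


def largest(root, n=5):
    """Return the `n` biggest subdirectories as (name, KiB) pairs."""
    ranked = sorted(dir_sizes_kb(root).items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

Running `largest("bbbbbb/lib/python2.7/site-packages")` on an install like the one listed above would be expected to put scipy, numpy and gensim at the top.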

Unfortunately, I can't help with that.

JensMadsen commented 5 years ago

@JustinMoser I dropped lambdas. Too much hassle. Doing a service in a kubernetes cluster instead :-) This is the content of my Dockerfile from back then:

# Use Amazon Linux 1 as the parent image
FROM amazonlinux:1

# Install Python 3.6
RUN yum -y install python36 python36-pip python36-setuptools python36-virtualenv

# install requirements for gensim
RUN yum -y install git zip gcc gcc-gfortran gcc-c++ blas-devel lapack-devel atlas-devel

# create virtual env for lambda function
RUN python3 -m virtualenv d2v_env --no-site-packages --always-copy
# Note: each RUN starts a fresh shell, so the virtualenv is activated
# in the same RUN as the pip install further down, not here.

# copy python files into docker
RUN mkdir d2v_infer
ADD *.py d2v_infer/
ADD requirements.txt .

# install gensim 
#RUN source d2v_env/bin/activate && pip install --use-wheel gensim
RUN source d2v_env/bin/activate && pip install -r requirements.txt

# Strip shared libraries to save space. Some are excluded because stripping them breaks numpy and scipy: https://github.com/pypa/manylinux/issues/119
RUN cd d2v_env/lib64/python3.6/site-packages/ && find . -name "*.so" | grep -v ufuncs | grep -v fblas | grep -v flapack | grep -v cython_blas | grep -v cython_lapack | grep -v ellip_harm | grep -v odepack | grep -v quadpack | grep -v vode | grep -v lsoda | grep -v iterative | grep -v superlu | grep -v arpack | grep -v trlib | grep -v lbfgs | grep -v qhull | xargs strip

# get lib files
RUN mkdir d2v_infer/lib
RUN find /usr/lib64 -name "libblas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libgfortran.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "liblapack.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libopenblas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libquadmath.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libf77blas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libcblas.*" -exec cp -P {} d2v_infer/lib/ \;
RUN find /usr/lib64 -name "libatlas.*" -exec cp -P {} d2v_infer/lib/ \;

# Copy dependencies 
RUN cp -r d2v_env/lib/python3.6/site-packages/six* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/bz2file* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/boto* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/idna* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/chardet* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/urllib3* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/certifi* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/requests* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/python_dateutil* /d2v_infer/
RUN cp -r d2v_env/lib/python3.6/site-packages/docutils* /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/jmespath /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/boto* /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/s3transfer* /d2v_infer/.
RUN cp -r d2v_env/lib/python3.6/site-packages/smart_open* /d2v_infer/.
RUN cp -r /d2v_env/lib/python3.6/site-packages/dateutil/ /d2v_infer/.
RUN cp -r d2v_env/lib64/python3.6/site-packages/* /d2v_infer/.

# delete __pycache__ if exists
RUN cd d2v_infer && find . -type d -name __pycache__ -exec rm -r {} \+

# Delete tests to reduce size
RUN cd d2v_infer && find . -type d -name tests -exec rm -r {} \+
RUN cd d2v_infer && find . -type d -name test -exec rm -r {} \+

# zip it up 
RUN cd /d2v_infer && zip -r -q /d2v.zip ./*

# aws s3 cp gensim_dist.zip s3://onlaw-d2v/deployment_packages.zip
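
The `grep -v` chain in the strip step of the Dockerfile above is the hardest part to read, so here is a rough Python equivalent of the selection logic. It is only a sketch: the function names are invented for illustration, the exclusion substrings are copied from the Dockerfile, and `strip_libs` assumes the binutils `strip` command is on the PATH.

```python
import subprocess
from pathlib import Path

# Substrings copied from the `grep -v` chain in the Dockerfile above.
KEEP_UNSTRIPPED = (
    "ufuncs", "fblas", "flapack", "cython_blas", "cython_lapack",
    "ellip_harm", "odepack", "quadpack", "vode", "lsoda", "iterative",
    "superlu", "arpack", "trlib", "lbfgs", "qhull",
)


def strippable_libs(root, keep=KEEP_UNSTRIPPED):
    """Return sorted *.so paths under `root` whose relative path
    contains none of the `keep` substrings."""
    root = Path(root)
    return sorted(
        path for path in root.rglob("*.so")
        if not any(token in path.relative_to(root).as_posix() for token in keep)
    )


def strip_libs(root, keep=KEEP_UNSTRIPPED):
    """Run `strip` on every selected library; returns how many were processed."""
    libs = strippable_libs(root, keep)
    for lib in libs:
        subprocess.run(["strip", str(lib)], check=True)
    return len(libs)
```

Keeping the selection in its own function means the list of files can be printed and reviewed before anything is modified.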