metaspace2020 / Lithops-METASPACE

Lithops-based Serverless implementation of the METASPACE spatial metabolomics annotation pipeline

reduce metaspace2020 in Docker #39

Closed gilv closed 5 years ago

gilv commented 5 years ago

PyWren needs the annotation package in order to run our demo notebooks. So far we use a Docker image that includes:

  RUN pip install metaspace2020

This command installs the annotation package. However, it also installs all kinds of unnecessary packages, like Matplotlib, which together add about 150 MB. We need to find a way to add the annotation runtime without installing Matplotlib.

LachlanStuart commented 5 years ago

@gilv That was only added recently without a PR, and I don't think it should have been added.

The metaspace2020 package is only used by the notebooks for validation & visualization. It's not used at all by the Functions that do processing in IBM Cloud. It shouldn't be needed in the runtime.

Is there any way to stop PyWren from trying to include these dependencies? I think this is also related to the issue that @omerb01 raised: https://github.com/metaspace2020/pywren-annotation-pipeline/issues/24 where PyWren tries to include Jupyter.

omerb01 commented 5 years ago

@LachlanStuart @gilv I did some tests and found that PyWren serialises the metaspace module because of this line: https://github.com/metaspace2020/pywren-annotation-pipeline/blob/20e9db6335a229452d599527922a5fa3de68e914/annotation_pipeline/check_results.py#L7. PyWren needs to serialise the annotation_pipeline module, which uses metaspace inside. PyWren's map function has an exclude_modules parameter that forces the listed modules to be skipped during serialisation. I suggest passing every map function a list of modules taken from config.json. How does that sound?
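To illustrate why a single module-level import can pull a whole package into the serialised payload, here is a minimal sketch that lists the top-level imports of a piece of source code using only the standard library. It is not PyWren's actual dependency discovery; `top_level_imports` and `EXAMPLE_SOURCE` are hypothetical names, and the example source is a stand-in, not the real contents of check_results.py.

```python
import ast


def top_level_imports(source: str) -> set:
    """Return the root package names imported at module level in `source`."""
    names = set()
    # Only inspect statements directly in the module body, because these
    # run as soon as the module is loaded.
    for node in ast.parse(source).body:
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names.add(node.module.split(".")[0])
    return names


# Hypothetical stand-in for a module that imports metaspace at the top.
EXAMPLE_SOURCE = '''
import numpy as np
from metaspace.sm_annotation_utils import SMInstance

def check(results):
    import json  # function-level import, not reported by the scan
    return results
'''

found = top_level_imports(EXAMPLE_SOURCE)
print(sorted(found))
```

Any module that a function depends on and that imports metaspace at the top level this way would bring metaspace along unless it is explicitly excluded.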

LachlanStuart commented 5 years ago

@omerb01 Are you suggesting a different list of exclude_modules for each function, or just one list for all functions? I would prefer one list for all functions for simplicity, if it's possible.

It's a shame that PyWren doesn't have the opposite logic - an include_modules list would be shorter than an exclude_modules list, and it wouldn't need to be updated every time a new visualization library (or similar) is added to the notebooks.

Regarding where to put it - I would suggest keeping the list as a constant somewhere in code so that it's automatically synchronized via git. config.json isn't checked into git, so people have to update it manually, which means they would occasionally hit issues when they forget to do it.
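A minimal sketch of that idea, assuming a map call that accepts an exclude_modules keyword as described earlier in the thread. `EXCLUDE_MODULES`, `map_excluding`, and `FakeExecutor` are hypothetical names, the listed modules are only examples, and the fake executor stands in for a real PyWren executor so the snippet runs on its own.

```python
# One list shared by every function, kept in code so git keeps it in sync.
# The module names below are illustrative assumptions.
EXCLUDE_MODULES = ["metaspace", "matplotlib", "IPython"]


class FakeExecutor:
    """Stand-in for a PyWren executor; it only records what map() receives."""

    def __init__(self):
        self.calls = []

    def map(self, func, iterdata, exclude_modules=None):
        # Record the exclusion list, then run the function locally.
        self.calls.append(list(exclude_modules or []))
        return [func(item) for item in iterdata]


def map_excluding(executor, func, iterdata, **kwargs):
    """Call executor.map with the shared exclusion list unless overridden."""
    kwargs.setdefault("exclude_modules", EXCLUDE_MODULES)
    return executor.map(func, iterdata, **kwargs)


executor = FakeExecutor()
results = map_excluding(executor, lambda x: x * 2, [1, 2, 3])
print(results, executor.calls)
```

Routing every map call through one helper means a new notebook-only library needs to be added in a single place.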

omerb01 commented 5 years ago

@LachlanStuart Yes, I suggest just one list for all functions. The include_modules parameter suggestion sounds great; I will discuss it on the PyWren side.

JosepSampe commented 5 years ago

@gilv @omerb01 @LachlanStuart I recently added a new Dockerfile that generates a slim image for PyWren. It takes only 307 MB compared to 1.2 GB for the default image ibmfunctions/action-python-v3.6. So I suggest testing that Dockerfile, including all the packages you need, if you want to reduce the final image size.

JosepSampe commented 5 years ago

@omerb01 Just out of curiosity: what was the size of the old image, and what is the size of the current image using Dockerfile.slim-python36?

omerb01 commented 5 years ago

@JosepSampe 342.12 MB for the old one, 140.48 MB for the new slim one. I took this info from my Docker account, and it includes the modules for the annotation-pipeline project.

omerb01 commented 5 years ago

Due to #45, we can close this.