Closed by sergii-mamedov, 5 months ago
Changes are needed to run the pipeline locally with Docker. Given that two buckets were created, the config files and the main sm-engine Dockerfile script need to be changed to create them, e.g. https://github.com/metaspace2020/metaspace/blob/c74d385213dd3c5f87ba57c2cd8d4c6262b913c6/metaspace/engine/conf/config.json.template#L194 and https://github.com/metaspace2020/metaspace/blob/c74d385213dd3c5f87ba57c2cd8d4c6262b913c6/docker/docker-compose.yml#L205
This will be fixed in a separate PR.
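For local Docker runs, the missing piece is creating the two buckets on the local S3-compatible endpoint. A minimal sketch of what that could look like — the endpoint URL, credentials, MinIO assumption, and helper names here are illustrative guesses, not the actual sm-engine Dockerfile logic:

```python
# Hedged sketch: create the two new buckets against a local
# S3-compatible endpoint (e.g. a MinIO container). Endpoint,
# credentials and function names are assumptions for illustration.

def bucket_names(env: str) -> list:
    """The two buckets referenced in this PR, parameterised by environment."""
    return [f"sm-centroids-{env}", f"sm-lithops-temp-{env}"]

def create_local_buckets(env: str = "dev") -> None:
    import boto3  # imported lazily so the helpers above stay dependency-free

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",  # assumed local endpoint
        aws_access_key_id="minioadmin",        # assumed local credentials
        aws_secret_access_key="minioadmin",
    )
    for name in bucket_names(env):
        s3.create_bucket(Bucket=name)

# Usage (against a running local S3 service):
#     create_local_buckets("dev")
```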
Overview
The goal of this PR is the final migration from IBM Cloud to AWS (transferring everything related to the Lithops pipeline). The main milestones are listed and briefly described below.
Docker images
We have separate `Dockerfiles` for AWS EC2 and AWS Lambda from which we create Docker images. For EC2, the Docker image is saved on Docker Hub; for Lambda, on AWS ECR.

AWS EC2
We continue to use `consume` mode for now. This approach has its pros and cons. EC2 instance startup time is about 30 seconds, which should be faster than for `create` mode. Unfortunately, spot instances are not available for `consume` mode. It is worth investigating whether `create` mode would be relevant for us in terms of saving money (due to the possibility of using spot instances).

However, I changed the logic of using EC2 instances. Previously, we had only one virtual machine with 128 or 256 GB of RAM. Due to the 10 GB RAM limit for a Lambda function, we will be using EC2 more often. So I created four separate EC2 instances with 32, 64, 128 and 256 GB of RAM, and the Lithops executor will run on the instance with the least amount of memory that is sufficient for the job. Because we are using an old version of Python (3.8), we are forced to use Ubuntu 20.04. There are some dependency issues for Ubuntu 20.04 + Python 3.8 when automatically deploying Lithops on EC2, so I installed all packages and dependencies manually with pinned versions. By upgrading to Ubuntu 22.04 and Python 3.10 we should get rid of this problem.
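The instance-selection logic above can be sketched as follows. This is a minimal illustration: the tier values mirror the four instances described here, but the function names and the escalation-on-OOM behaviour are assumptions based on this description, not the actual `executor.py` code:

```python
# Hypothetical sketch of memory-tier selection for the EC2 executors.
# The real engine keeps its limits in MEM_LIMITS inside executor.py;
# the helpers below are illustrative only.
MEM_LIMITS_GB = [32, 64, 128, 256]  # the four EC2 instance sizes

def pick_instance(required_gb: float) -> int:
    """Return the smallest EC2 memory tier that can hold the job."""
    for limit in MEM_LIMITS_GB:
        if required_gb <= limit:
            return limit
    raise MemoryError(f"no instance large enough for {required_gb} GB")

def next_tier_after_oom(current_gb: int) -> int:
    """On an out-of-memory failure, escalate to the next larger tier."""
    idx = MEM_LIMITS_GB.index(current_gb)
    if idx + 1 >= len(MEM_LIMITS_GB):
        raise MemoryError("already on the largest instance")
    return MEM_LIMITS_GB[idx + 1]
```

A failed step would then be retried via `next_tier_after_oom(...)`, in the spirit of the OOM handling described for `executor.py` in the Python code section.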
AWS Lambda
For the Lambda functions, I decided to keep the same `runtime_memory` values as for IBM Code Engine. This is a bit illogical because the ratio between CPU and RAM is different, but at this stage it will allow us to compare the speed and price between AWS and IBM. Also, each environment has its own Lambda functions, which will allow us to split costs between the production/staging/dev environments. There is still a problem with mapping CloudWatch logs to Lambda functions.

Python code
1. In the `annotation_job.py` file, the function responsible for uploading files to IBM COS was deleted.
2. In the `build_moldb.py` file, `ThreadPoolExecutor` is used instead of `ProcessPoolExecutor`, because the latter is not supported in AWS Lambda.
3. In the `executor.py` file:
   3.1 The `RUNTIME_***` constants were removed; their values were moved to config.json.
   3.2 Memory limits for the various EC2 instances were added in `MEM_LIMITS`.
   3.3 New parameters saved by the perf profiler were added. The volume of used RAM is now stored in MB instead of KB.
   3.4 The logic of creating AWS executors and selecting an executor was changed, both at the first step and in the case of OOM.
4. In the `load_ds.py` file:
   4.1 Removed all print statements.
   4.2 Started saving the time of each operation separately inside the `_sort_spectra` function; I want to have the `np.argsort()` time separately.
   4.3 We no longer need to store imzML browser files in a temporary bucket; we store them immediately in the appropriate bucket.
   4.4 Because the MD5 hash calculation runs in one thread, for large datasets it takes more time than the actual sorting. Due to the high price of an EC2 instance, I moved the calculation from the `load_ds` step to `pipeline.py`, although I still don't think it's the optimal solution.

Buckets
Two additional buckets were created for each environment:

- `sm-centroids-staging` - for storing moldb and peak centroids files
- `sm-lithops-temp-staging` - for temporarily storing input/output files for each pipeline step

IAM roles
`sm-lithops-staging`
Config file
Everything related to IBM Cloud has been completely removed.
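As an illustration of the shape of the remaining AWS-side storage config, here is a hedged sketch written as a Python dict so it can carry comments. The section and key names are assumptions for illustration; the real structure lives in `config.json.template` and may differ:

```python
# Hedged sketch of the bucket-related config shape after the migration.
# Key names ("lithops", "sm_storage", etc.) are assumptions, not the
# actual config.json.template keys; bucket names are the staging ones
# listed in the Buckets section above.
AWS_STORAGE_CONFIG = {
    "lithops": {
        "sm_storage": {                                 # hypothetical section name
            "centroids": "sm-centroids-staging",        # moldb + peak centroids files
            "lithops_temp": "sm-lithops-temp-staging",  # per-step temporary input/output
        }
    }
}
```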
Not finished yet: