Run pipeline on AWS - Githubissues

sergii-mamedov commented 9 months ago

Overview

The goal of this PR is the final migration from IBM Cloud to AWS (transfer everything related to the lithops pipeline). The main milestones are listed and briefly described below.

Docker images

We have a separate Dockerfiles for AWS EC2 and AWS Lambda from which we create Docker images. For EC2, the docker image saved on Docker Hub, for Lambda on AWS ECR.

AWS EC2

We continue to use a consume mode for now. This approach has its pros and cons. EC2 instance startup time is about 30 seconds, which should be faster than for a create mode. Unfortunately, spot instances not available for consume mode. It is worth investigating whether create mode will be relevant for us in terms of saving money (due to the possibility of using spot instances). However, I changed the logic of using EC2 instances. Previously, we only had one virtual machine with 128 or 256 GB of RAM. Due to the 10GB RAM limit for the Lambda function, we will be using EC2 more often. So I created four separate EC2 instances with 32, 64, 128 and 256 GB of RAM and lithops executor will run over these instances with the least amount of memory. Due to the fact that we are using an old version of python (3.8), we are forced to use Ubuntu 20.04. There are some dependency issues for Ubuntu 20.04 + Python 3.8 when automatically deploying Lithops on EC2. Because of this, I installed all packages and dependencies manually with pinned version:

sudo apt update
sudo apt upgrade
sudo apt install python3-pip

sudo pip install PyYAML==5.4.1
sudo pip install requests==2.31.0
sudo pip install httplib2==0.19.0
sudo pip install urllib3==1.26.16
sudo pip install pyOpenSSL==23.2.0
sudo pip install tblib==1.7.0
sudo pip install flask gevent
sudo pip install lithops==3.1.0

sudo apt-get clean && sudo  rm -rf /var/lib/apt/lists/*

By upgrading to Ubuntu 22.04 and Python 3.10 we should get rid of this problem.

AWS Lambda

For the Lambda function, I decided to leave the same runtime_memory values as for IBM Code Engine. It is a bit illogical because the ratio between CPU and RAM is different, but at this stage it will allow us to compare the speed and price between AWS and IBM. Also, each environment has its own lambda functions. It will allow us to split costs between production/staging/dev environments. There is still a problem with Cloudwatch logs and lambda function mapping.

Python code

In the annotation_job.py file, the function responsible for uploading files on IBM COS was deleted.
In the build_moldb.py file, ThreadPoolExecutor was used instead of ProcessPoolExecutor, because the latter is not supported in AWS Lambda.
In executor.py file: 3.1 RUNTIME_*** constants were removed, their values were moved to config.json 3.2 Memory limits for various EC2s have been added in MEM_LIMITS. 3.3 Added new parameters saved by perf profiler. The volume of used RAM is now stored in MB instead of KB before. 3.4 Changed the logic of creating AWS executors and selecting an executor both in the first step and in the case of OOM
In load_ds.py file: 4.1 Removed all print statements. 4.2 Started saving the time of each operation separately inside the _sort_spectra function. I want to have np.argsort() time separately. 4.3 We no longer need to store imzML browser files in a temporary bucket. We store them immediately in the appropriate bucket. 4.4. Because the MD5 hash calculation works in one thread, for large datasets it takes more time than the actual sorting. Due to the high price of an EC2 instance, I moved the calculation from the load_ds step.
So far I've moved the MD5 hash calculation to pipeline.py, although I still don't think it's the optimal solution.

Buckets

Two additional buckets were created for each environment:

sm-centroids-staging - for storing moldb and peak centroids files
sm-lithops-temp-staging - for temporary storing input/output files for each pipeline step

IAM roles

sm-lithops-staging

Config file

Everything related to IBM Cloud has been completely removed.

Not finished yet:

Fixing of one test: #1471
Fetching data for calculating the total cost of each step and processing the entire dataset.

lmacielvieira commented 6 months ago

Changes are needed to run the pipeline locally with docker. Given that two buckets were created, it is needed to change the config files and main sm-engine Dockerfile script to create it. i.e https://github.com/metaspace2020/metaspace/blob/c74d385213dd3c5f87ba57c2cd8d4c6262b913c6/metaspace/engine/conf/config.json.template#L194 https://github.com/metaspace2020/metaspace/blob/c74d385213dd3c5f87ba57c2cd8d4c6262b913c6/docker/docker-compose.yml#L205

sergii-mamedov commented 5 months ago