@aatmanvaidya @duggalsu Here's the list of EC2 instance types offered by AWS: https://aws.amazon.com/ec2/pricing/on-demand/. It lets you choose between memory optimized, storage optimized, and compute optimized instances, and also between core counts and RAM sizes. Review it and make a list for me of which instance types are worth evaluating.
I had an issue setting up feluda on my machine:

INFO: pip is looking at multiple versions of [...] to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 13) and urllib3==2.0.7 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested urllib3==2.0.7
    botocore 1.34.19 depends on urllib3<1.27 and >=1.25.4; python_version < "3.10"

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
Flagging this as something that might trip us up. This was with Python 3.9.18 on Ubuntu 20.04.2 LTS, using commit c35079.
This shouldn't have happened, but it is happening because we have not upgraded boto3 (and other packages) to the latest versions in feluda core. So there are dependency mismatches when generating requirements.txt for operators and core, and I've been manually downgrading botocore across requirements files and regenerating them to maintain compatibility.
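In the meantime, the workaround I'd suggest (an assumption on my part, not something applied to the repo yet) is to relax the top-level pin to a range botocore 1.34.19 can satisfy, e.g. urllib3>=1.25.4,<1.27 in the requirements.in files, and regenerate requirements.txt from there until boto3/botocore are upgraded in feluda core.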
A note on trying to push the operator to its limit.
I removed the check around file size and tried processing a 1-hour-long video that was 800 MB in size. The Python process exits after 10 or so seconds. My rudimentary observation of htop tells me that all 12 cores don't run at full capacity, but memory usage increases with time and eventually the process runs out of memory. So I think in the short run, keeping the file size limit might be useful to prevent large files from causing a crash; a sketch of such a guard is below.
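A minimal sketch of that guard, assuming we can stat the file before handing it to the operator (the limit value and the function name are hypothetical):

import os

MAX_FILE_SIZE_MB = 100  # hypothetical limit; tune to the node's available memory

def check_file_size(path: str) -> None:
    """Reject files likely to exhaust memory before we start decoding frames."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > MAX_FILE_SIZE_MB:
        raise ValueError(
            f"{path} is {size_mb:.0f} MB, above the {MAX_FILE_SIZE_MB} MB limit"
        )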
A neat thing: I eventually got the operator to run on this 1-hour video without running into an out-of-memory error! Caveat: I just got the operator to run; I can't say anything yet about the search-result implications of it.
I figured out that the cause of the out-of-memory error was this function:
def extract_frames(self, v):
    # print("extracting frames")
    images = []
    for i in range(self.n_frames):
        success, image = v.read()
        if image is None:
            continue
        else:
            if i % self.sampling_rate == 0:
                images.append(Image.fromarray(image))
    # print("extracted frames")
    return images
Every sampled frame of the video is accumulated in the images list before the function returns, so memory usage grows with the length of the video. Hence the out-of-memory error.
I tried a rudimentary trick: convert this into a generator that yields 100 frames at a time:
def extract_frames(self, v):
    # Yield sampled frames in chunks of 100 so the whole video is never
    # held in memory at once.
    chunk_size = 100
    for start in range(0, self.n_frames, chunk_size):
        images = []
        for offset in range(min(chunk_size, self.n_frames - start)):
            success, image = v.read()
            if image is None:
                continue
            # Sample against the absolute frame index, not the chunk offset.
            if (start + offset) % self.sampling_rate == 0:
                images.append(Image.fromarray(image))
        yield images
and the corresponding change in the analyze function to consume this generator:
def analyze(self, video):
    # Process the video chunk by chunk. Note that each iteration currently
    # overwrites the keyframes computed for the previous chunk.
    for frames in self.extract_frames(video):
        feature_matrix = self.extract_features(frames)
        self.keyframe_indices = self.find_keyframes(feature_matrix)
        self.keyframe_features = feature_matrix[:, self.keyframe_indices]
Result: the function took 625.5453 seconds to run.
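Related to the caveat above about search-result implications: the chunked analyze keeps only the last chunk's keyframes, since each iteration overwrites them. A sketch of one way to accumulate keyframes across chunks instead; it assumes extract_features returns a NumPy matrix with one column per frame (matching the column indexing above) and is untested against the real operator:

import numpy as np

def analyze(self, video):
    # Collect keyframe features from every chunk instead of keeping
    # only the last chunk's result.
    collected = []
    for frames in self.extract_frames(video):
        if not frames:  # a chunk can be empty if frame reads fail
            continue
        feature_matrix = self.extract_features(frames)
        indices = self.find_keyframes(feature_matrix)
        collected.append(feature_matrix[:, indices])
    self.keyframe_features = np.hstack(collected)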
Current status: we know RAM usage depends on the length of the video file. Given my proof of concept above, it looks like we can process long files by chunking the processing of frames and get a decent upper limit on RAM consumption. For the next milestone our priority is to support processing of video files that are a few minutes long, and right now we don't want to support really long files anyway, so we can assume the file length to be limited and hence the RAM usage to be limited too.
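Some rough arithmetic to back that up (assuming 30 fps, which I haven't verified for our test files): a 5-minute video is about 9,000 frames, but with 100-frame chunking at most 100 decoded frames plus the sampled PIL copies are alive at any time, so peak RAM should be dominated by the model and the per-chunk feature matrices rather than by video length.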
In today's call Aurora mentioned that, looking at the code we use for inference, trying out a GPU won't be worth it either. So we are parking all GPU-related tests for later as well.
This leaves us with compute optimized EC2s as the category of instances to try. One thing we can also check is that, since our cores and memory aren't used at full capacity, Kubernetes can successfully schedule multiple pods on the same node, getting us more value for money for every new node we provision.
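As a rough illustration of the bin packing, using the requests from the deployment manifest shared later in this thread: with each pod requesting 1 vCPU and 4000Mi, a c7g.4xlarge (16 vCPU, 32 GiB) could schedule roughly 7-8 such pods (memory becomes the binding constraint) before the node is full.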
Documentation on memory and CPU profiling is here: https://github.com/tattle-made/feluda/wiki/Optimization
I've selected some EC2s for the first round of tests. I've included the hourly and daily cost because we might scale the nodes up and down and might not need a large node to stay up throughout.
EC2 type | vCPU | Memory (GiB) | Hourly (USD) | Hourly (INR) | Daily (INR) | Monthly (INR) |
---|---|---|---|---|---|---|
c7g.large | 2 | 4 | 0.0491 | 4.078246 | 97.877904 | 2936.33712 |
c7g.xlarge | 4 | 8 | 0.1445 | 12.00217 | 288.05208 | 8641.5624 |
c7g.2xlarge | 8 | 16 | 0.289 | 24.00434 | 576.10416 | 17283.1248 |
c7g.4xlarge | 16 | 32 | 0.3926 | 32.609356 | 782.624544 | 23478.73632 |
c7g.16xlarge | 64 | 128 | 1.5706 | 130.454036 | 3130.896864 | 93926.90592 |
r7g.large | 2 | 16 | 0.0751 | 6.237806 | 149.707344 | 4491.22032 |
r7g.xlarge | 4 | 32 | 0.1502 | 12.475612 | 299.414688 | 8982.44064 |
r7g.4xlarge | 16 | 128 | 0.704 | 58.47424 | 1403.38176 | 42101.4528 |
@aatmanvaidya @duggalsu When we deploy the container to Kubernetes, we can specify the command it should run when launched. For the sake of this test I was thinking we can create scripts inside a /benchmark folder. The caveat is that our script should not exit but stay alive (think infinite loop), so that Kubernetes does not kill the container. The other reason I need the container to keep running is that after the test is run I'd like to ssh into it to get the output files.
So I was thinking that our benchmark scripts could be something like this:

script1.sh

    python test.py
    tail -f /dev/null

script2.sh

    python3 -m memray run -o vid_vec_rep_resnet.bin vid_vec_rep_resnet.py
    tail -f /dev/null
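A side note on the memray invocation: the vid_vec_rep_resnet.bin file it writes can be turned into a report afterwards with memray's flamegraph subcommand (python3 -m memray flamegraph vid_vec_rep_resnet.bin), which is another reason to keep the container alive until we've copied the output files out.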
So let's create appropriate scripts like these. Then we can deploy the container, change the command that is executed on container start, and run these tests in the cluster.
Sharing the Kubernetes deployment file for reference. We'll simply change the replica count and command to run different containers.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feluda-operator-vidvec
  labels:
    app.kubernetes.io/name: feluda-operator-vidvec
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: feluda-operator-vidvec
  template:
    metadata:
      labels:
        app.kubernetes.io/name: feluda-operator-vidvec
    spec:
      containers:
        - name: feluda-operator-vidvec
          image: tattletech/feluda-operator-vid-vec:f6bb56c
          imagePullPolicy: Always
          command: ["python"]
          args: ["test.py"]
          resources:
            requests:
              cpu: "1000m"
              memory: "4000Mi"
            limits:
              cpu: "4000m"
              memory: "8000Mi"
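For the memray test we'd override the startup command to launch the benchmark script instead, e.g. command: ["sh"] with args: ["./benchmark/script2.sh"]; the exact script path is my assumption and depends on where the /benchmark folder lands in the image.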
We'll rely on GitHub Actions to push new Docker images of our operators to Docker Hub. Reference implementation: https://github.com/tattle-made/feluda/blob/9f425587f93e02005554b496c059144c90e19f74/.github/workflows/prod-deploy.yml#L44-L50
We are charged hourly for EC2 instance usage, so once an instance is spun up we have no reason to shut it down immediately. We can run a few tests in one go within that hour and learn all we need before shutting it down. We then repeat these steps for all EC2 instances we care to test on.
The Dockerfiles for operators are now running successfully with this PR: https://github.com/tattle-made/feluda/pull/58. The feluda core Dockerfile has also been optimized.
Used the creosote package to check for unused dependencies: creosote --deps-file ./src/api/requirements.in

False flags:

Issues:

Removed:

    import sentence_transformers
    import skimage
Current Docker image sizes:

Image | Notes | Size |
---|---|---|
python:3.11-slim | base image | 131MB |
feluda-api | python:3.11-slim, original Dockerfile with ffmpeg | 2.88GB |
feluda-api | python:3.11-slim, removed ffmpeg | 2.36GB |
feluda-api | python:3.11-slim, removed tesseract packs | 2.25GB |
feluda-api | python:3.11-slim, removed multiple unused Python packages from feluda core | 1.53GB |
image-operator | | 1.3GB |
video-operator | | 1.66GB |
A trivial piece of feedback: use --no-cache-dir as an argument to pip install. Did a quick try and it brought the video-operator size down to 1.35 GB.
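Concretely, that's just changing the install line in each Dockerfile to something like RUN pip install --no-cache-dir -r requirements.txt, which stops pip from keeping downloaded wheels in the image layer.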
I also noticed that the largest thing in the Docker image is the torch library, around 800 MB. It doesn't seem like there's much we can do to reduce it. What is your opinion?
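One option worth checking, since we are benchmarking CPU instances anyway, is installing the CPU-only torch wheels (pip install torch --index-url https://download.pytorch.org/whl/cpu); they are much smaller than the default wheels because they skip the bundled CUDA libraries. I haven't verified the exact size difference for our pinned version.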
Further optimized Dockerfiles, with the pip no-cache optimization and with torch, torchvision, vim, and curl removed from feluda core:

Image | Size |
---|---|
feluda-indexer | 470MB |
feluda-reporter | 470MB |
feluda-api | 470MB |
image-operator | 1.08GB |
video-operator | 1.35GB |
Goal: offer an acceptable response time (let's assume < 5 minutes for now) for every possible scenario.

Scenarios:

Question to focus on: why do we get slow performance on multicore Intel machines (the c7i* family) when we increase the number of pod replicas (containers), especially when cores > 4? One thing to test is sketched below.
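One hypothesis worth testing (my assumption, not something we've confirmed): each torch-based operator process defaults to using all visible cores for intra-op parallelism, so several replicas on one node oversubscribe the CPU and contend with each other. Pinning the thread count per pod to roughly its CPU limit would rule this in or out:

import torch

# Hypothesis check: with N replicas on one node, N processes each spawning
# a thread per core oversubscribe the CPU. Pin threads to the pod's CPU
# limit (4 here, matching the 4000m limit in the deployment above).
torch.set_num_threads(4)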