@aatmanvaidya @duggalsu Here's the list of EC2 instance types offered by AWS: https://aws.amazon.com/ec2/pricing/on-demand/. It lets you choose between memory optimized, storage optimized, and compute optimized instances, and also between core counts and RAM sizes. Review it and make a list for me of which instance types are worth evaluating.
I had an issue setting up feluda on my machine:

INFO: pip is looking at multiple versions of [...] to determine which version is compatible with other requirements. This could take a while.
ERROR: Cannot install -r requirements.txt (line 13) and urllib3==2.0.7 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested urllib3==2.0.7
    botocore 1.34.19 depends on urllib3<1.27 and >=1.25.4; python_version < "3.10"

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts
Flagging this as something that might trip us up. This was with Python 3.9.18 on Ubuntu 20.04.2 LTS, using commit c35079.
This shouldn't have happened, but it is happening because we have not upgraded boto3 (and other packages) to the latest versions in feluda core. So there are dependency mismatches when generating requirements.txt for operators and core, and I've been manually downgrading botocore across requirements files and regenerating them to maintain compatibility.
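In the meantime, the workaround I'd suggest (an assumption on my part, not something applied to the repo yet) is to relax the top-level pin to a range botocore 1.34.19 can satisfy, e.g. urllib3>=1.25.4,<1.27 in the requirements.in files, and regenerate requirements.txt from there until boto3/botocore are upgraded in feluda core.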
A note on trying to push the operator to its limit.
I removed the check around file size and tried processing a 1-hour-long video that was 800 MB in size. The Python process exits after 10 or so seconds. My rudimentary observation of htop tells me that all 12 cores don't run at full capacity, but memory usage increases with time and eventually the process runs out of memory. So I think in the short run, keeping the file size limit might be useful to prevent large files from causing a crash; a sketch of such a guard is below.
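A minimal sketch of that guard, assuming we can stat the file before handing it to the operator (the limit value and the function name are hypothetical):

import os

MAX_FILE_SIZE_MB = 100  # hypothetical limit; tune to the node's available memory

def check_file_size(path: str) -> None:
    """Reject files likely to exhaust memory before we start decoding frames."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    if size_mb > MAX_FILE_SIZE_MB:
        raise ValueError(
            f"{path} is {size_mb:.0f} MB, above the {MAX_FILE_SIZE_MB} MB limit"
        )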
A neat thing: I eventually got the operator to run on this 1-hour video without running into an out-of-memory error! Caveat: I just got the operator to run; I can't say anything yet about the search-result implications of it.
I figured out that the cause of the out-of-memory error was this function:
def extract_frames(self, v):
    # print("extracting frames")
    images = []
    for i in range(self.n_frames):
        success, image = v.read()
        if image is None:
            continue
        else:
            if i % self.sampling_rate == 0:
                images.append(Image.fromarray(image))
    # print("extracted frames")
    return images
Every sampled frame of the video is accumulated in the images list before the function returns, so memory usage grows with the length of the video. Hence the out-of-memory error.
I tried a rudimentary trick: convert this into a generator that yields 100 frames at a time:
def extract_frames(self, v):
    # Yield sampled frames in chunks of 100 so the whole video is never
    # held in memory at once.
    chunk_size = 100
    for start in range(0, self.n_frames, chunk_size):
        images = []
        for offset in range(min(chunk_size, self.n_frames - start)):
            success, image = v.read()
            if image is None:
                continue
            # Sample against the absolute frame index, not the chunk offset.
            if (start + offset) % self.sampling_rate == 0:
                images.append(Image.fromarray(image))
        yield images
and the corresponding change in the analyze function to consume this generator:
def analyze(self, video):
    # Process the video chunk by chunk. Note that each iteration currently
    # overwrites the keyframes computed for the previous chunk.
    for frames in self.extract_frames(video):
        feature_matrix = self.extract_features(frames)
        self.keyframe_indices = self.find_keyframes(feature_matrix)
        self.keyframe_features = feature_matrix[:, self.keyframe_indices]
Result: the function took 625.5453 seconds to run.
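Related to the caveat above about search-result implications: the chunked analyze keeps only the last chunk's keyframes, since each iteration overwrites them. A sketch of one way to accumulate keyframes across chunks instead; it assumes extract_features returns a NumPy matrix with one column per frame (matching the column indexing above) and is untested against the real operator:

import numpy as np

def analyze(self, video):
    # Collect keyframe features from every chunk instead of keeping
    # only the last chunk's result.
    collected = []
    for frames in self.extract_frames(video):
        if not frames:  # a chunk can be empty if frame reads fail
            continue
        feature_matrix = self.extract_features(frames)
        indices = self.find_keyframes(feature_matrix)
        collected.append(feature_matrix[:, indices])
    self.keyframe_features = np.hstack(collected)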
Current status: we know RAM usage depends on the length of the video file. Given my proof of concept above, it looks like we can process long files by chunking the processing of frames and get a decent upper limit on RAM consumption. For the next milestone our priority is to support processing of video files that are a few minutes long, and right now we don't want to support really long files anyway, so we can assume the file length to be limited and hence the RAM usage to be limited too.
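Some rough arithmetic to back that up (assuming 30 fps, which I haven't verified for our test files): a 5-minute video is about 9,000 frames, but with 100-frame chunking at most 100 decoded frames plus the sampled PIL copies are alive at any time, so peak RAM should be dominated by the model and the per-chunk feature matrices rather than by video length.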
In today's call Aurora mentioned that, looking at the code we use for inference, trying out a GPU won't be worth it either. So we are parking all GPU-related tests for later as well.
This leaves us with compute optimized EC2s as the category of instances to try. One thing we can also check is that, since our cores and memory aren't used at full capacity, Kubernetes can successfully schedule multiple pods on the same node, getting us more value for money for every new node we provision.
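As a rough illustration of the bin packing, using the requests from the deployment manifest shared later in this thread: with each pod requesting 1 vCPU and 4000Mi, a c7g.4xlarge (16 vCPU, 32 GiB) could schedule roughly 7-8 such pods (memory becomes the binding constraint) before the node is full.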
Documentation on memory and CPU profiling is here: https://github.com/tattle-made/feluda/wiki/Optimization
I've selected some EC2s for the first round of tests. I've included the hourly and daily cost because we might scale the nodes up and down and might not need a large node to stay up throughout.
EC2 type | vCPU | Memory (GiB) | Hourly (USD) | Hourly (INR) | Daily (INR) | Monthly (INR) |
---|---|---|---|---|---|---|
c7g.large | 2 | 4 | 0.0491 | 4.078246 | 97.877904 | 2936.33712 |
c7g.xlarge | 4 | 8 | 0.1445 | 12.00217 | 288.05208 | 8641.5624 |
c7g.2xlarge | 8 | 16 | 0.289 | 24.00434 | 576.10416 | 17283.1248 |
c7g.4xlarge | 16 | 32 | 0.3926 | 32.609356 | 782.624544 | 23478.73632 |
c7g.16xlarge | 64 | 128 | 1.5706 | 130.454036 | 3130.896864 | 93926.90592 |
r7g.large | 2 | 16 | 0.0751 | 6.237806 | 149.707344 | 4491.22032 |
r7g.xlarge | 4 | 32 | 0.1502 | 12.475612 | 299.414688 | 8982.44064 |
r7g.4xlarge | 16 | 128 | 0.704 | 58.47424 | 1403.38176 | 42101.4528 |
@aatmanvaidya @duggalsu When we deploy the container to Kubernetes, we can specify the command it should run when launched. For the sake of this test I was thinking we can create scripts inside a /benchmark folder. The caveat is that our script should not exit but stay alive (think infinite loop), so that Kubernetes does not kill the container. The other reason I need the container to keep running is that after the test is run I'd like to ssh into it to get the output files.
So I was thinking that our benchmark scripts could be something like this:

script1.sh

    python test.py
    tail -f /dev/null

script2.sh

    python3 -m memray run -o vid_vec_rep_resnet.bin vid_vec_rep_resnet.py
    tail -f /dev/null
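A side note on the memray invocation: the vid_vec_rep_resnet.bin file it writes can be turned into a report afterwards with memray's flamegraph subcommand (python3 -m memray flamegraph vid_vec_rep_resnet.bin), which is another reason to keep the container alive until we've copied the output files out.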
So let's create appropriate scripts like these. Then we can deploy the container, change the command that is executed on container start, and run these tests in the cluster.
Sharing the Kubernetes deployment file for reference. We'll simply change the replica count and command to run different containers.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: feluda-operator-vidvec
  labels:
    app.kubernetes.io/name: feluda-operator-vidvec
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: feluda-operator-vidvec
  template:
    metadata:
      labels:
        app.kubernetes.io/name: feluda-operator-vidvec
    spec:
      containers:
        - name: feluda-operator-vidvec
          image: tattletech/feluda-operator-vid-vec:f6bb56c
          imagePullPolicy: Always
          command: ["python"]
          args: ["test.py"]
          resources:
            requests:
              cpu: "1000m"
              memory: "4000Mi"
            limits:
              cpu: "4000m"
              memory: "8000Mi"
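For the memray test we'd override the startup command to launch the benchmark script instead, e.g. command: ["sh"] with args: ["./benchmark/script2.sh"]; the exact script path is my assumption and depends on where the /benchmark folder lands in the image.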
We'll rely on GitHub Actions to push new Docker images of our operators to Docker Hub. Reference implementation: https://github.com/tattle-made/feluda/blob/9f425587f93e02005554b496c059144c90e19f74/.github/workflows/prod-deploy.yml#L44-L50
We are charged hourly for EC2 instance usage, so once an instance is spun up we have no reason to shut it down immediately. We can run a few tests in one go within that hour and learn all we need before shutting it down. We then repeat these steps for all EC2 instances we care to test on.
The Dockerfiles for operators are now running successfully with this PR: https://github.com/tattle-made/feluda/pull/58. The feluda core Dockerfile has also been optimized.
Used the creosote package to check for unused dependencies: creosote --deps-file ./src/api/requirements.in

False flags:

Issues:

Removed:

    import sentence_transformers
    import skimage
Current Docker image sizes:

Image | Notes | Size |
---|---|---|
python:3.11-slim | base image | 131MB |
feluda-api | python:3.11-slim, original Dockerfile with ffmpeg | 2.88GB |
feluda-api | python:3.11-slim, removed ffmpeg | 2.36GB |
feluda-api | python:3.11-slim, removed tesseract packs | 2.25GB |
feluda-api | python:3.11-slim, removed multiple unused Python packages from feluda core | 1.53GB |
image-operator | | 1.3GB |
video-operator | | 1.66GB |
A trivial piece of feedback: use --no-cache-dir as an argument to pip install. Did a quick try and it brought the video-operator size down to 1.35 GB.
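Concretely, that's just changing the install line in each Dockerfile to something like RUN pip install --no-cache-dir -r requirements.txt, which stops pip from keeping downloaded wheels in the image layer.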
I also noticed that the largest thing in the Docker image is the torch library, around 800 MB. It doesn't seem like there's much we can do to reduce it. What is your opinion?
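One option worth checking, since we are benchmarking CPU instances anyway, is installing the CPU-only torch wheels (pip install torch --index-url https://download.pytorch.org/whl/cpu); they are much smaller than the default wheels because they skip the bundled CUDA libraries. I haven't verified the exact size difference for our pinned version.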
Further optimized Dockerfiles, with the pip no-cache optimization and with torch, torchvision, vim, and curl removed from feluda core:

Image | Size |
---|---|
feluda-indexer | 470MB |
feluda-reporter | 470MB |
feluda-api | 470MB |
image-operator | 1.08GB |
video-operator | 1.35GB |
Goal: offer an acceptable response time (let's assume < 5 minutes for now) for every possible scenario.

Scenarios:

Question to focus on: why do we get slow performance on multicore Intel machines (the c7i* family) when we increase the number of pod replicas (containers), especially when cores > 4? One thing to test is sketched below.
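One hypothesis worth testing (my assumption, not something we've confirmed): each torch-based operator process defaults to using all visible cores for intra-op parallelism, so several replicas on one node oversubscribe the CPU and contend with each other. Pinning the thread count per pod to roughly its CPU limit would rule this in or out:

import torch

# Hypothesis check: with N replicas on one node, N processes each spawning
# a thread per core oversubscribe the CPU. Pin threads to the pod's CPU
# limit (4 here, matching the 4000m limit in the deployment above).
torch.set_num_threads(4)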