Tattle should be able to monitor system throughput

tattle-made / DAU

MCA Tipline for Deepfakes

GNU General Public License v3.0

Tattle should be able to monitor system throughput #15

Open dennyabrain opened 5 months ago

dennyabrain commented 5 months ago

Tickets

- [ ] Simulate Load
- [ ] Visualize Performance
- [ ] Horizontal Scaling (Automatic/Manual)
- [ ] Vertical Scaling (Automatic/Manual)

dennyabrain commented 5 months ago

Identified Bottlenecks

If our response time causes their requests to time out, they will throttle their outgoing requests (think exponential backoff), which will affect the responsiveness of the tipline for users. We'll have to scale up our web server and Postgres server accordingly (needs strategy). The bottlenecks here are the web server's capacity to process incoming requests and the write concurrency of the database.
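To make the backoff concern concrete, here is a minimal sketch of how a sender's retry delay could grow after repeated timeouts. The function name `backoff_delays` and every number (base delay, factor, cap) are assumptions for illustration; we don't know the sender's actual retry policy.

```python
def backoff_delays(base_seconds, factor, max_seconds, attempts):
    """Return the wait before each retry under capped exponential backoff.

    All parameters are hypothetical; the real sender's policy is unknown.
    """
    return [min(base_seconds * factor ** i, max_seconds) for i in range(attempts)]


if __name__ == "__main__":
    # With a 1 s base, doubling each time and a 30 s cap, six consecutive
    # timeouts push the sender's wait from 1 s up to the cap.
    print(backoff_delays(1.0, 2.0, 30.0, 6))
```

The point is that a few timeouts on our side can quickly translate into tens of seconds of delay for tipline users, even after our servers recover.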

dennyabrain commented 5 months ago

Deliverable

Throughput for every operator on a particular EC2 instance.

Example Report for image_vec_resenet:

| EC2 instance | Throughput | Pricing (hourly)[^1] |
| --- | --- | --- |
| t4g.2xlarge (8 core, 32 GB RAM) | | $0.1792 |
| r6gd.medium (1 core, 8 GB RAM) | | $0.0374 |

DevOps will set up various EC2 instances and provide endpoints for load testing.
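A rough sketch of how the throughput column could be filled in: run a batch of payloads through an operator with a fixed number of concurrent workers and divide completed items by wall-clock time. `measure_throughput` and `fake_operator` are hypothetical names; `fake_operator` only sleeps and stands in for a real operator call or an HTTP request to one of the load-testing endpoints.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(task, payloads, workers=4):
    """Run task over payloads concurrently and report items per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(task, payloads))
    elapsed = time.perf_counter() - start
    return {
        "completed": len(results),
        "seconds": elapsed,
        "per_second": len(results) / elapsed if elapsed > 0 else float("inf"),
    }


def fake_operator(payload):
    """Hypothetical stand-in for an operator; replace with the real call."""
    time.sleep(0.01)  # pretend to do 10 ms of work
    return len(payload)


if __name__ == "__main__":
    report = measure_throughput(fake_operator, ["sample"] * 100, workers=8)
    print(report)
```

Threads suit I/O-bound calls such as HTTP requests to an endpoint. For CPU-bound operators called in-process, a process pool or an external load generator would likely give more representative numbers.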

[^1]: Pricing Source : https://aws.amazon.com/ec2/pricing/on-demand/

dennyabrain commented 5 months ago

Optimizations and Fine Tuning

We need baseline numbers for how well our operators currently perform. We also need to find ways to build the most optimized Docker containers for the operators. This might include fine-tuning our dependencies: for instance, using PyTorch compiled for CPU vs. GPU, or ensuring ffmpeg is configured to use all cores of a machine.
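One way to keep baseline numbers comparable is to record the environment they were measured in. The sketch below is an assumption about what might be worth capturing, not an existing part of the codebase; `describe_environment` is a hypothetical helper. It reports the core count, whether an `ffmpeg` binary is on the `PATH`, and, if PyTorch is installed, its version, CUDA availability, and intra-op thread count.

```python
import os
import shutil


def describe_environment():
    """Collect basic facts about the machine and the installed dependencies."""
    info = {
        "cpu_count": os.cpu_count(),
        "ffmpeg_path": shutil.which("ffmpeg"),  # None if ffmpeg is not on PATH
    }
    try:
        import torch
    except ImportError:
        info["torch"] = None  # PyTorch is not installed in this environment
    else:
        info["torch"] = {
            "version": torch.__version__,
            "cuda_available": torch.cuda.is_available(),
            "num_threads": torch.get_num_threads(),
        }
    return info


if __name__ == "__main__":
    print(describe_environment())
```

Running this inside each candidate container image would show, for example, whether a GPU build of PyTorch is being shipped to a CPU-only instance.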

dennyabrain commented 5 months ago

Questions to answer:

- What is better for our operator(s): a large multicore machine or small single-core machines?
- What do our operators depend on (PyTorch, ffmpeg, Docker image), and how can we tune the performance of those dependencies?
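For the first question, one possible way to compare instance types is items processed per dollar, combining measured throughput with the hourly price. This is a sketch under assumptions: `items_per_dollar` is a hypothetical helper, and the throughput values in the usage example are placeholders, not measurements. Only the hourly prices come from the example report above.

```python
def items_per_dollar(items_per_second, hourly_price_usd):
    """Items processed per US dollar at a sustained throughput."""
    return items_per_second * 3600 / hourly_price_usd


if __name__ == "__main__":
    # Placeholder throughputs (items/second); replace with load-test results.
    candidates = {
        "t4g.2xlarge": {"items_per_second": 8.0, "hourly_price_usd": 0.1792},
        "r6gd.medium": {"items_per_second": 1.0, "hourly_price_usd": 0.0374},
    }
    for name, spec in candidates.items():
        value = items_per_dollar(spec["items_per_second"], spec["hourly_price_usd"])
        print(f"{name}: {value:.0f} items per dollar")
```

This metric ignores latency and memory limits, so it would complement the per-instance throughput numbers rather than replace them.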