Closed by tskir 1 week ago
As a reminder, the current worker VM type is n2d-highmem-4, with 4 cores and 32 GB of RAM. This is how those resources were used:
There is also a family of “ultramem” VMs, which provide a large amount of RAM per CPU core. I will also take a brief look at them to see whether they could be a good, cost-effective option.
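For a quick first pass, comparing RAM per vCPU across machine families is useful. A minimal sketch follows; the n2d-highmem-4 figures come from this issue, while the other specs are illustrative assumptions that should be verified against the current GCP machine-type documentation:

```python
# Rough RAM-per-vCPU comparison across a few GCP machine types.
# Only n2d-highmem-4 (4 vCPUs, 32 GB) is stated in this issue;
# the other specs are assumptions for illustration.
machine_types = {
    "n2d-highmem-4": {"vcpus": 4, "ram_gb": 32},     # current worker type
    "n2d-highmem-8": {"vcpus": 8, "ram_gb": 64},     # assumed spec
    "m1-ultramem-40": {"vcpus": 40, "ram_gb": 961},  # assumed spec
}

for name, spec in machine_types.items():
    ratio = spec["ram_gb"] / spec["vcpus"]
    print(f"{name}: {ratio:.1f} GB RAM per vCPU")
```

If jobs are memory-bound rather than CPU-bound, a higher GB-per-vCPU ratio means fewer idle cores being paid for.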
In the v5 run, the runtime limit for a single job was 3600s. I initially suspected that some jobs failed due to the time limit (it was difficult to tell, because RAM and time failures don't produce any specific log entries).
In the v6 run, the runtime limit was raised to 7200s, and no jobs failed due to the time limit. Moreover, the benchmarking logs show that the longest job in the v6 run took 1911s in total, so the time limit doesn't appear to be an issue.
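The headroom implied by those numbers can be sanity-checked directly; this uses only the figures stated above:

```python
# Time-limit headroom check using the v6 figures from this issue.
time_limit_s = 7200    # v6 runtime limit
longest_job_s = 1911   # longest observed job in the v6 run

headroom = time_limit_s / longest_job_s
fraction_used = longest_job_s / time_limit_s
print(f"Longest job used {fraction_used:.0%} of the limit "
      f"({headroom:.1f}x headroom)")
```

Roughly 3.8x headroom, which supports the conclusion that the time limit is not the failure mode.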
Note to self: see also Storage Class A operations, currently around 350 per row in each run; this can be optimised.
This issue is part of the https://github.com/opentargets/issues/issues/3302 epic.
The goal of this issue is to configure VM types and task submission parameters so that tasks don't fail due to RAM constraints, while at the same time not wasting resources.
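One way to frame that trade-off is to derive the per-task RAM request from observed peak usage plus a safety margin, then pick the smallest machine type that fits. This is a minimal sketch under assumed names, candidate specs, and margin values, not the actual submission code:

```python
# Sketch: pick the smallest machine type whose RAM covers per-task peak
# usage plus a safety margin. Candidate specs and the margin are assumptions.
CANDIDATES = [  # (machine type, vCPUs, RAM in GB), smallest first
    ("n2d-highmem-2", 2, 16),
    ("n2d-highmem-4", 4, 32),
    ("n2d-highmem-8", 8, 64),
]

def pick_machine_type(peak_ram_gb: float, margin: float = 1.2) -> str:
    """Return the smallest candidate whose RAM covers peak usage * margin."""
    required = peak_ram_gb * margin
    for name, _vcpus, ram_gb in CANDIDATES:
        if ram_gb >= required:
            return name
    raise ValueError(f"No candidate has {required:.1f} GB of RAM")

print(pick_machine_type(24.0))  # 24 GB peak * 1.2 = 28.8 GB -> n2d-highmem-4
```

The margin absorbs run-to-run variance in peak RAM; too small a margin reproduces the OOM failures, too large a margin wastes money, so it should be tuned against the benchmarking data from the v5/v6 runs.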