openfoodfacts / openfoodfacts-infrastructure

Where we collaboratively plan and maintain the infrastructure of Open Food Facts

Proposal: move Robotoff services to a distinct machine #218

Open · raphael0202 opened 1 year ago

raphael0202 commented 1 year ago

We're currently running out of storage space on OVH2, due to Robotoff. We store all logo embeddings in PostgreSQL (7M currently, and many are still missing) as well as all image embeddings (7M images, but only ~1M are in the DB so far). The Elasticsearch indexes used for approximate nearest neighbor (ANN) search also take up a lot of disk space.
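For scale, here is a rough back-of-envelope estimate of what the embeddings alone weigh. The 512-dimensional float32 vector format is an assumption on my part, and PostgreSQL row/index overhead is ignored, so real usage will be higher:

```python
# Rough storage estimate for the embeddings alone.
# Assumption (not stated above): 512-dim float32 vectors; PostgreSQL
# row and index overhead is ignored, so actual usage will be higher.

EMBEDDING_DIM = 512   # assumed CLIP-style embedding size
BYTES_PER_FLOAT = 4   # float32

def raw_size_gb(n_vectors: int) -> float:
    """Raw payload size in GB for n_vectors embeddings."""
    return n_vectors * EMBEDDING_DIM * BYTES_PER_FLOAT / 1e9

print(f"7M logo embeddings:  ~{raw_size_gb(7_000_000):.1f} GB")  # ~14.3 GB
print(f"7M image embeddings: ~{raw_size_gb(7_000_000):.1f} GB")  # ~14.3 GB
```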

We would also like to increase the number of Robotoff workers to make processing faster (workers are responsible for performing actions after product updates or image uploads), but we cannot, as more workers would push the server load too high.

Storage, CPU and memory needs are expected to keep growing for Robotoff, due to the introduction of new models and detections. However, we don't have many options on OVH2: we cannot increase the number of CPU cores (?). We also cannot increase storage, as we zfs-sync OVH2 to OVH3, and OVH1 doesn't have much space left either.

I suggest we move Robotoff to a distinct server, either rented or hosted in an OVH datacenter.

raphael0202 commented 1 year ago

Full Slack thread discussing this issue: https://openfoodfacts.slack.com/archives/C1FPYCWM7/p1683272552984549

Here is a summary of the discussion so far: we decided to purchase a new server that will be hosted by Free. This server will be powerful enough to host both Robotoff and the new search project. We can add 1 or 2 GPUs to the server to speed up ML inference. With the search project grant, we have ~10k of budget for infrastructure, so we can use it to buy a high-end server (the GPUs will be bought with a different budget).

Requirements:

We don't necessarily need to use Proxmox + QEMU; we can run bare metal here for an additional performance gain. Besides, only the PostgreSQL database is critical for Robotoff; the rest of the data can easily be regenerated (see the backup sketch below).
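Since the PostgreSQL database is the only data that really needs protecting, a scheduled dump would be enough. A minimal sketch, where the database name and output path are placeholders, not Robotoff's actual configuration:

```python
# Minimal sketch of a scheduled PostgreSQL dump. The database name and
# output path are hypothetical, not Robotoff's real configuration.
import subprocess
from datetime import date

DB_NAME = "robotoff"  # hypothetical database name
OUT_PATH = f"/backups/robotoff-{date.today():%Y%m%d}.dump"

subprocess.run(
    ["pg_dump", "--format=custom", f"--file={OUT_PATH}", DB_NAME],
    check=True,  # raise if pg_dump fails
)
```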

The T4 GPU looks promising (better performance than the P4) at 1/4th the power consumption of a V100 (which is also more expensive). However, T4 GPUs only have 16 GB of VRAM; I will run benchmarks to see whether this is an issue (as we serve multiple ML models). Having two T4 GPUs instead of a single one is also an option.
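To check whether all the served models fit into a single T4's 16 GB, something along these lines could be used during the benchmark. This is a sketch assuming PyTorch with CUDA; `load_model()` is a hypothetical placeholder, not a Robotoff function:

```python
# Sketch: report VRAM usage after loading each served model, to see
# whether they all fit into a T4's 16 GB. Assumes PyTorch with CUDA;
# load_model() is a hypothetical placeholder.
import torch

def vram_report(label: str) -> None:
    free, total = torch.cuda.mem_get_info()  # both values in bytes
    print(f"{label}: {(total - free) / 1e9:.2f} GB used / {total / 1e9:.2f} GB total")

vram_report("baseline")
# for name in ["universal-logo-detector", "nutrition-table", "nutriscore", "clip"]:
#     model = load_model(name)  # hypothetical loader
#     vram_report(f"after {name}")
```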

alexgarel commented 1 year ago

@raphael0202 for disks, I would say we can have NVMe for operations (a few TB) + SATA for backups / replication (a few TB as well).

raphael0202 commented 1 year ago

Model inference latency benchmark

Performed on a GCP n1-highmem-4 instance, with 1 NVIDIA T4 GPU.

CPU

- universal-logo-detector: 10 554 ms (100 requests, 1055.402490 s total)
- nutrition table: 10 268 ms (100 requests, 1026.825259 s total)
- nutriscore: N/A (no nutriscore model request)
- clip: 95.68 ms (79 requests, 326 inferences, 31.193658 s total)

GPU

- universal-logo-detector: 1 890 ms (94 requests, 177.694058 s total): x5.6 speedup
- nutrition table: 1 555 ms (94 requests, 146.187220 s total): x6.6 speedup
- nutriscore: 1 455 ms (22 requests, 32.022006 s total)
- clip: 4.795 ms (74 requests, 302 inferences, 1.448200 s total): x20 speedup

GPU VRAM consumption was stable at ~6.8 GB.
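For reference, a minimal sketch of how such a latency benchmark can be run. The endpoint URL and payload are placeholders, not Robotoff's actual model-serving API:

```python
# Minimal latency benchmark sketch. URL and payload are placeholders,
# not Robotoff's real model-serving API.
import statistics
import time

import requests

URL = "http://localhost:8501/v1/models/clip:predict"  # hypothetical endpoint
PAYLOAD = {"instances": []}  # fill with model-specific input

latencies_ms = []
for _ in range(100):
    start = time.perf_counter()
    requests.post(URL, json=PAYLOAD, timeout=60).raise_for_status()
    latencies_ms.append((time.perf_counter() - start) * 1000)  # ms

mean = statistics.mean(latencies_ms)
print(f"mean latency: {mean:.1f} ms over {len(latencies_ms)} requests")
```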