Closed by raphael0202 4 weeks ago
Full Slack thread discussing this issue: https://openfoodfacts.slack.com/archives/C1FPYCWM7/p1683272552984549
Here is a summary of the discussion so far: We decided to purchase a new server that will be hosted by Free. This server will be powerful enough to host Robotoff and the new search project. We can add 1 or 2 GPUs to the server to speed up ML inference. With the search project grant, we have ~10k of budget for infrastructure, which we can use to buy a high-end server (the GPU will be bought with a different budget).
Requirements:
We don't necessarily need to use Proxmox + QEMU; we can run on bare metal here for an additional performance gain. Besides, only the PostgreSQL database is important for Robotoff; the rest of the data can be regenerated easily.
The T4 GPU seems promising: it performs better than the P4 and draws 1/4th the power of V100 GPUs (which are also more expensive). However, T4 GPUs only have 16 GB of memory, so I will run benchmarks to see whether this is an issue (as we serve multiple ML models). Having 2 T4 GPUs instead of a single one is also an option.
@raphael0202 for disks, I would say we can have NVMe for operations (a few TB) + SATA for backups / replication (also a few TB).
Benchmark performed on a GCP n1-highmem-4 instance with 1 Nvidia T4 GPU.
universal-logo-detector: 10 554 ms (100 requests, 1055.402490s total)
nutrition table: 10 268 ms (100 requests, 1026.825259s total)
nutriscore: N/A (no nutriscore model request)
clip: 95.68 ms (79 requests, 326 inference count, 31.193658s total)

universal-logo-detector: 1 890 ms (94 requests, 177.694058s total): x5.6 speedup
nutrition table: 1 555 ms (94 requests, 146.187220s total): x6.6 speedup
nutriscore: 1 455 ms (22 requests, 32.022006s total)
clip: 4.795 ms (74 requests, 302 inference count, 1.448200s total): x20 speedup
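For anyone who wants to double-check the speedup factors, here is a minimal sketch that recomputes the mean latencies and ratios from the counts and totals reported above. The names `baseline` and `accelerated` are just hypothetical labels for the first and second set of figures. For clip the divisor is the inference count rather than the request count, since that is what reproduces the reported per-inference latency.

```python
# Counts and total inference times (seconds) copied from the two benchmark runs above.
# "baseline" / "accelerated" are hypothetical labels for the first and second run.
baseline = {
    "universal-logo-detector": (100, 1055.402490),
    "nutrition table": (100, 1026.825259),
    "clip": (326, 31.193658),  # inference count, not request count
}
accelerated = {
    "universal-logo-detector": (94, 177.694058),
    "nutrition table": (94, 146.187220),
    "clip": (302, 1.448200),  # inference count, not request count
}


def mean_latency_ms(count: int, total_seconds: float) -> float:
    """Mean latency in milliseconds for `count` calls taking `total_seconds` overall."""
    return total_seconds / count * 1000.0


def speedups(slow: dict, fast: dict) -> dict:
    """Ratio of mean latencies for every model present in both runs."""
    return {
        name: mean_latency_ms(*slow[name]) / mean_latency_ms(*fast[name])
        for name in slow
        if name in fast
    }


for name, factor in speedups(baseline, accelerated).items():
    print(f"{name}: x{factor:.1f}")
```

nutriscore is left out because the first run has no requests for it, so there is nothing to compare against.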
GPU VRAM consumption was stable at ~6.8 GB.
I think Robotoff is now running on the moji server, which means this issue could now be closed... cc @raphael0202
Yes indeed 👍
We're currently running out of storage space on OVH2 because of Robotoff. We store all logo embeddings in PostgreSQL (7M currently, but many are missing), along with all image embeddings (7M images, but only ~1M are in the DB now). The Elasticsearch indexes for approximate nearest neighbors (ANN) also take up a lot of disk space.
We would also like to increase the number of workers on Robotoff to make processing faster (workers are responsible for performing actions after product updates or image uploads), but we cannot, as more workers would push the server load too high.
Storage, CPU and memory needs are expected to increase for Robotoff with the introduction of new models and detections. However, we don't have many options on OVH2: we cannot increase the number of CPU cores (?). We also cannot increase storage, as we zfs-sync OVH2 to OVH3 and OVH1 doesn't have much space left.
I suggest we move Robotoff to a distinct server, either rented or hosted at an OVH datacenter.
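To get a feel for the order of magnitude, here is a rough sketch that estimates the raw size of the embedding vectors alone. The embedding dimension (512) and the 4-byte float32 encoding are assumptions for illustration only; neither is stated in this thread. PostgreSQL row overhead and the Elasticsearch ANN indexes are not counted, so real usage would be higher.

```python
def raw_embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw size of `num_vectors` dense vectors, ignoring any database or index overhead."""
    return num_vectors * dim * bytes_per_value


# Hypothetical dimension: adjust to the actual model output size.
ASSUMED_DIM = 512

# Vector counts taken from the figures quoted above.
counts = [
    ("logo embeddings", 7_000_000),
    ("image embeddings (all images)", 7_000_000),
    ("image embeddings (currently in DB)", 1_000_000),
]

for label, count in counts:
    gib = raw_embedding_bytes(count, ASSUMED_DIM) / 2**30
    print(f"{label}: ~{gib:.1f} GiB")
```

Changing `ASSUMED_DIM` or `bytes_per_value` scales the estimate linearly.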