vladmandic / face-api

FaceAPI: AI-powered Face Detection & Rotation Tracking, Face Description & Recognition, Age & Gender & Emotion Prediction for Browser and NodeJS using TensorFlow/JS
https://vladmandic.github.io/face-api/demo/webcam.html
MIT License
824 stars · 149 forks

node multiprocess on GPU #179

Closed · smerchkz closed this issue 11 months ago

smerchkz commented 11 months ago

Hi, Vlad

thanks for the library!

Is it possible to run Node multiprocess using the GPU? I took your code from https://github.com/vladmandic/face-api/blob/master/demo/node-multiprocess.js and pasted

```js
const tf = require("@tensorflow/tfjs-node-gpu");
const faceapi = require("@vladmandic/face-api/dist/face-api.node-gpu.js");
```

instead of your lines

```js
const tf = require("@tensorflow/tfjs-node");
const faceapi = require("../dist/face-api.node.js");
```

Then, with

```js
const numWorkers = 1;
```

it works perfectly; my 1080 Ti GPU load is 5-10%.

But if I use

```js
const numWorkers = 2;
```

the video card load jumps to 100%, and it seems to go crazy.

I get about the same result when I don't use workers as in your code but instead run a standalone Node.js server and daemonize it with pm2 using this config:

```js
exec_mode: "cluster", instances: 2
```

Maybe I need to set some environment variables for TF so the GPU can work properly with many Node processes? Can you help me get high load on the GPU?
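(For anyone reading later: the native TF bindings do read tuning flags from environment variables, which can be set per process at launch. The flags below are real TensorFlow/CUDA variables, but using them with this demo is my own illustration, not something the repo documents.)

```shell
# quiet TF's C++ logging for this process
TF_CPP_MIN_LOG_LEVEL=2 node node-multiprocess.js

# pin a process to a single GPU on a multi-GPU machine
CUDA_VISIBLE_DEVICES=0 node node-multiprocess.js
```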

smerchkz commented 11 months ago

My problem is that I can't achieve 90-100% utilization of the video card; a single Node process utilizes 20%, not more.

vladmandic commented 11 months ago

when you're running with a single process, the gpu never gets saturated - the actual processing on the gpu is small compared to getting data in and out of the gpu, so your gpu utilization stays low at ~10%.

but when you run with multiprocessing, it becomes a question of tensorflow-gpu's global locks - it only allows one process to be active on the gpu at a time, and even if that is just 10% of overall time, it means another process will wait 90% of the time for the lock to be released, and that wait shows up as load because it happens deep inside the library.

with tensorflow-cpu there is no such issue, since each process hits a different core without much locking.

i don't see a clean way to do this, since tfjs-node-gpu does not expose the deep internals of tensorflow-gpu that would, in theory, let the app perform busy checks on its own.

i really don't see a clean solution here.

smerchkz commented 11 months ago

Today I found a solution that allows multiprocessing on the GPU: in pm2's ecosystem config I added the constant TF_FORCE_GPU_ALLOW_GROWTH: true. My 1080 Ti has 11 GB; I start 4 processes in cluster mode, each process takes about ~2 GB of video RAM, and together the 4 processes give ~90% load on the GPU. I can see this with `watch -n 1 nvidia-smi`.

By default node-gpu uses TF_FORCE_GPU_ALLOW_GROWTH=false, so the first process takes all the video card RAM and there is no longer enough memory for the second process; the GPU goes crazy and does not give quick calculations, because both processes fight over the same memory.
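(The same flag can also be set without pm2, directly in the shell at launch; `app.js` here is just the entry point named in the pm2 config below.)

```shell
# opt in to incremental GPU memory allocation for a single process
TF_FORCE_GPU_ALLOW_GROWTH=true node app.js
```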

```js
module.exports = {
  apps: [
    {
      name: ".backend.node",
      script: "app.js",
      watch: true,
      ignore_watch: ["node_modules"],
      watch_options: { followSymlinks: false },
      autorestart: true,
      //max_memory_restart: "2G", // for CPU
      max_memory_restart: "4G", // for GPU
      wait_ready: true,
      force: false,
      env: {
        PORT: 8081,
        HOSTNAME: "127.0.0.1",
        TF_FORCE_GPU_ALLOW_GROWTH: true,
      },
      //exec_mode: "fork", // or
      exec_mode: "cluster",
      instances: 4,
      //instances: "max",
      NODE_ENV: "production",
    },
  ],
};
```
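(Usage, assuming the config above is saved under pm2's default ecosystem file name, which I'm guessing from the comment above - the original mentions "ecosytem.conf":)

```shell
# start the cluster from the ecosystem file, then watch per-process GPU memory
pm2 start ecosystem.config.js
watch -n 1 nvidia-smi
```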

I also want to share one more finding: for CPUs, the best benchmark results are obtained with the setting instances: "max".

vladmandic commented 11 months ago

ah, brilliant! i thought that tensorflow-gpu locking was actually locking, but it was just memory contention. thanks for sharing!