warmshao / FasterLivePortrait

Bring portraits to life in real time! ONNX/TensorRT support! Real-time portrait animation!

TensorRT is slow #45

Open Daniel-Kelvich opened 1 month ago

Daniel-Kelvich commented 1 month ago

I tried using your Docker image and also built my own from scratch. The speed on an NVIDIA L4 is 40 ms/frame, which is ~25 fps (the same as plain torch.compile). The demo shows around 60% GPU load. Is there something I'm missing?

warmshao commented 1 month ago

Please show me the command you used to run the program, provide the running log, and explain why you built from scratch even after using my Docker image.

Daniel-Kelvich commented 1 month ago

docker pull shaoguo/faster_liveportrait:v1

docker run -it --gpus=all \
  --name faster_liveportrait \
  -v ./:/root/FasterLivePortrait \
  --restart=always \
  -p 9870:9870 \
  shaoguo/faster_liveportrait:v1 \
  /bin/bash
ln -sf /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.90.07 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so.550.90.07 /usr/lib/x86_64-linux-gnu/libcuda.so.1

python run.py \
  --src_image assets/examples/source/s10.jpg \
  --dri_video assets/examples/driving/d14.mp4 \
  --cfg configs/trt_infer.yaml

inference median time: 72.76570796966553 ms/frame, mean time: 70.01137866902707 ms/frame
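
For reference, the logged latency converts to roughly 14 fps, not the ~25 fps mentioned above; a quick sanity check in Python (just arithmetic on the log values, not project code):

# Convert the logged per-frame latency to throughput.
mean_ms = 70.01  # mean time from the log above, in ms/frame
print(f"{1000.0 / mean_ms:.1f} fps")  # ~14.3 fps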

warmshao commented 1 month ago

Your setup looks fine, so I don't know why it is slow. I have tested TensorRT on multiple machines and it does provide a speedup.

warmshao commented 1 month ago

I tried using your Docker image and also built my own from scratch. The speed on an NVIDIA L4 is 40 ms/frame, which is ~25 fps (the same as plain torch.compile). The demo shows around 60% GPU load. Is there something I'm missing?

An L4 can achieve ~25 fps? That doesn't seem likely, does it? I've tested it on an RTX 3060 and the FPS was only in the teens.

Daniel-Kelvich commented 1 month ago

It can do 24 fps with torch.compile (only model inference, without the crop and insert steps). So I hoped I could push it further with TensorRT.

warmshao commented 1 month ago

It can do 24 fps with torch.compile (only model inference, without the crop and insert steps). So I hoped I could push it further with TensorRT.

It doesn't make sense to look only at model inference, as the whole process also includes a lot of image processing and time-consuming operations like paste-back.
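
One way to settle where the time actually goes is to time each stage separately. A minimal sketch of the pattern, with hypothetical stage functions (crop, infer, and paste_back are stand-ins, not names from this repo):

import time
from collections import defaultdict

totals = defaultdict(float)

def timed(name, fn, *args):
    # Run fn and accumulate its wall-clock time under `name`.
    start = time.perf_counter()
    out = fn(*args)
    totals[name] += time.perf_counter() - start
    return out

# Dummy stages standing in for the real crop / model / paste-back steps.
def crop(frame):
    return frame

def infer(face):
    return face

def paste_back(frame, out):
    return out

n_frames = 100
for frame in range(n_frames):
    face = timed("crop", crop, frame)
    out = timed("infer", infer, face)
    timed("paste_back", paste_back, frame, out)

for name, secs in totals.items():
    print(f"{name}: {secs / n_frames * 1000:.3f} ms/frame")

Swapping the dummies for the real stage calls would show whether inference or the pre/post-processing dominates the ~70 ms.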

Daniel-Kelvich commented 1 month ago

Yes, but I can optimize those parts in other ways. I am more interested in pure model speed in this case.
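
For a model-only number to be trustworthy, the GPU has to be synchronized around each call, because CUDA kernels launch asynchronously and an unsynchronized timer measures only the launch overhead. A minimal sketch, assuming a CUDA machine (the Conv2d is a stand-in for the real networks, and the input shape is illustrative):

import time
import torch

# Stand-in network; the real pipeline's models would go here.
model = torch.compile(torch.nn.Conv2d(3, 3, 3, padding=1).cuda().eval())
x = torch.randn(1, 3, 256, 256, device="cuda")  # illustrative input shape

with torch.no_grad():
    for _ in range(10):  # warm-up: triggers compilation and autotuning
        model(x)
    torch.cuda.synchronize()

    times = []
    for _ in range(100):
        start = time.perf_counter()
        model(x)
        torch.cuda.synchronize()  # wait for the GPU before stopping the clock
        times.append((time.perf_counter() - start) * 1000)

times.sort()
print(f"median: {times[len(times) // 2]:.2f} ms/frame")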

administer03 commented 1 week ago

Hi guys,

Have you solved the problem yet? @Daniel-Kelvich

I am also facing this problem on an H100. It seems slower than the results in the README.