ttxskk / AiOS

[CVPR 2024] Official Code for "AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation"
https://ttxskk.github.io/AiOS/

Required specs for inference #14

Open fermanda opened 4 months ago

fermanda commented 4 months ago

Thank you for the interesting work. Could you describe the computer specs required to run the inference?

I tried to run the inference, but I kept getting a device ordinal error.

The LOCAL_RANK environment variable seems to be set to 1 by default. I suspect you are using multiple GPUs in a single machine?

Here are my computer specs for running the inference code:

OS   : WSL2 (Windows Subsystem for Linux, Ubuntu 20.04)
CPU  : Intel(R) Core(TM) i7-8700
RAM  : 32 GB
GPU  : NVIDIA RTX 3080 Ti (12 GB)

PyTorch 1.13.0 + CUDA 11.7

I also tested other PyTorch versions, but the error persists.

I modified the utils.init_distributed_mode(args) function in misc.py at line 581, adding a print statement at line 612 to show which GPU is used:

args.distributed = True
print(f"Cuda GPU device set to {args.local_rank}")
torch.cuda.set_device(args.local_rank)
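On a single-GPU machine, a defensive version of this device selection (a hypothetical sketch, not the repo's actual code) could clamp the rank to the devices actually present, which would avoid the invalid device ordinal error:

```python
import os

def pick_local_device(n_gpus, env=None):
    """Clamp LOCAL_RANK (set by torchrun/launch) to the GPUs present,
    avoiding CUDA's 'invalid device ordinal' on single-GPU machines."""
    env = os.environ if env is None else env
    local_rank = int(env.get("LOCAL_RANK", "0"))
    if n_gpus > 0 and local_rank >= n_gpus:
        local_rank = 0  # fall back to the first (only) GPU
    return local_rank

# Then: torch.cuda.set_device(pick_local_device(torch.cuda.device_count()))
```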

And here is the error output:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Cuda GPU device set to 0
Before torch.distributed.barrier()
Cuda GPU device set to 1
Traceback (most recent call last):
  File "main.py", line 441, in <module>
    main(args)
  File "main.py", line 99, in main
    utils.init_distributed_mode(args)
  File "/mnt/d/PythonProject/NERF/AiOS-main/util/misc.py", line 614, in init_distributed_mode
    torch.cuda.set_device(args.local_rank)
  File "/home/testrun/.virtualenvs/torch4/lib/python3.8/site-packages/torch/cuda/__init__.py", line 326, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 17593 closing signal SIGTERM

The line Cuda GPU device set to X is printed twice, before print("End torch.distributed.barrier()") is ever reached. I suppose this is because of multiprocessing?

Then I forced the code to use only GPU 0 by adding os.environ['LOCAL_RANK'] = "0" at the beginning of main.py, but I got the following error:

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Cuda GPU device set to 0
Cuda GPU device set to 0
Before torch.distributed.barrier()
Before torch.distributed.barrier()
Traceback (most recent call last):
Traceback (most recent call last):
  File "main.py", line 441, in <module>
  File "main.py", line 441, in <module>
    main(args)
  File "main.py", line 99, in main
    main(args)
  File "main.py", line 99, in main
    utils.init_distributed_mode(args)
  File "/mnt/d/PythonProject/NERF/AiOS-main/util/misc.py", line 620, in init_distributed_mode
    utils.init_distributed_mode(args)
  File "/mnt/d/PythonProject/NERF/AiOS-main/util/misc.py", line 620, in init_distributed_mode
    torch.distributed.barrier()
  File "/home/testrun/.virtualenvs/torch4/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 3145, in barrier
        work = default_pg.barrier(opts=opts)
torch.distributed.barrier()
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 1000

So, does running the inference require multiple GPUs?
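For reference, the NCCL message above seems to mean that two worker processes were spawned while both ended up pinned to the same physical GPU. A tiny sanity check along these lines (a hypothetical helper, not part of AiOS) would flag the mismatch before torch.distributed.barrier() is reached:

```python
def ranks_fit_on_gpus(world_size, n_gpus):
    """True when every NCCL rank can own a distinct GPU.

    NCCL aborts with 'Duplicate GPU detected' when two ranks map to the
    same physical device -- exactly what happens if two processes are
    spawned but LOCAL_RANK is forced to 0 on a single-GPU machine."""
    return 0 < world_size <= n_gpus
```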

ttxskk commented 4 months ago

Hi @fermanda ,

Sorry for the delayed response. As far as I know, our code can run with one GPU. @WYJSJTU, could you please check this issue?

WYJSJTU commented 3 months ago

I think I've only tested it on machines with at least two GPUs. It might require some modifications to the code if it can't run on a single GPU.

ttxskk commented 3 months ago

Hi @fermanda, how did you run the code? I ran it using the following script and did not encounter this error.

sh scripts/inference.sh data/checkpoint/aios_checkpoint.pth short_video.mp4 demo 2 0.1 1
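If only one GPU is available (or to rule out multi-GPU issues), it may also help to restrict which devices the launcher can see. This is an untested sketch based on the errors above, not something verified against the repo's scripts:

```shell
# Expose a single GPU so every spawned rank maps to visible device 0,
# sidestepping "invalid device ordinal" and NCCL "Duplicate GPU detected".
CUDA_VISIBLE_DEVICES=0 sh scripts/inference.sh \
    data/checkpoint/aios_checkpoint.pth short_video.mp4 demo 2 0.1 1
```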

Note: I have updated the code, so please merge in the latest changes.

fermanda commented 3 months ago

@ttxskk @WYJSJTU, thank you for your reply. I haven't tested with the latest code, but I ran it by following the previous README:

https://github.com/ttxskk/AiOS/commit/95bb85ca16e903d00f4d9a20397deeb063e6b670#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5

Inference

  • Place the mp4 video for inference under AiOS/demo/
  • Prepare the pretrained models to be used for inference under AiOS/data/checkpoint
  • Inference output will be saved in AiOS/demo/{INPUT_VIDEO}_out
cd main
sh scripts/inference.sh {INPUT_VIDEO} {OUTPUT_DIR} 

# For inferencing short_video.mp4 with output directory of demo/short_video_out
sh scripts/inference.sh short_video demo

I would be grateful, and will close this issue, if you could describe the specific computer specs needed to run this code (CPU, RAM, GPU and number of GPUs, and operating system) for easy replication.

Thank you in advance for your help.

formoree commented 1 month ago

> Hi @fermanda,
>
> Sorry for the delayed response. As far as I know, our code can run with one GPU. @WYJSJTU, could you please check this issue?

I have the same problem. I also have multiple GPUs and tried the command with both 1 and 2 GPUs, but hit the same error each time.

ttxskk commented 1 month ago

Hi @formoree, How do you run the script?

formoree commented 1 month ago

> Hi @formoree, How do you run the script?

Hi, I have solved this problem, but I ran into another one in [issue 24](https://github.com/ttxskk/AiOS/issues/24).

ttxskk commented 1 month ago

Hi @formoree, thanks for your feedback. Would you mind telling us how to solve it?

formoree commented 1 month ago

> Hi @formoree, thanks for your feedback. Would you mind telling us how to solve it?

I'm sorry, it's been so long that I've forgotten some details, but I remember it was an issue with the Python package version; adjusting it should help.