Human-Object Interaction Prediction

Introduction

The official implementation of our paper "Human-Object Interaction Prediction in Videos through Gaze Following", accepted by Computer Vision and Image Understanding (CVIU), DOI: https://doi.org/10.1016/j.cviu.2023.103741. An arXiv preprint is available at https://arxiv.org/abs/2306.03597.

Our results

On VidHOI (Oracle mode), mAP (QPIC) and person-wise top-5 metrics:

| Future Time    | mAP (QPIC) | Recall | Precision | Accuracy | F1-score |
|----------------|------------|--------|-----------|----------|----------|
| 0s (Detection) | 38.61      | 70.91  | 59.84     | 51.29    | 62.24    |
| 1s             | 37.59      | 72.17  | 59.98     | 51.65    | 62.78    |
| 3s             | 33.14      | 71.88  | 60.44     | 52.08    | 62.87    |
| 5s             | 32.75      | 71.25  | 59.09     | 51.14    | 61.92    |
| 7s             | 31.70      | 70.48  | 58.80     | 50.56    | 61.36    |

On Action Genome (PredCls mode):

| Rec@10 | Rec@20 | Rec@50 |
|--------|--------|--------|
| 75.4   | 83.7   | 84.3   |

Install

  1. Clone the repository recursively:
    git clone --recurse-submodules https://github.com/nizhf/hoi-prediction-gaze-transformer.git
  2. Create the conda environment. We use mamba to accelerate the installation. In addition, as noted in issue #7, opencv from conda-forge seems to be incompatible with torchvision 0.11.0, so we install opencv via pip.
    conda install mamba -c conda-forge  # install mamba in base environment
    mamba create -n hoi_torch110 python=3.9 -c conda-forge 
    conda activate hoi_torch110  
    mamba install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 torchtext==0.11.0 cudatoolkit=11.3 black Cython easydict gdown imageio ipywidgets matplotlib notebook numpy pandas Pillow PyYAML requests scikit-learn scipy seaborn tqdm tensorboard wandb -c pytorch -c conda-forge
    pip install opencv-python
  3. Our training uses wandb to record training and validation metrics. You may create an account at https://wandb.ai and follow their instructions to log in on your machine, as shown below.
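
    The wandb package is already included in the install command above, so only a one-time login is needed. A minimal sketch (hoi_torch110 is the environment name from step 2; the last line is an optional sanity check of the PyTorch/CUDA install, not part of the login):
    conda activate hoi_torch110
    wandb login  # paste the API key shown at https://wandb.ai/authorize when prompted
    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect 1.10.0 and True on a CUDA-capable machine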

Inference on an arbitrary video

  1. Some weights are downloaded automatically. Others must be downloaded manually from here: all weights in weights/sttrangaze and weights/yolov5/vidor_yolov5l.pt. Put them into the corresponding weights/... folders in this repo. If the automatic download does not work, you can also obtain those weights from the same link.
  2. Run the run.py script
    # Detection
    python run.py --source path/to/video --out path/to/output_folder --future 0 --hoi-thres 0.3 --print
    # For anticipation, set future to 1, 3, 5, or 7

    This script creates a video with object tracking and gaze following overlays. The detected HOIs are saved in the output folder and printed to the console at 1 FPS.
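
    For example, a hypothetical invocation that anticipates HOIs 3 seconds into the future (demo.mp4 and output/ are placeholder paths; all flags are the ones documented above):
    # anticipate interactions 3 seconds ahead and print them to the console
    python run.py --source demo.mp4 --out output/ --future 3 --hoi-thres 0.3 --print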

Train and evaluate the model

Please follow these instructions to prepare the datasets and train our model.

Citation

If our work is helpful for your research, please consider citing our publication:

@article{NI2023103741,
  title = {Human–Object Interaction Prediction in Videos through Gaze Following},
  journal = {Computer Vision and Image Understanding},
  volume = {233},
  pages = {103741},
  year = {2023},
  issn = {1077-3142},
  doi = {10.1016/j.cviu.2023.103741},
  url = {https://www.sciencedirect.com/science/article/pii/S1077314223001212},
  author = {Zhifan Ni and Esteve {Valls Mascar\'o} and Hyemin Ahn and Dongheui Lee},
}