kenshohara / 3D-ResNets-PyTorch

3D ResNets for Action Recognition (CVPR 2018)
MIT License

Is there a simple way to get inference results on video files that are not part of the considered data sets? #237


quenot commented 3 years ago

Any hint would be welcome, including, if necessary, how to patch the code for that. Thanks for sharing this excellent work. Best regards, Georges.

dribnet commented 3 years ago

I made a standalone utility to do this: it can run inference on any of their three pre-trained models given an input sequence of image files. The utility, rawrun.py, is in my forked copy, and I'd be happy to contribute it back if it is useful.

For example, if you have the Kinetics dataset already prepared, you can try running inference on some of the training data with something like:

python rawrun.py \
  --depth 50 \
  --input-glob 'data/kinetics_videos/jpg/yoga/0wHOYxjRmlw_000041_000051/image_000{41,42,43,44,45,46,47,48,49,50,41,42,43,44,45,46}.jpg'

This runs the 16 input images through the ResNet-50 model and reports back JSON with the top-3 results:

[{'label': 'yoga', 'score': 0.9823519587516785},
 {'label': 'head stand', 'score': 0.0077535356394946575},
 {'label': 'bending back', 'score': 0.0018582858610898256}]
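(For anyone who wants to roll their own rather than use rawrun.py, the core of such a script is small. Below is a minimal sketch of the inference pass using torchvision's pretrained `r3d_18` as a stand-in for this repo's checkpoints; loading the repo's own `.pth` files instead would go through its `generate_model` helper and need the `module.` prefixes from `DataParallel` stripped. The `frames/*.jpg` path is illustrative, and the normalization constants are torchvision's published Kinetics-400 statistics, not necessarily the ones rawrun.py uses.)

import glob
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.video import r3d_18

# Stand-in model: torchvision's 3D ResNet-18 pretrained on Kinetics-400.
model = r3d_18(pretrained=True).eval()

# Per-frame preprocessing; mean/std are torchvision's Kinetics-400 stats.
preprocess = transforms.Compose([
    transforms.Resize((112, 112)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.43216, 0.394666, 0.37645],
                         std=[0.22803, 0.22145, 0.216989]),
])

# Load a 16-frame clip from image files (path is illustrative).
paths = sorted(glob.glob('frames/*.jpg'))[:16]
frames = torch.stack([preprocess(Image.open(p).convert('RGB')) for p in paths])

# (T, C, H, W) -> (1, C, T, H, W), the layout 3D conv models expect.
clip = frames.permute(1, 0, 2, 3).unsqueeze(0)

with torch.no_grad():
    probs = torch.softmax(model(clip), dim=1)[0]

top = torch.topk(probs, k=3)
for score, idx in zip(top.values, top.indices):
    print(idx.item(), round(score.item(), 4))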

For reference, here's what those particular 16 input files look like in order: (image: yoga_grid_16)

Note that this particular input-glob exactly matches the training data (incidentally, I found it bizarre that all of the training data had these "skips" in it instead of being temporally contiguous), so we would expect a good score. The rawrun utility can also take fewer than 16 input files and will pad them appropriately (see the padding sketch after this example), so you can get results identical to the above by providing just the 10 marked frames:

python rawrun.py \
  --depth 50 \
  --input-glob 'data/kinetics_videos/jpg/yoga/0wHOYxjRmlw_000041_000051/image_000{41,42,43,44,45,46,47,48,49,50}.jpg'
[{'label': 'yoga', 'score': 0.9823519587516785},
 {'label': 'head stand', 'score': 0.0077535356394946575},
 {'label': 'bending back', 'score': 0.0018582858610898256}]
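(That 10 frames give scores identical to the duplicated 16-frame glob suggests the padding simply loops the clip back to its start until it reaches 16 frames, which matches the LoopPadding temporal transform in this repo. A sketch of that idea; the function name here is mine, not the repo's:)

def loop_pad(frames, size=16):
    """Repeat frames cyclically until the clip reaches `size` entries."""
    out = list(frames)
    i = 0
    while len(out) < size:
        out.append(out[i])
        i += 1
    return out

print(loop_pad(list(range(41, 51))))
# [41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 41, 42, 43, 44, 45, 46]
# i.e. exactly the duplicated indices in the first input-glob above.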

Or you can even test individual files to see how well the model does with only spatial cues and without access to any temporal differences:

python rawrun.py \
  --depth 50 \
  --input-glob 'data/kinetics_videos/jpg/yoga/0wHOYxjRmlw_000041_000051/image_00041.jpg'
[{'label': 'yoga', 'score': 0.7186599969863892},
 {'label': 'head stand', 'score': 0.14070658385753632},
 {'label': 'bending back', 'score': 0.02092701569199562}]

Hope this is useful for others. Note that a huge part of getting this utility to work was getting access to the labels in the correct training order for the pre-trained models, so I want to thank @Jimmy880 for tracking these down recently in #211.
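(In code, the label-ordering issue looks like this: the model only emits class indices, so you need the class names in exactly the order used at training time. A minimal sketch, assuming an ordered list stored in a kinetics_labels.json file; the filename is hypothetical, and the actual ordering for these checkpoints came from #211:)

import json
import torch

# Ordered class names as used at training time; filename is hypothetical.
with open('kinetics_labels.json') as f:
    labels = json.load(f)  # e.g. ["abseiling", "air drumming", ...]

def top_k(probs: torch.Tensor, k: int = 3):
    """Map the k highest-probability indices back to label strings."""
    values, indices = torch.topk(probs, k)
    return [{'label': labels[i], 'score': v.item()}
            for v, i in zip(values, indices)]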

Purav-Zumkhawala commented 3 years ago

See if this helps: [comment link]