huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation #24915

Open amyeroberts opened 1 year ago

amyeroberts commented 1 year ago

Model description

ViTPose is a model for 2D human pose estimation, a subset of the keypoint detection task #24044

It provides a simple baseline for vision-transformer-based human pose estimation: a pretrained vision transformer backbone extracts features, and a simple decoder head processes them into keypoint predictions. Despite having no elaborate design components, ViTPose obtained state-of-the-art (SOTA) performance of 80.9 AP on the MS COCO Keypoint test-dev set.
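The "simple decoder head" in the paper is a lightweight deconvolution stack that upsamples the ViT patch features and predicts one heatmap per keypoint. A minimal PyTorch sketch of that idea (illustrative only; the class name, layer sizes, and shapes are assumptions, not the eventual `transformers` implementation):

```python
import torch
from torch import nn

class SimplePoseHead(nn.Module):
    """Hypothetical sketch of ViTPose's simple decoder idea:
    two deconv layers upsample the patch feature map 4x, then a
    1x1 conv emits one heatmap per keypoint."""

    def __init__(self, hidden_size=768, num_keypoints=17):
        super().__init__()
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(hidden_size, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(256, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
        )
        self.final = nn.Conv2d(256, num_keypoints, kernel_size=1)

    def forward(self, patch_features):
        # patch_features: (batch, hidden_size, h, w) — ViT patch tokens
        # reshaped back into a 2D grid by the (assumed) backbone.
        return self.final(self.deconv(patch_features))

head = SimplePoseHead()
# e.g. a 256x192 crop with 16px patches gives a 16x12 feature grid
feats = torch.randn(1, 768, 16, 12)
heatmaps = head(feats)
print(tuple(heatmaps.shape))  # (1, 17, 64, 48)
```

Each `ConvTranspose2d` with `kernel_size=4, stride=2, padding=1` doubles the spatial resolution, so the 16x12 grid becomes 64x48, one channel per COCO keypoint.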

Open source status

Provide useful links for the implementation

Code and weights: https://github.com/ViTAE-Transformer/ViTPose
Paper: https://arxiv.org/abs/2204.12484

@Annbless

ydshieh commented 1 year ago

Glad you get something different to work on 🚀 👀 🎉

shauray8 commented 1 year ago

Hi @amyeroberts, I don't know if you're working on this, but if not I'd be more than happy to take it up.

ydshieh commented 1 year ago

Oh, this is the issue page, not the PR page!

amyeroberts commented 1 year ago

@shauray8 You're very welcome to take this up! :)

This model presents a new task for the library, so there might be some iterations and discussions on what the inputs and outputs should look like. The model translation should be fairly straightforward though, so I'd suggest starting with a PR that implements that and then on the PR we can figure out what works best.
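Since the output format is still to be decided, one plausible shape for the post-processing is the standard heatmap-to-keypoint decoding used by heatmap-based pose estimators like ViTPose: take each heatmap's argmax as the keypoint location and its peak value as the confidence. A hedged NumPy sketch (the function name and output layout are illustrative, not a proposed API):

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Decode (num_keypoints, height, width) heatmaps into an
    (num_keypoints, 3) array of (x, y, score) rows by taking
    each heatmap's argmax. Illustrative only."""
    num_keypoints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_keypoints, -1)
    idx = flat.argmax(axis=1)
    ys, xs = np.unravel_index(idx, (height, width))
    scores = flat.max(axis=1)
    return np.stack([xs, ys, scores], axis=1)

# Fake 17-keypoint heatmaps with a known peak for keypoint 0.
hm = np.zeros((17, 64, 48))
hm[0, 10, 20] = 1.0
kps = heatmaps_to_keypoints(hm)
print(kps[0])  # [20. 10.  1.]
```

Real implementations usually refine the argmax with sub-pixel adjustment and rescale coordinates back to the original image, but the basic input/output contract (image crop in, per-keypoint coordinates and scores out) looks like this.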