[RFC] Batteries Included - Phase 3

datumbox commented 2 years ago

🚀 The feature

Note: To track the progress of the project check out this board.

This is the 3rd phase of TorchVision's modernization project (see phase 1 and 2). We aim to keep TorchVision relevant by ensuring it provides, off the shelf, all the necessary primitives, model architectures and recipe utilities to produce SOTA results for the supported Computer Vision tasks.

1. New Primitives

To enable our users to reproduce the latest state-of-the-art research, we will enhance TorchVision with the following data augmentations, layers, losses and other operators:

Data Augmentations

[ ] AutoAugment for Detection [1, 2] - #6224 #6609
[ ] Mosaic [1, 2] - #6534
[ ] Mixup for Detection [1, 2] - #6720 #6721
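
For readers unfamiliar with the "Mixup for Detection" item, here is a minimal sketch of the general idea: blend two images pixel-wise with a weight drawn from a Beta distribution and keep the boxes of both. This is not the TorchVision implementation from the linked PRs. The function name `mixup_detection`, the default `alpha` and the nested-list image representation are assumptions made for illustration, and the sketch only handles images of the same size.

```python
import random


def mixup_detection(img_a, boxes_a, img_b, boxes_b, lam=None, alpha=1.5):
    """Blend two same-sized images and keep the boxes of both (illustrative only).

    img_a, img_b: 2D lists of floats (rows of pixel values), same shape.
    boxes_a, boxes_b: lists of (x1, y1, x2, y2) tuples.
    lam: blending weight for img_a; drawn from Beta(alpha, alpha) if None.
    """
    if lam is None:
        # Hypothetical default; the right alpha depends on the recipe.
        lam = random.betavariate(alpha, alpha)

    same_rows = len(img_a) == len(img_b)
    same_cols = all(len(ra) == len(rb) for ra, rb in zip(img_a, img_b))
    if not (same_rows and same_cols):
        raise ValueError("this sketch assumes both images have the same size")

    # Pixel-wise convex combination of the two images.
    mixed = [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(ra, rb)]
        for ra, rb in zip(img_a, img_b)
    ]
    # Objects from both images remain visible, so both box sets are kept.
    boxes = list(boxes_a) + list(boxes_b)
    return mixed, boxes, lam
```

Passing `lam` explicitly makes the result deterministic, which is convenient when testing a training pipeline.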

Losses

[ ] Dice Loss [1, 2] - #6435 #6960
[ ] Poly Loss [1, 2] - #6439 #6457
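
As a rough illustration of the two losses listed above, the sketch below computes a soft Dice loss for a flat binary mask and the Poly-1 variant of Poly Loss for a single classification sample, in plain Python. The function names, the `smooth` and `epsilon` defaults and the list-based inputs are assumptions for illustration. The code proposed in the linked issues may differ in reduction, smoothing and tensor layout.

```python
import math


def dice_loss(pred, target, smooth=1e-6):
    """Soft Dice loss for flat sequences of probabilities and 0/1 labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    # The smoothing term avoids division by zero when both masks are empty.
    dice = (2.0 * intersection + smooth) / (denominator + smooth)
    return 1.0 - dice


def poly1_cross_entropy(probs, target_index, epsilon=1.0):
    """Poly-1 loss for one sample: cross-entropy plus epsilon * (1 - p_t).

    probs: class probabilities that sum to 1 (e.g. softmax output).
    target_index: index of the true class.
    """
    p_t = probs[target_index]
    cross_entropy = -math.log(p_t)
    return cross_entropy + epsilon * (1.0 - p_t)
```

With `epsilon=0` the Poly-1 function reduces to plain cross-entropy, which makes it easy to compare against an existing baseline.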

Operators added in PyTorch Core

[ ] LARS Optimizer [1, 2] - https://github.com/pytorch/pytorch/pull/88106
[ ] LAMB Optimizer [1, 2] - #6868
[x] Polynomial LR Scheduler [1, 2] - code - https://github.com/pytorch/pytorch/pull/82769
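
To make the Polynomial LR Scheduler item concrete, here is a small sketch of polynomial learning-rate decay as a closed-form function of the step count. It shows the general shape of the schedule only. The function name and parameter names are chosen for illustration, and whether it matches the scheduler merged in the linked PyTorch PR term for term is not confirmed here.

```python
def polynomial_lr(base_lr, step, total_iters, power=1.0):
    """Decay base_lr to zero over total_iters steps with a polynomial curve."""
    if total_iters <= 0:
        raise ValueError("total_iters must be positive")
    # Clamp so the rate stays at zero once the schedule has finished.
    progress = min(step, total_iters) / total_iters
    return base_lr * (1.0 - progress) ** power


if __name__ == "__main__":
    for step in range(0, 12, 2):
        print(step, polynomial_lr(0.1, step, total_iters=10, power=2.0))
```

With `power=1.0` the decay is linear. Larger powers drop the rate faster early on and flatten out towards the end.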

2. New Architectures & Model Iterations

To ensure that our users have access to the most popular SOTA models, we will add the following architectures along with pre-trained weights:

Image Classification

[x] Swin Transformer V2 - #6242 #6246
[ ] MobileViT v1 & v2 [1, 2] - #6404
[x] MaxViT - #6342

Video Classification

[x] MViTv2 [1] - #6373
[x] Swin3d [1] - #6499 #6521
[x] S3D [1] - #6402 #6412 #6537

3. Improved Training Recipes & Pre-trained models

To ensure that our users have access to strong baselines and SOTA weights, we will improve our training recipes to incorporate the newly released primitives and offer improved pre-trained models:

Reference Scripts

[ ] Update the Reference Scripts to use the latest primitives - #6405 #6433

Pre-trained weights

[ ] Improve the accuracy of Video models

Other Candidates

There are several other Operators (#5414), Losses (#2980), Augmentations (#3817) and Models (#2707) proposed by the community. Here are some potential candidates that we could implement depending on bandwidth. Contributions are welcome for any of the below:

YOLOX [1] - #6341
DeTR - #5922 #6922
U-Net - #6610 #6611
MViTv2 for Images [1]
Video Transformer Network [1]
MTV
Deformable DeTR
Shortcut Regularizer (FX-based)
Hide-and-Seek - #6796

cc @datumbox @vfdev-5

deepwilson commented 1 year ago

@oke-aditya Thanks. :) Are there any open topics? I can see that many of the topics/tasks are already taken.

datumbox commented 1 year ago

@deepwilson It's a tough period for the team as it doesn't have enough resources. I have changed jobs myself, so it's harder to follow up on every ongoing initiative. It would be very nice to finally add DETR to the library, but training it might be a challenge. Not sure if @pmeier or @vfdev-5 have any good issues where they could use community help?
