huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Includes train, eval, inference, and export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0
31.6k stars · 4.71k forks

[FEATURE] Detection models with loss functions and trainers #1353

Open pfeatherstone opened 2 years ago

pfeatherstone commented 2 years ago

It would be great if this repo started benchmarking detection models such as the YOLO family and DETR on the COCO dataset. It's a big ask, I know, but it would be fantastic.

rwightman commented 2 years ago

@pfeatherstone I'm starting to work on object detection this summer. But I cannot bring in existing YOLO models: they are either GPL code + weights, or darknet based, which is a bit bleh to work with. I will initially be focusing on a YOLO-inspired set of models that strikes a good balance of speed and detection performance (hoping to match v5/v6/v7 in capability), leveraging timm backbones.

pfeatherstone commented 2 years ago

Awesome! Is it just the weights that are GPL, or the actual model architectures too? Surely writing the models in timm and training from scratch is fine, right?

rwightman commented 2 years ago

@pfeatherstone I believe it would be very difficult to apply the GPL to the full specification of an architecture, especially since there are papers, and a mix of GPL and non-GPL codebases, covering aspects of the architecture(s), losses, and training (aug) pipelines. I have enough detail to reproduce it without cutting and pasting or closely following any existing GPL code.

rsomani95 commented 2 years ago

@rwightman I haven't looked at v6, but v5 and v7 are too rigid to work with -- there's no clear separation of the backbone, neck, and bbox head, and the forward passes are rigid in that they rely on attributes of layers to determine the network path, making them near impossible to modify for different tasks (in my experience). I'm curious if this is the "bleh" part of working with them, or did you have something different in mind?

I've had success with plugging in timm backbones to YOLOX. Not having to depend on anchor boxes makes it easier to work with images of different sizes too. If there's scope for it, I'd love to contribute.

Here is a minimal example of how I've configured a YOLOX experiment with a timm architecture.

zhiqwang commented 2 years ago

I haven't looked at v6, but v5 and v7 are too rigid to work with -- there's no clear separation of the backbone, neck, and bbox head, and the forward passes are rigid in that they rely on attributes of layers to determine the network path, making them near impossible to modify for different tasks (in my experience).

Just FYI @rsomani95: to resolve the scalability problem of YOLOv5's YAML-parsed model-building mechanism, we restructured YOLOv5's model into four sub-modules following the layout of TorchVision. I guess it could also be used to integrate with timm.

Specifically, we expand the YOLOv5 YAML-style backbone into a torchvision-like write-up in darknetv6.py. Furthermore, we provide a mechanism for the unfolded model to load the official YOLOv5 checkpoints, which also ensures that our conversion is lossless.
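Schematically, that torchvision-style decomposition looks something like the following. The four module names here are a plausible guess (backbone, neck, head, post-processing), not necessarily the exact sub-modules of the restructured YOLOv5; the point is that each stage is an independent `nn.Module` that can be swapped, e.g. for a timm backbone.

```python
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Illustrative torchvision-style layout: four swappable stages."""

    def __init__(self, backbone: nn.Module, neck: nn.Module,
                 head: nn.Module, post_process: nn.Module):
        super().__init__()
        self.backbone = backbone          # images -> multi-scale features
        self.neck = neck                  # feature fusion (e.g. PAN/FPN)
        self.head = head                  # features -> raw predictions
        self.post_process = post_process  # decode + NMS

    def forward(self, images):
        feats = self.backbone(images)
        feats = self.neck(feats)
        preds = self.head(feats)
        return self.post_process(preds)
```

Keeping the stages separate is what makes the YAML-parsed monolith tractable: each sub-module has a clear input/output contract, so checkpoint weights can be remapped stage by stage.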

pfeatherstone commented 2 years ago

@rwightman do you have a rough timeline for this?

Chris-hughes10 commented 2 years ago

@rwightman feel free to reach out when you are making a start on this; I have a bunch of stuff that may be useful to this effort and would be happy to get involved.

tyler-romero commented 1 year ago

I'm interested in a more centralized, pluggable repo for running and finetuning object detection models. Is this functionality still in the cards for timm?