pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

[RFC] Support YOLOX detection model #6341

Open zhiqwang opened 2 years ago

zhiqwang commented 2 years ago

🚀 The feature

YOLO, a.k.a. You Only Look Once, is a vibrant series of object detection models that began with the release of Joseph Redmon's You Only Look Once: Unified, Real-Time Object Detection.

So far, a couple of the more notable implementations are as follows (all PyTorch):

Motivation, pitch

To date, probably the most successful one is YOLOv5. YOLOv5 is great, and its authors have also built up a very friendly community and ecosystem. We don't intend to copy YOLOv5 into TorchVision; our main goal here is to make training SoTA models easier and to share reusable subcomponents for building the next SoTA models in the same or a proxy family.^yolo-keras

YOLOX is a high-performance anchor-free YOLO. It strikes a good balance in terms of licensing and code quality, and from the community's perspective a YOLOX implementation would be enough.

The License

YOLO{v5/v7} are released under the GPL-3.0 license, while YOLOX is released under the Apache-2.0 license.

More context

I have previously rewritten the inference part of YOLOv5 to follow the style and conventions of torchvision^yolort, and I can relicense that part under the BSD-3-Clause license. With the help of the YOLOX base code, the amount of work involved in the model inference part is not large.

Data augmentation and a new trainer engine will be the core of what we will do here.

The data augmentation section is on the planning list (https://github.com/pytorch/vision/issues/6224), and we have already merged some augmentation methods like https://github.com/pytorch/vision/pull/5825 . I think this would help us build the next SoTA models with new primitives, as was done for the classification models.^classification-primitives

As TorchVision adds more and more models, it may be time to abstract out a simple trainer engine for sharing reusable subcomponents. It might be more appropriate to open a new thread to discuss the necessity of, and concrete steps for, this part.

cc @datumbox @YosuaMichael @oke-aditya

datumbox commented 2 years ago

@zhiqwang Thank you very much for the comprehensive proposal. :)

Your implementation at YOLO5-RT-stack is indeed of very high quality. Having a modern implementation of YOLO was on our bucket list, but I just want to be mindful and not cannibalise your project. After all, PyTorch's unique value proposition is its rich ecosystem. Having said that, if you are happy to upstream parts of your repo to TorchVision, then we would absolutely love to have it. Rest assured that if we do add it, we will make sure to provide all the necessary credit to the OSS contributors who made it possible. I know that your coding style and practices are very aligned with the ones used in TorchVision, so I agree v5 would probably be the easiest step forward.

I have a couple of questions for you:

  1. Did you manage to reproduce the original results using your own implementation, or do you provide ported weights? Do you have access to a cluster to train such models (don't worry if you don't, we can work something out on our side)?
  2. Other than the transforms you mentioned above which we already added/plan-to-add, what other augmentations are we missing? I suspect Mosaic was used for this model. Anything else?
  3. I assume you will implement CSPDarknet for the backbone. Which other backbones do you plan to drop on the classification side?
  4. Can you clarify whether your implementation is a complete rewrite or if you use components from a GPL3 implementation? If you didn't fully rewrite the code, it would be safer for us to reach out to the authors of the reference implementations and confirm they would be happy for us to progress.

Concerning the training engine, I completely agree we should refactor a large part of our reference scripts to inherit and reuse components. My recommendation, though, is not to tie this work to the addition of YOLO, as this is already a very big project. There are also various potential solutions that we might want to leverage (for example TorchRecipes), but this will require additional chats.

I would suggest the next step is to clarify the above and decide how to progress. Possibly we will need to split the project into subprojects and potentially invite more contributors to help out. We could tackle this as part of https://github.com/pytorch/vision/issues/6323 and leverage the community.

zhiqwang commented 2 years ago

Hi @datumbox

  1. Did you manage to reproduce the original results using your own implementation, or do you provide ported weights? Do you have access to a cluster to train such models (don't worry if you don't, we can work something out on our side)?

I have only verified on a few images with the ported weights, because there are some differences in our preprocessing compared to the original yolov5 version. I have not been able to do a complete verification on the coco dataset, and I can add some more detailed comparison data in the next few days.

I don't have a server at hand. We can divide the task into smaller parts and eventually put all the modules together, and then being able to train with your help is the best option.

  2. Other than the transforms you mentioned above which we already added/plan-to-add, what other augmentations are we missing? I suspect Mosaic was used for this model. Anything else?

As you said, the most important part of the data augmentation is the mosaic technique. It was first introduced in https://github.com/ultralytics/yolov3/issues/310#issuecomment-546435541, and there is a similar discussion in https://github.com/AlexeyAB/darknet/issues/3114#issuecomment-552158041. The mosaic technique is helpful for detecting smaller objects, and I think it is the key technology that allows the YOLO series to be trained from scratch (see the sketch after the quote below). I quote Jocher's conclusions here:

The smaller cars are detected earlier with less blinking and cars of all sizes show better behaved bounding boxes.
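
To make the idea concrete, here is a minimal sketch of the 2x2 mosaic in plain PyTorch. The real YOLOv5/YOLOX versions also jitter the mosaic center and take a random crop, which is omitted here; all names are illustrative:

```python
import torch

def mosaic4(images, boxes, size=640):
    # Minimal 2x2 mosaic sketch: paste four `size x size` CHW images onto
    # one canvas and shift each image's (x1, y1, x2, y2) boxes accordingly.
    canvas = torch.zeros(3, 2 * size, 2 * size)
    offsets = [(0, 0), (0, size), (size, 0), (size, size)]  # (dy, dx)
    shifted = []
    for img, bxs, (dy, dx) in zip(images, boxes, offsets):
        canvas[:, dy:dy + size, dx:dx + size] = img
        b = bxs.clone()
        b[:, [0, 2]] += dx  # shift x coordinates into this quadrant
        b[:, [1, 3]] += dy  # shift y coordinates
        shifted.append(b)
    return canvas, torch.cat(shifted)
```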

In addition, v5 also uses the following enhancements:

It seems that the random_fliplr augmentation corresponds to RandomHorizontalFlip in TorchVision:

https://github.com/pytorch/vision/blob/b30fa5c13c2d3409c25c0bc706af10c608725617/references/detection/transforms.py#L30
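
For reference, the core of that transform, a horizontal flip that also remaps the boxes, looks roughly like the sketch below; the reference implementation linked above additionally handles masks and keypoints:

```python
import torch

def hflip_with_boxes(image, boxes):
    # Mirror a CHW image and remap (x1, y1, x2, y2) boxes to the flipped frame.
    width = image.shape[-1]
    image = image.flip(-1)
    boxes = boxes.clone()
    boxes[:, [0, 2]] = width - boxes[:, [2, 0]]  # new x1 = W - old x2, etc.
    return image, boxes
```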

zhiqwang commented 2 years ago

  3. I assume you will implement CSPDarknet for the backbone. Which other backbones do you plan to drop on the classification side?

YOLO{v5/v7/X} train their detection models from scratch, and they now use backbones that differ from the DarkNet presented in the original paper. It would be nice to have a model pre-trained on ImageNet to help accelerate our training, but it is up for debate whether it is necessary to implement the original version of CSPDarknet.

  4. Can you clarify whether your implementation is a complete rewrite or if you use components from a GPL3 implementation? If you didn't fully rewrite the code, it would be safer for us to reach out to the authors of the reference implementations and confirm they would be happy for us to progress.

The code currently in the folder https://github.com/zhiqwang/yolov5-rt-stack/tree/main/yolort/models was written from scratch; I only call some common functions of YOLOv5. Those parts have been rewritten in YOLOX, so we can call YOLOX's common functions instead (or rewrite them ourselves) to remove this dependency. The main reason I used YOLOv5's common functions was to be able to load checkpoints trained with YOLOv5.

Concretely, I restructured YOLOv5's yaml-parsing mechanism into the following three sub-modules in the layout of TorchVision:

datumbox commented 2 years ago

@zhiqwang I had the chance to investigate the references a bit further. The biggest concern about YOLOv5 is that there is still no paper accompanying the architecture (see https://github.com/ultralytics/yolov5/issues/1333); I remember that it first came out as a repo and the owners said the paper would be coming out shortly, but I don't think there currently is one. Though it's a very popular architecture that achieves good results, the lack of a paper is a problem, as we usually focus on canonical implementations and expansions that have been studied in research. YOLOX seems like a viable alternative. Perhaps that's the way forward to avoid licensing issues, wdyt?

I have not been able to do a complete verification on the coco dataset, and I can add some more detailed comparison data in the next few days.

Sounds good. I think it's worth confirming that the implementation yields the expected accuracy prior to deciding to adopt it.

then being able to train with your help is the best option

Sounds good. We can follow a similar approach as with FCOS but omitting the original training. We have the capacity to train such networks internally, so you don't have to have your own infra.

mosaic

We should implement the mosaic augmentation and add it to the references first. Then, once @pmeier and @vfdev-5 are back, we can examine implementing it as a transform in the new API.

mixup

I'm a bit surprised to see it in the list (I haven't checked the references). Do they use mixup for detection? Do they adjust the probabilities of the labels? And what about the boxes, are they also multiplied by the weights?
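
For context, my rough understanding is that detection-style mixup in the YOLO family blends the two images and simply concatenates the targets rather than reweighting label probabilities; a minimal sketch of that reading (not verified against the v5 reference, and the blending distribution varies by implementation):

```python
import torch

def detection_mixup(img1, targets1, img2, targets2, alpha=8.0):
    # Blend two images and keep the union of their boxes/labels unweighted.
    # The Beta(alpha, alpha) blending ratio is illustrative only.
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * img1 + (1.0 - lam) * img2
    boxes = torch.cat([targets1["boxes"], targets2["boxes"]])
    labels = torch.cat([targets1["labels"], targets2["labels"]])
    return mixed, {"boxes": boxes, "labels": labels}
```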

It would be nice to have a pre-trained model on the ImageNet to help accelerate our training, but it is up for debate whether it is necessary to implement the original version of CSPDarknet.

I can confirm that training from scratch usually yields better results. I'm OK not adding a Darknet arch in TorchVision; it's kind of old.

oke-aditya commented 2 years ago

I can confirm that training from scratch usually yields better results. I'm OK not adding a Darknet arch in TorchVision; it's kind of old.

Agreed, but as such we will need DarkNet to implement YOLOX, so it might just be better to provide it too. Although it's old, it's still used and relevant; it's quite a fundamental model, like AlexNet. So I feel it might be good to add.

zhiqwang commented 2 years ago

YOLOX seems like a viable alternative. Perhaps that's the way forward to avoid licensing issues

Hi @datumbox , agreed on this one. It is a good choice to add YOLOX to the list first.

datumbox commented 2 years ago

@zhiqwang Sounds good. I'll need to dig a bit into the YOLOX paper and familiarize myself. I'll try to do this by EOW. In the meantime, if you have in mind a clear plan of intermediary milestones for adding YOLOX, please add it here (aka addition of X arch, Y transforms, Z operators etc). This will hopefully let us coordinate among contributors.

datumbox commented 2 years ago

@zhiqwang I'm late by a week. Sorry, I got caught up in other pieces of work.

I've gone through the bibliography around YOLOX and here are some thoughts:

Mosaic and MixUp are worth implementing. I've added tasks for them on the #6323 issue. Whether we will go ahead with the rest depends on your bandwidth. Is this something you would like to pick up and lead? If yes, we can find a POC on our side who would assist with the model validation and training resources. Let me know, thanks!

zhiqwang commented 2 years ago

Hi @datumbox ,

YOLOX is based more on YOLOv3 than on YOLOv5. This has the positive that it's based on published work, but the negative that it is less relevant to the previous work you did on the YOLO5-RT-stack repo.

I agree with you here. YOLOX has a good balance in terms of licensing and code quality, and it's enough to have a YOLOX implementation from the community's perspective.

Whether we will go ahead with the rest depends on your bandwidth. Is this something you would like to pick up and lead?

Sorry for not having enough bandwidth to work on this recently. But I can help review the code and support deployment if there is such a need.

datumbox commented 2 years ago

Sorry for not having enough bandwidth to work on this recently. But I can help review the code and support deployment if there is such a need.

@zhiqwang Thanks for getting back to me. I completely understand. Unfortunately we are very constrained in terms of headcount and bandwidth at the moment, and I don't think any of the maintainers can pick this up. Originally the idea of you picking up and leading this initiative was very promising, as you have extensive experience with the YOLO architecture from your earlier work. But I understand that since we are interested in porting YOLOX and not v5, that would significantly increase your workload. I'm happy to leave this open in case your situation changes in the future.

Since we are here, let me do the cheeky move and check if any of the original authors of YOLOX would be interested in contributing an implementation to TorchVision? @FateScript @Joker316701882 @GOATmessi7

FateScript commented 2 years ago

Thanks @zhiqwang for your proposal and @datumbox for inviting me. I would like to contribute an implementation of YOLOX to torchvision, and I wonder what I should do to complete this goal?

datumbox commented 2 years ago

@FateScript thanks for responding. We would love to have a modern YOLO iteration in TorchVision. Currently we don't offer any variant of this architecture, which means that researchers can't do off-the-shelf comparisons.

I don't know how familiar you are with the TorchVision code-base. As with every library, it has its own idioms and quirks, so this is an exercise in porting your original code to follow those idioms. I've listed a few thoughts on what needs to be done in the following comment; let me know your thoughts: https://github.com/pytorch/vision/issues/6341#issuecomment-1213020861

To summarize, we would need to implement the specific backbones that are not supported, plus the YOLOX architecture, along with any utilities that are not already available in TorchVision. Hopefully we already support many such ops (bbox ops and IoU estimation, bbox encoding & matching, anchor utils (I'm aware YOLOX is anchor-free), etc.). We can provide assistance in the form of PR reviews and model training (using our own compute).
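
For instance, the box utilities below already ship in torchvision.ops and could be reused rather than re-implemented (a quick, self-contained illustration):

```python
import torch
from torchvision.ops import box_convert, box_iou, generalized_box_iou

# Convert a YOLO-style (cx, cy, w, h) box to (x1, y1, x2, y2) and compare it
# against a prediction with the IoU variants TorchVision already provides.
pred = torch.tensor([[0.0, 0.0, 10.0, 10.0]])
target = box_convert(torch.tensor([[7.0, 7.0, 10.0, 10.0]]),
                     in_fmt="cxcywh", out_fmt="xyxy")
print(box_iou(pred, target))             # pairwise IoU matrix
print(generalized_box_iou(pred, target)) # GIoU, common in YOLO-style losses
```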

I'll leave you to check some of the references and let me know your thoughts. It would be really awesome to work with you. Being one of the original authors of YOLOX means it should be easier for you to adapt the implementation and faster for us to review it.

FateScript commented 2 years ago

@datumbox Sorry for my late reply; I'm on vacation these days.

I checked the references you mentioned above, and I think implementing a YOLOX model in torchvision is not too hard. The main effort here is the data transforms and the model architecture. I've decided to set aside one day per week to complete this.

BTW, is there any DDL for me?

datumbox commented 2 years ago

I've decided to set aside one day per week to complete this.

@FateScript That's awesome, thanks a lot for doing this!

BTW, is there any DDL for me?

Sorry, what do you mean by DDL?

But I can help review the code and support deployment if there is such a need.

@zhiqwang Just wanted to check if you still want to be involved in supporting Feng during the PRs, or if I should find a POC on our side for this. Totally depends on your bandwidth.

FateScript commented 2 years ago

Sorry, what do you mean by DDL?

I mean, deadline.

datumbox commented 2 years ago

@FateScript No deadlines from our side. We appreciate that you are dedicating your time to an open-source project, and we are thankful. :)

Just a date to keep in mind in case we aim to make the model available in v0.14: all PRs for that release need to be merged by the beginning of October. Anything merged after that will be released with v0.15.

zhiqwang commented 2 years ago

Just wanted to check if you still want to be involved in supporting Feng during the PRs, or if I should find a POC on our side for this. Totally depends on your bandwidth.

Hi @datumbox and @FateScript , I believe Feng will implement a superb version of YOLOX here, and I will contact him offline to see if there is anything I can do to help :)

datumbox commented 1 year ago

@FateScript I just wanted to follow up and see if you faced any blockers with the implementation. Let me know if we can help or if there is a change of plans. Thanks! :)

FateScript commented 1 year ago

Hi @datumbox , I haven't faced any blockers with the implementation. The only bad news for us is that I transferred to a new work group, and my new leader only allows me to spend half a day per week on this job. So it might take me more time than expected.

datumbox commented 1 year ago

@FateScript Thanks for the heads up. No worries at all. You are donating your time and we are grateful for this. Just checking that you are not blocked by something or have abandoned it due to circumstances. Slow and steady wins the race; let me know if you need anything.

senarvi commented 1 year ago

I don't know if this is something that you'd like to consider, but I submitted an implementation of YOLOv3 and YOLOv4 to Lightning Bolts, and later submitted a pull request for features from YOLOv5, Scaled-YOLOv4, and YOLOX.

It's very flexible: you can use networks defined in torch, such as YOLOv5, or networks defined in Darknet configuration files, and you can use different IoU functions from Torchvision and different algorithms (e.g. SimOTA) for matching targets to anchors to construct the different YOLO variants. I haven't checked that I can reproduce the numbers from the papers, though. I think there are so many differences in the details between the implementations that it doesn't make sense to try to implement all of the variants exactly.

Anyway, I submitted the pull request a year ago and it has been accepted by the reviewers, but it still hasn't been merged; the Bolts project seems to have gone pretty inactive. So if you're interested, I'd be happy to work on porting it to Torchvision and perhaps merging it with the code from @FateScript? It's clean code and well documented. You can have a look: https://github.com/groke-technologies/pytorch-lightning-bolts/tree/yolo-update/pl_bolts/models/detection/yolo

FateScript commented 1 year ago

@senarvi Thank you for sharing your clean code with me :) The data providing logic of YOLOX is the main bottleneck here. I'm trying my best to write a clean version for torchvision. However, it would be helpful if you are willing to implement the YOLOX module and contribute it to torchvision.

senarvi commented 1 year ago

@FateScript what exactly do you mean by data providing logic? I would think that all the models in torchvision.models.detection should use the same input format, so that you can use the same data pipeline and easily switch the model.

I'd just like to clarify that in my opinion it doesn't make sense to implement a module that's strictly YOLOX, because every year there's a new and improved YOLO version. I recently started looking into adding features from YOLOv7. I think it's better to have a generic YOLO module and reusable components that can be used to train a YOLOX model, but also used in the future with new YOLO versions, and ideally also with other model families. The most important components are the loss calculation, matching targets to anchors, and the network architectures. Torchvision already supports all the different IoU functions, so we should reuse those in the loss calculation. The network backbones could also be reused between other Torchvision models, although I think the existing backbones such as ResNet currently only return features from the last layer. YOLO uses an FPN/PAN network with multiple detection heads, which needs features from three or four backbone layers.
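
As a concrete illustration of that last point, torchvision's FX-based feature extraction utility can already pull multi-level (C3/C4/C5) feature maps out of a classification backbone for an FPN/PAN-style neck. A minimal sketch with ResNet-50, where the return_nodes names are the standard torchvision ResNet stage modules:

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Expose stride-8/16/32 feature maps so an FPN/PAN neck can consume them.
backbone = resnet50(weights=None)
extractor = create_feature_extractor(
    backbone, return_nodes={"layer2": "c3", "layer3": "c4", "layer4": "c5"}
)
feats = extractor(torch.rand(1, 3, 640, 640))
for name, f in feats.items():
    print(name, tuple(f.shape))  # c3: 80x80, c4: 40x40, c5: 20x20
```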

If you agree that that's the direction we want to be heading in, then I can create a pull request for you to look at and comment on.

FateScript commented 1 year ago

what exactly do you mean by data providing logic?

@senarvi Data providing logic here means data-related code such as data augmentation, dataset caching, and so on.

I'd just like to clarify that in my opinion it doesn't make sense to implement a module that's strictly YOLOX, because every year there's a new and improved YOLO version.

So it seems that writing a YOLOX model for torchvision is meaningless, but other code, like the data augmentation, is useful for torchvision?

senarvi commented 1 year ago

what exactly do you mean by data providing logic?

@senarvi Data providing logic here means data-related code such as data augmentation, dataset caching, and so on.

I'd just like to clarify that in my opinion it doesn't make sense to implement a module that's strictly YOLOX, because every year there's a new and improved YOLO version.

So it seems that writing a YOLOX model for torchvision is meaningless, but other code, like the data augmentation, is useful for torchvision?

@FateScript Definitely not meaningless. Maybe I explained myself poorly. I just mean that I can see two approaches:

From a benchmarking perspective it can be useful to have a model that's identical to a standard YOLO implementation such as YOLOX. Then you also need identical data augmentations etc. The downside is that if you want to add other YOLO versions in the future, it will be more difficult to reuse the components (if you want every version to be 1-1 identical to the original code).

Personally, I would find it more useful to have a generic YOLO class where it's easy to reuse features from different YOLO versions, because as much as possible is abstracted into separate classes. It would also be nice to have augmentations such as mosaic, but in my opinion those can be implemented separately. Most importantly, the augmentations should be reusable in different models and not something YOLO-specific. By the way, I'm not any kind of authority here. :) I guess which way to go is a matter of the "philosophy" of the Torchvision project.

FateScript commented 1 year ago

@senarvi Yes, you are right; I misunderstood your meaning here. I have also found it hard to write generic data-related code (using fork and data caching to train a YOLOX model is not suitable for all models). However, I will try my best to solve this issue. Thanks for your code and kind advice :)

senarvi commented 1 year ago

I added a YOLOv7 architecture in the Bolts pull request. So now it supports the YOLO variants listed by @zhiqwang in the initial post.

The biggest new architectural change was Deep Supervision, i.e. auxiliary detection heads. It required some thinking, because the DetectionLayer class represents a single layer (not all detection layers of the model), but in the end it was quite straightforward to implement. This approach is also more modular and necessary if we want to keep supporting Darknet configurations too. Now the same class is used for calculating the auxiliary loss. The only thing I'm not completely happy about is that the forward() method needs to return both the normalized probabilities for the user and the unnormalized probabilities for loss calculation.
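
To illustrate that last point, here is a hypothetical sketch (not the actual Bolts DetectionLayer) of a head whose forward() returns both forms, raw logits for the loss and sigmoid-normalized scores for the user:

```python
import torch
from torch import nn

class DetectionHeadSketch(nn.Module):
    # Hypothetical head: predicts (x, y, w, h, objectness, classes) per anchor.
    def __init__(self, in_channels, num_classes, anchors_per_cell=1):
        super().__init__()
        self.pred = nn.Conv2d(in_channels, anchors_per_cell * (5 + num_classes), 1)

    def forward(self, x):
        logits = self.pred(x)       # unnormalized, consumed by the loss
        scores = logits.sigmoid()   # normalized, returned to the user
        return scores, logits
```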

I've tried to make the components reusable so that they can be easily used for building new models. For example, instead of a huge monolithic function that expects some complex data structures and supports only one algorithm for matching the predictions to the targets, there are these generic functions that take the predictions and targets.

I got the details of the SimOTA algorithm from the YOLOX code, and they have changed considerably in YOLOv7:

@FateScript what do you think about using this codebase as the basis for the YOLO model? It shouldn't be too difficult to fit this into the Torchvision framework. If you think it might be a good idea, I can try to rewrite the model class for Torchvision. If there are features we're missing in the data processing or in the SimOTA algorithm, or other details that are not correct, and you have time, you could help there. Do you think this code architecture would be suitable?

FateScript commented 1 year ago

@senarvi Thanks for your code :) IMO, your code architecture is suitable here. If any help is needed, please feel free to contact me. I've been busy with language modeling and YOLOX v2 recently, so it would be appreciated if you could contribute the YOLO model to the torchvision codebase.

senarvi commented 1 year ago

I've prepared the contribution. I just need official approval from my employer.

senarvi commented 1 year ago

Just a quick update that some of my managers are not at the office at the moment. Hopefully I will get the approval next week.

senarvi commented 1 year ago

I got permission from my employer, Groke Technologies, to contribute the YOLO model, and created this pull request: https://github.com/pytorch/vision/pull/7496

The model is quite well tried and tested, but I would appreciate help with some things related to the Torchvision integration. For example, I'm not sure if the unit tests work, and I don't know how the pretrained weights are created.

oke-aditya commented 1 year ago

Amazing! This is really great

senarvi commented 1 year ago

It would be great to get some feedback, especially on the model factory functions. According to this issue, I should add a factory function for each model variant. There are dozens of YOLO variants: there have been something like ten notable YOLO versions, and each has had several variants (s, m, l, x, nano, tiny, etc.). As discussed above, faithfully implementing all of them is not feasible. We should decide whether we want to add as many variants as possible, or just the most important ones (and if so, which ones). Before adding more variants, I'd also like to know if I've understood correctly what is wanted. I created an example for yolov4.
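
For illustration, here is a hypothetical sketch of the per-variant builder pattern used across torchvision.models; the YOLOStub/YOLOV4NetworkStub names are stand-ins for the classes in the PR, not its actual API:

```python
from typing import Any, Optional
from torch import nn

class YOLOV4NetworkStub(nn.Module):
    # Stand-in for a variant-specific network class.
    def __init__(self, num_classes: int = 91):
        super().__init__()
        self.num_classes = num_classes

class YOLOStub(nn.Module):
    # Stand-in for the generic YOLO wrapper.
    def __init__(self, network: nn.Module):
        super().__init__()
        self.network = network

def yolov4(*, weights: Optional[Any] = None, progress: bool = True, **kwargs: Any):
    # Hypothetical per-variant factory mirroring torchvision's builder style:
    # build the variant network, wrap it, optionally load pretrained weights.
    model = YOLOStub(YOLOV4NetworkStub(**kwargs))
    if weights is not None:
        model.load_state_dict(weights.get_state_dict(progress=progress))
    return model
```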

I understood that I have to train weights for the variants. I train the model on the COCO dataset using references/detection/train.py; then how do I submit the weights? Do I have to train weights for all the variants that I add? If so, I cannot add many variants.

Construction of the network is a bit different from the other detection models, because YOLO adds several detection layers at different levels of the network. The backbone is the network only up to the FPN (or the extension of the FPN called PAN). The detection layers are placed within the PAN. Take a look at the YOLOV7Network class, for example. It's not really possible to separate the FPN/PAN from the detection layers so that we could have the FPN/PAN as part of the backbone. We can switch the backbone (up to the FPN), though, if the backbone provides outputs from different levels, like here.

Also, I use the validate_batch() method to validate that the input is in the correct format. I wonder if we should use the same function to validate the input in all detection models; then we would know that all detection models use the same data format. That said, this pull request might blow up if we start to include too much of this kind of refactoring in it.
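
For concreteness, here is a sketch of the kind of checks I mean; the actual validate_batch() in the PR may differ, this version just validates the standard torchvision detection input format:

```python
import torch

def validate_batch(images, targets):
    # Check the standard torchvision detection input format: a list of CHW
    # image tensors and a list of dicts with "boxes" ([N, 4], xyxy) and
    # "labels" ([N]) entries.
    if len(images) != len(targets):
        raise ValueError("Expected one target dict per image.")
    for img, tgt in zip(images, targets):
        if not (isinstance(img, torch.Tensor) and img.dim() == 3):
            raise ValueError("Each image should be a CHW tensor.")
        boxes, labels = tgt["boxes"], tgt["labels"]
        if boxes.ndim != 2 or boxes.shape[-1] != 4:
            raise ValueError("targets[i]['boxes'] should be an [N, 4] tensor.")
        if (boxes[:, 2:] < boxes[:, :2]).any():
            raise ValueError("Boxes should satisfy x2 >= x1 and y2 >= y1.")
        if labels.ndim != 1 or labels.shape[0] != boxes.shape[0]:
            raise ValueError("targets[i]['labels'] should be an [N] tensor.")
```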