huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

UNETR: Transformers for 3D Medical Image Segmentation #17309

Open pri1311 opened 2 years ago

pri1311 commented 2 years ago

Model description

I would like to add a new model:

Proposed in the paper: UNETR: Transformers for 3D Medical Image Segmentation

UNEt TRansformers (UNETR) utilize a transformer as the encoder to learn sequence representations of the input volume and effectively capture the global multi-scale information, while also following the successful "U-shaped" network design for the encoder and decoder. The transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output.
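To make the patch-based encoding concrete, here is a rough numpy sketch of how a 3D input volume becomes a token sequence for the transformer encoder. The shapes assume a single-channel 96×96×96 volume split into 16×16×16 patches, the configuration used in the paper; this is an illustration, not the MONAI implementation.

```python
import numpy as np

# Assumed configuration: single-channel 96^3 volume, 16^3 patches.
H = W = D = 96
P = 16          # patch size
C = 1           # input channels
volume = np.zeros((C, H, W, D), dtype=np.float32)

# Split the volume into non-overlapping 16x16x16 blocks, then flatten
# each block into one "token" for the transformer encoder.
patches = volume.reshape(C, H // P, P, W // P, P, D // P, P)
patches = patches.transpose(1, 3, 5, 0, 2, 4, 6).reshape(-1, C * P**3)

print(patches.shape)  # (216, 4096): 6*6*6 tokens of dimension 16^3
```

Each of the 216 tokens is then linearly projected to the transformer's hidden size; intermediate encoder states are what the decoder taps via skip connections.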

Open source status

Provide useful links for the implementation

Model Implementation: https://github.com/Project-MONAI/research-contributions/tree/master/UNETR

Pretrained Model: https://drive.google.com/file/d/1kR5QuRAuooYcTNLMnMj80Z9IgSs8jtLO/view?usp=sharing (Based on BTCV dataset)

Puranjay-del-Mishra commented 2 years ago

Hello. What is the status of the implementation? I would like to contribute to it.

LysandreJik commented 2 years ago

Hey @Puranjay-del-Mishra, to the best of my knowledge nobody has started working on it. We'd be very happy for you to take a stab at adding it!

You can follow the tutorial here: adding a new model.

We especially recommend following the add-new-model-like command and guide.

If you have not contributed to transformers yet, we also recommend reading the contributing guide.

Puranjay-del-Mishra commented 2 years ago

Sure! @LysandreJik I'll go through it and give it a shot. Thanks.

pri1311 commented 2 years ago

Hey @Puranjay-del-Mishra @LysandreJik I was supposed to submit a PR last week but I came down with health problems. I will be sending a PR by the weekend.

Puranjay-del-Mishra commented 2 years ago

Hey @pri1311 , go ahead with the PR. All the best.

Wernstrong67 commented 2 years ago

I'm gonna try this out. Appreciate it.

arv-77 commented 1 year ago

Hi @NielsRogge, Can I have a shot at implementing this model?

NielsRogge commented 1 year ago

Yes, sure! Do you need some help?

arv-77 commented 1 year ago

Thanks! I'll get back to you if I have queries

caleb-vicente commented 1 year ago

Hello @NielsRogge. I have been following the steps in the guide https://huggingface.co/docs/transformers/add_new_model and have completed everything prior to opening a PR. At this point I have a fork of the whole transformers project on my GitHub account, and I have created my draft implementation by copying ViT using the "transformers-cli add-new-model-like" command. After that, I created a draft pull request from my dev fork branch to my main fork branch and tried to add you as a reviewer, but it was not possible. Am I missing a step? Should the pull request be opened directly from my dev fork branch against a branch in the upstream repository?

Attaching snapshot of the problem: error_adding_reviewers

caleb-vicente commented 1 year ago

Hi @NielsRogge and @LysandreJik,

I have been working on this task for the last few weeks and my code now runs the forward pass properly. I am currently implementing the tokenizer, but I have a doubt: the original repository defines many functions to transform input images. Can I include this function/library as a requirement for the Hugging Face tokenizer, or must they be implemented from scratch?

Many thanks

NielsRogge commented 1 year ago

Hi,

UNETR is a vision model so it probably doesn't require a tokenizer? You probably want to create an image processor for this model, is that right?

In that case, image processors should be implemented to support minimal inference. They should perform the exact same transformations as the original implementation to prepare data for the model for inference. For computer vision models, this typically involves resizing to a particular size + normalization.

An example of an image processor can be found here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/image_processing_vit.py
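As an illustration of that "minimal inference" contract, here is a dependency-free numpy sketch of the two typical steps, resize to a fixed size and normalize. The nearest-neighbour resize and the mean/std values are illustrative only; the real ViT image processor uses proper interpolation and the model's own normalization constants.

```python
import numpy as np

def preprocess(image, size=(224, 224), mean=0.5, std=0.5):
    """Minimal sketch of inference preprocessing: resize to a fixed
    size, then normalize. Nearest-neighbour resize keeps the example
    dependency-free; real processors use PIL/torch interpolation."""
    h, w = image.shape[:2]
    rows = np.arange(size[0]) * h // size[0]   # source row per target row
    cols = np.arange(size[1]) * w // size[1]   # source col per target col
    resized = image[rows][:, cols]
    return (resized.astype(np.float32) / 255.0 - mean) / std

out = preprocess(np.zeros((480, 640), dtype=np.uint8))
print(out.shape, out.min())  # (224, 224) -1.0
```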

caleb-vicente commented 1 year ago

Thank you for the answer @NielsRogge.

When I was talking about the tokenizer I did in fact mean the image processor. Looking at how the original repository implements the model, they use transformations that are not implemented in the Hugging Face library. These transformations normalize, filter, and resize the 3D image in particular ways, with a slightly complex hierarchy of functions that cannot be reproduced with the current functions in "image_processing_utils.py".

As far as I can see, there are three options for implementing this part in the Hugging Face code:

Which is the recommended option?

NielsRogge commented 1 year ago

Thanks for the nice suggestions! I'll ping @amyeroberts for this, as she's currently working on refactoring our image processing pipelines.

caleb-vicente commented 1 year ago

Thank you Niels.

Please let me know when you have some info. Meanwhile I'll be working on refactoring the UNETR decoder, since the forward pass currently uses a dependency on the monai project (the original project) as well.

NielsRogge commented 1 year ago

Discussed this offline with @amyeroberts, here's what she responded:

I’d use the third party for now (with the usual xxx_is_available checks) and wrap it inside the image processor, e.g.:

    import thirdparty

    class MyImageProcessor:
        def transform_1(self, image, *args, **kwargs):
            image = thirdparty.transform_1(image, *args, **kwargs)
            ...

so that we can remove it easily if need be. Looking at the MONAI library: Torch is required. This is fine for implementing the first model, but shouldn’t be necessary for our TF model users. If the model turns out to be popular it would be good to remove this dependency so we can port easily. Most of the transforms listed are compositions of standard logic we already have, e.g. CropForeground would only require us to implement the logic for calculating the bounding box.

amyeroberts commented 1 year ago

@caleb-vicente Thanks for all your work so far adding this model ❤️

Adding to Niels comment above:

Regarding your suggestions, option 1 is the one I would go for: importing specific functionality from the MONAI project. I completely agree we don't want to reinvent the wheel! We already use third party packages for certain processing e.g. pytesseract for the LayoutLM models. Like the LayoutLM models, we can add MONAI as an optional dependency.

Regarding the transforms in the screenshot above, one thing to consider is that image processors don't perform augmentation; they are responsible for transforming the data so that it can be fed into the model, i.e. the UnetrImageProcessor shouldn't include random operations like RandFlipd.

In the snippet:

class MyImageProcessor:
    def transform_1(self, image, *args, **kwargs):
        image = thirdparty.transform_1(image, *args, **kwargs)
        ...

there's also the consideration of input types. All of the current functions take in and return numpy arrays, and it should be possible to disable any of the transforms, e.g. do_resize=False. As far as I can tell, MONAI accepts both torch and numpy inputs but always returns torch tensors. This is OK for a first implementation, before the torch dependency is removed, as long as the ability to disable any of the transforms still applies.
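Those constraints (numpy in, numpy out, every step individually disableable) can be sketched as follows. The function name and flags are illustrative, mirroring the do_resize convention, not copied from a final UNETR processor; the nearest-neighbour resize stands in for a real interpolation.

```python
import numpy as np

def preprocess(image, do_resize=True, size=(96, 96),
               do_normalize=True, mean=0.0, std=1.0):
    """Sketch of the numpy-in/numpy-out contract: each step can be
    disabled, and the return type is always np.ndarray (a torch-based
    third-party transform would need np.asarray() on its output)."""
    image = np.asarray(image, dtype=np.float32)
    if do_resize:
        h, w = image.shape[:2]
        rows = np.arange(size[0]) * h // size[0]
        cols = np.arange(size[1]) * w // size[1]
        image = image[rows][:, cols]     # nearest-neighbour stand-in
    if do_normalize:
        image = (image - mean) / std
    return image

# Disabling a step skips it entirely:
out = preprocess(np.ones((128, 128)), do_resize=False, do_normalize=False)
print(out.shape)  # (128, 128)
```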

Let me know if there's any other questions you have regarding this :)

caleb-vicente commented 1 year ago

Hello @NielsRogge and @amyeroberts,

Thank you so much for the answers. Please find a few comments below:

I will keep you updated on the progress and any doubts :)