huggingface / optimum

🚀 Accelerate training and inference of 🤗 Transformers and 🤗 Diffusers with easy to use hardware optimization tools
https://huggingface.co/docs/optimum/main/
Apache License 2.0

Segformer Support #382

Closed HugeBob closed 1 year ago

HugeBob commented 1 year ago

Feature request

Would love for Optimum to add support for transformers.SegformerForSemanticSegmentation

https://huggingface.co/docs/transformers/model_doc/segformer#transformers.SegformerForSemanticSegmentation

As best I can tell, semantic segmentation is not something that Optimum currently supports for any model (https://huggingface.co/docs/optimum/main/en/pipelines); I would love for this to be improved!

Motivation

I use HuggingFace's Segformer for an image segmentation model I have and would love to improve my inference speeds.

Your contribution

I don't know what a PR is so I kind of doubt it.

michaelbenayoun commented 1 year ago

Hi @HugeBob, so if I understand correctly you would love to use a semantic-segmentation pipeline. It seems that this is not currently supported by transformers, so we will not support it on our end until it is supported there.

TheoMrc commented 1 year ago

Hi @michaelbenayoun,

I had the same thing in mind as Bob. From my understanding, some of the architectures in the transformers package allow for semantic segmentation, such as transformers.SegformerForSemanticSegmentation.

For example, NVIDIA's SegFormers (https://huggingface.co/nvidia/mit-b0) are apparently based on "a hierarchical Transformer encoder and a lightweight all-MLP decode head".

So those transformer-based models, even though they are implemented in transformers, cannot be optimized with Optimum?

Thank you,

Theo

michaelbenayoun commented 1 year ago

Hi @TheoMrc. Yes, they can; we just need to support the ONNX export of those models. We do support the Segformer export, so you will be able to export, optimize, and quantize a Segformer model with ONNX Runtime.

For pipelines, it might not be usable because it was not available in transformers last time I checked.

What we can do on our end is to add support for an ORTModelForImageSegmentation. I will do it soon.

TheoMrc commented 1 year ago

Hi again,

Thanks for your answer. If you don't mind, I could use a few points of clarification to better understand how things might turn out.

After some very interesting reading in the various documentations, I'm guessing from your answer that: 1) In order to optimize inference speed, I should consider converting my SegFormer PyTorch models to ONNX Runtime (ORT) models, and then applying ORTOptimizer and ORTQuantizer from optimum.onnxruntime. Links for mortals like me: Transformers export to ONNX Runtime; Optimum tutorial

2) Once optimized and quantized in the .onnx format, I should theoretically be able to load and run inference with my model in my Python app through ORT's Python API, with some kind of session-based syntax (rough sketch after the BetterTransformer example below) - Python API ORT tutorial

3) Optimum is (among other things) a Python wrapper around ORT that lets mortals like me conveniently benefit from ORT through Hugging Face's user-friendly syntax. That is what you plan to implement as ORTModelForImageSegmentation.

4) If my previous points are kind of accurate, the fact that SegFormers use transformer encoding layers might make them candidates for further optimization through BetterTransformer, after which I should convert to ONNX, optimize, and quantize. See the BetterTransformer example below. Or maybe the "BetterTransformer" stuff only applies to torch-based inference and is not about the layer structure?

BetterTransformer example from Hugging Face

from transformers import AutoModelForSequenceClassification
from optimum.bettertransformer import BetterTransformer

# load a regular transformers model, then convert it to its BetterTransformer version
model_hf = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
model = BetterTransformer.transform(model_hf, keep_original_model=True)
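
Just to check I understand point 2, the "session-based syntax" I have in mind is roughly the following (a minimal sketch; "model.onnx" is a placeholder path and the dummy input stands in for whatever numpy array the feature extractor returns):

import numpy as np
import onnxruntime as ort

# dummy input just to illustrate the call; a real app would use the feature extractor output
pixel_values = np.zeros((1, 3, 512, 512), dtype=np.float32)

# load the exported model and run it directly through an ONNX Runtime inference session
session = ort.InferenceSession("model.onnx")
logits = session.run(None, {"pixel_values": pixel_values})[0]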

Anyway thanks a lot for your time, I'm starting to feel like I should do an internship at huggingface to learn more about how these things work after my PhD !

See you around,

Theo

michaelbenayoun commented 1 year ago

To answer each of your points:

  1. Yes, we also support the export in optimum now, and it is the recommended way; check here.
  2. Yes, in theory, but as you mention in point 3, we save you this pain by implementing wrappers hiding this logic.
  3. Yes
  4. So it is not possible to mix both for now, since the PyTorch kernel for BetterTransformer is not supported by ONNX Runtime. That being said, in the general case I would suggest trying both BetterTransformer in PyTorch and ONNX Runtime, and seeing which gives the best latency. In your case, Segformer cannot use BetterTransformer because it has some custom way of computing the FFNs.

Maybe! In any case do not hesitate if you have any questions, or want to contribute!

TheoMrc commented 1 year ago

Thanks once again for your answer.

Just a quick follow-up below. 1) After late Sunday night investigations of Hugging Face's transformers tutorials, I managed to export my local Segformer model to .onnx. Code below:

import transformers
from transformers import AutoModelForSemanticSegmentation, SegformerFeatureExtractor
from transformers.onnx import FeaturesManager

model = AutoModelForSemanticSegmentation.from_pretrained(model_path)
feature_extractor = SegformerFeatureExtractor()

# check that the model/feature pair is supported and build its ONNX config
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model, feature="semantic-segmentation")
onnx_config = model_onnx_config(model.config)

# export to ONNX (target_path is a pathlib.Path pointing to the output .onnx file)
onnx_inputs, onnx_outputs = transformers.onnx.export(preprocessor=feature_extractor,
                                                     model=model,
                                                     config=onnx_config,
                                                     opset=13,
                                                     output=target_path)

2) Turns out that optimization is not supported for Segformer. Code below:

from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

optimizer = ORTOptimizer.from_pretrained(onnx_model_path)
optimization_config = OptimizationConfig(optimization_level=99)

>>> KeyError: 'segformer model type is not supported yet. Only albert, bart, [...]

Although I don't mind, since I have no idea what it does 😎. It sounded nice though, since it does not impact model outputs but appears to halve latency in some cases (Optimum tutorial).

3) On the other hand, quantization worked:

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

quantizer = ORTQuantizer.from_pretrained(torch_model_path)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
quantizer.quantize(save_dir=quantized_onnx_path, quantization_config=qconfig)

(This one I did read a bit of the theory about.)

4) I took some inspiration from the transformers and optimum source code for the Pipeline and ORTModel classes, and managed to grasp how they work. Turns out SegFormer works just fine with pipeline().

From this, I built my own custom Pipeline class with strategic outputs based on my own application (I basically want to output the segmentation map i.e. the argmax of all logits).

from torch import nn
from transformers import (
    AutoModelForSemanticSegmentation,
    ImageSegmentationPipeline,
    SegformerFeatureExtractor,
    pipeline,
)

class CustomImageSegmentationPipeline(ImageSegmentationPipeline):
    def postprocess(self, model_outputs):
        logits = model_outputs.logits
        # upsample the logits back to the original image size
        logits = nn.functional.interpolate(
            logits,
            size=model_outputs.target_size[0],  # (height, width)
            mode='bilinear',
            align_corners=False
        )

        # segmentation map = per-pixel argmax over the class dimension
        segmentation_map = logits.argmax(dim=1)[0]
        return segmentation_map

# Creating instances
auto_model = AutoModelForSemanticSegmentation.from_pretrained(torch_model_path)
feature_extractor = SegformerFeatureExtractor()
hf_pipe = pipeline("image-segmentation", model=auto_model, feature_extractor=feature_extractor)
custom_pipe = CustomImageSegmentationPipeline(model=auto_model, feature_extractor=feature_extractor)

Everything worked perfectly for torch models (tested only on CPU).

inputs = feature_extractor(pil_image, return_tensors="pt")
print('Duration of the prediction with torch model:')
%timeit auto_model(**inputs)
>>> Duration of the prediction with torch model:
>>> 2.82 s ± 73.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

5) Using the ORTModel class from optimum, I loaded my ONNX models and performed predictions, which were around twice as fast as torch inference:

# ORTModel is optimum's generic ONNX wrapper, exposing the underlying InferenceSession as .session
onnx_model = ORTModel(onnx_path)
quantized_model = ORTModel(quantized_path)
onnx_inputs = feature_extractor(pil_image, return_tensors="np")

print('\nDuration of the prediction with onnx model:')
%timeit onnx_model.session.run(None, input_feed=onnx_inputs)
print('\nDuration of the prediction with quantized onnx model:')
%timeit quantized_model.session.run(None, input_feed=onnx_inputs)

>>> Duration of the prediction with onnx model:
>>> 1.68 s ± 113 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> Duration of the prediction with quantized onnx model:
>>> 1.45 s ± 50.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

However, I could not manage to implement a custom transformers.Pipeline, because several class attributes that Pipeline.__init__ and Pipeline._sanitize_parameters rely on (ORTModel.config, for example) are not implemented in ORTModel.

For now, I just built custom "pipeline functions" that work on CPU only, but avoid unnecessary work performed by the existing Pipeline classes, doing only what is necessary for my goal:

import torch
from torch import nn

def custom_onnx_workflow(image, onnx_model):
    # preprocess the PIL image into numpy pixel values
    inputs = feature_extractor(image, return_tensors="np")
    onnx_inputs = {'pixel_values': inputs['pixel_values']}
    # run the ONNX Runtime session directly
    outputs = onnx_model.session.run(None, input_feed=onnx_inputs)
    # upsample the logits back to the original image size
    upsampled_logits = nn.functional.interpolate(
        torch.from_numpy(outputs[0]),
        size=image.size[::-1],  # (height, width)
        mode='bilinear',
        align_corners=False
    )
    segmentation_map = upsampled_logits.argmax(dim=1)[0]
    return segmentation_map

Next step for me is to enable GPU support, which I am sure I will find out how to do in the optimum ORTModel source code, for example in the ORTModelForImageClassification source code.
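
I'm guessing it will mostly boil down to creating the session with the CUDA execution provider, something like this (just a sketch, assuming onnxruntime-gpu is installed; onnx_path is the same file as above):

import onnxruntime as ort

# ask for the CUDA execution provider first and fall back to CPU if it is unavailable
session = ort.InferenceSession(
    str(onnx_path),
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)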

I'd love to try to actually implement it and do a PR for ORTModelForSemanticSegmentation, which would be supported in pipelines. I'm guessing the forward method will have to return a SemanticSegmenterOutput instead of an ImageClassifierOutput.

Apart from this, I think everything will actually be almost exactly the same as ORTModelForImageClassification.
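
To make my guess concrete, here is the kind of forward I have in mind, heavily inspired by what I read in ORTModelForImageClassification (the class and attribute names below are just my placeholders, not the actual Optimum API):

import torch
from transformers.modeling_outputs import SemanticSegmenterOutput

class ORTSemanticSegmentationSketch:
    """Hypothetical sketch wrapping an onnxruntime.InferenceSession (not the Optimum API)."""

    def __init__(self, session):
        self.session = session

    def forward(self, pixel_values, **kwargs):
        # feed the torch tensor to the ORT session
        onnx_inputs = {"pixel_values": pixel_values.cpu().detach().numpy()}
        outputs = self.session.run(None, onnx_inputs)
        logits = torch.from_numpy(outputs[0])
        # the main difference vs. image classification: wrap the logits in a
        # SemanticSegmenterOutput instead of an ImageClassifierOutput
        return SemanticSegmenterOutput(logits=logits)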

Have you already started writing this class? If not, any other obvious advice?

Thanks for your time,

Theo

michaelbenayoun commented 1 year ago

Hi @TheoMrc,

First, thank you for your feedback, it is very valuable!

About your questions:

  1. You can also convert it using optimum.exporters.onnx. It is the suggested way since it is more up to date. The API is mostly similar, and you can do it via the command line:
python -m optimum.exporters.onnx --model model_name --task semantic-segmentation segformer_onnx
  2. Basically, among other things, the ORTOptimizer will look for patterns and try to fuse operations together (such as attention), and we support common patterns (BERT, GPT-2, etc.). I will need to check if Segformer can be supported.

  3. That is nice. It would be interesting to check which operators end up quantized; I feel like the speed-up compared to the non-quantized ONNX model is small, and you can get more. Performing graph optimization beforehand would also help.

  4. ORTModel does have a config attribute, although it might not always be set. I am currently working on improving and cleaning the ORTModelForXXX classes to avoid such cases and make the API easier. But you're right, adding an ORTModelForImageSegmentation is the first step. Writing such a class would most likely consist of preparing the IO binding (cc @JingyaHuang) and returning the proper output class as you mentioned. For this, ORTModelForImageClassification is a great example to follow. I have not started working on it, by the way.

You can open a PR and I can help you there, what do you think?

JingyaHuang commented 1 year ago

Hi @TheoMrc,

Just to expand on the second point of @michaelbenayoun: as Segformer is based on a transformer encoder architecture, we can apply a BERT-like optimization by registering Segformer in ORTManager.

(But there is a caveat: Segformer's encoder blocks have different hidden_size values, and I am not sure whether this has been taken into consideration in ONNX Runtime (although ORT supports automatic shape inference to get hidden_size and num_head from Reshape nodes), so it's better to check.)

And if you are interested in contributing the ORTModelForImageSegmentation class, please feel free to tag me.

JingyaHuang commented 1 year ago

With a quick test, the automatic detection of hidden_size and num_head works. I got some fused nodes (level=99) like the following: [screenshots of the fused nodes omitted]

I am thinking of letting BERT-like models infer hidden_size and num_head themselves instead of reading them from the config in these cases (various hidden_size values per block). WDYT? @michaelbenayoun

Ref: https://github.com/microsoft/onnxruntime/blob/441b30b2d26d36ca1db2930ade2fe82622ce0cd4/onnxruntime/python/tools/transformers/onnx_model_bert.py#L47
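
For reference, a test along these lines can be run with ONNX Runtime's own optimizer tooling rather than the Optimum API (just a sketch; the paths are placeholders, and num_heads=0 / hidden_size=0 ask the BERT optimizer to infer them from the graph):

from onnxruntime.transformers import optimizer

# run the BERT-style graph fusions directly with ONNX Runtime's tooling;
# num_heads=0 / hidden_size=0 trigger automatic detection from the graph
optimized_model = optimizer.optimize_model(
    "segformer_onnx/model.onnx",
    model_type="bert",
    num_heads=0,
    hidden_size=0,
)
optimized_model.save_model_to_file("segformer_onnx/model_optimized.onnx")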

TheoMrc commented 1 year ago

Hi @michaelbenayoun and @JingyaHuang,

Thanks to you both for your answers; as a "novice" in the field, I personally find it extremely useful to speak with you. I started ML in Python in a "from scratch" manner with TensorFlow, and then torch, for which I had to grasp more of the theory, create new loss functions, ... Hugging Face is very nice but hides most of the complicated stuff, which is very handy for getting working prototypes but surely makes it easy to ignore how these things work. I definitely plan to learn and understand everything :)

Michael:

You can open a PR and I can help you there, what do you think?

I will clone the optimum repo and open a PR once I have a first (hopefully working) version of the ORTModelForImageSegmentation, and tag you both for review!

Before my previous answer, following your advice, I had first tried this from the command line with optimum:

python -m optimum.exporters.onnx --model model_name --task semantic-segmentation segformer_onnx

But it initially failed because I passed the path to pytorch_model.bin as model_name instead of the parent directory (actually, it might also be because I did not specify a task). I then went down a level and managed the conversion through transformers.onnx (cf. my previous answer). Anyway, it worked out just fine once I tried your command with the right inputs, thanks for the tip!

Michael:

  2. Basically, among other things, the ORTOptimizer will look for patterns and try to fuse operations together (such as attention), and we support common patterns (BERT, GPT-2, etc.). I will need to check if Segformer can be supported.

Jingya:

Just to expand on the second point of @michaelbenayoun: as Segformer is based on a transformer encoder architecture, we can apply a BERT-like optimization by registering Segformer in ORTManager.

Being a computer vision guy (and a biologist), I only use Segformers from Hugging Face.

I'd obviously enjoy any performance gain from Segformer optimization support! Once again, I'd love to contribute in order to improve my understanding of what's going on behind the nice Hugging Face syntax.

Michael:

  3. That is nice. It would be interesting to check which operators end up quantized; I feel like the speed-up compared to the non-quantized ONNX model is small, and you can get more. [...]

To be noted, my quantized_model.onnx file (117 MB) is half the size of the original model.onnx (246 MB). Not sure how relevant this is. I'm guessing that around half the weights were converted from float32 to int8; 32 bits / 8 bits = 4, so the theoretical maximum size reduction would be about 4x. (Probably oversimplifying though, the model is obviously not just a bunch of weights.)

I don't know how to check what was quantized, maybe you could redirect me to some documentation?
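
One rough way I could try, sketched below with the onnx package and my local file names, would be to count the node types in the two graphs (quantized operators typically show up as QLinear*/MatMulInteger/DynamicQuantizeLinear nodes), but I'm not sure it's the proper approach:

import onnx
from collections import Counter

# count operator types in the original and quantized graphs to see what changed
for path in ["model.onnx", "model_quantized.onnx"]:
    graph = onnx.load(path).graph
    counts = Counter(node.op_type for node in graph.node)
    print(path, counts.most_common(15))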

Also, I tested inference on my old-ish laptop CPU, which tends to overheat, so latency is quite variable. I'll test inference on my main machine and come back with more reliable latency data.

Thanks again for your time, see you soon after my PR

fxmarty commented 1 year ago

Fixed in https://github.com/huggingface/optimum/pull/539