onnx / steering-committee

Notes and artifacts from the ONNX steering committee

Neural Compressor proposal - port the repo under Intel to the ONNX organization #52

Closed liqunfu closed 1 year ago

liqunfu commented 2 years ago

ONNX Model Compressor

Quantization Tool Proposal

Intel Neural Compressor (INC) is a tool for generating optimized ONNX models. It supports techniques such as post-training quantization (PTQ) and quantization-aware training (QAT), and it can also be used for distillation and pruning to generate sparse, quantized ONNX models. It has broad model coverage (300+ models) spanning key domains such as vision, NLP, and recommendation systems. Since its release, INC has seen high popularity in the ONNX community. It has been integrated into the Hugging Face Optimum pipeline, and it is the tool used to produce the int8 quantized models in the ONNX Model Zoo.
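For illustration, a rough sketch of a post-training quantization flow with INC (assuming INC's 2.x Python API; names such as `PostTrainingQuantConfig` and `quantization.fit` may differ across versions, and the model path and calibration dataloader are placeholders):

```python
# Sketch only: assumes intel/neural-compressor 2.x; API names may differ by version.
from neural_compressor import PostTrainingQuantConfig, quantization

# calib_dataloader is a user-supplied iterable yielding calibration batches;
# it is only used to collect activation ranges for static PTQ.
config = PostTrainingQuantConfig(approach="static")
q_model = quantization.fit(
    model="resnet50.onnx",              # placeholder path to an FP32 ONNX model
    conf=config,
    calib_dataloader=calib_dataloader,  # assumed to be defined by the user
)
q_model.save("resnet50_int8.onnx")      # emits an ONNX model using standard quantized ops
```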

While the ONNX ecosystem is seeing high adoption in industry, there has not been significant community contribution towards ONNX model compression tooling. Intel therefore wants to contribute an open-source project to the ONNX community that can help accelerate deployment of sparse and quantized ONNX models.

Proposal

Migrate Intel Neural Compressor to https://github.com/onnx/neural-compressor

Maintain vendor-neutral branding (Neural Compressor) and welcome community contributions to enhance Neural Compressor with broader HW support.

Questions:

  1. (Question proposed by Tangri, Saurabh from Intel) How would Neural Compressor scale to non-Intel hardware?

  2. (Question proposed by Tangri, Saurabh from Intel) Why not remove support for non-ONNX models in Neural Compressor?

    Answer: (by Tangri, Saurabh from Intel) I feel interoperability has been a strength of the ONNX standard since its inception, and a quantization tool that supports other frameworks should be seen as an expression of that same openness. Yes, we can remove/move the non-ONNX perf data on the landing page so we don’t appear to be promoting non-ONNX frameworks.

Follow-up question: How are model pruning and distillation related to ONNX?

  1. Regarding requirements in Rules for all repos and Requirements for new, contributed repos: Who will be actively maintaining the repo?

  2. “There are some questions raised about the tool, particularly around expansion to non-Intel hardware”. How can the tool be expanded to non-Intel hardware?

  3. (Gary from Microsoft) Some of the Intel code is redundant with some of what we have in microsoft/onnxruntime and microsoft/onnxconverter-common. I think it would be better to collaborate on one set of tools. How will the tool be used by converters and onnxruntime?

Rules for all repos and Requirements for new, contributed repos

Rules for all repos

  1. Must be owned and managed by one of the ONNX SIGs (ArchInfra SIG)

  2. Must be actively maintained (Who will be actively maintaining the repo?)

  3. Must adopt the ONNX Code of Conduct (check)

  4. Must adopt the standard ONNX license(s) (already Apache-2.0 License)

  5. Must adopt the ONNX CLA bot (check)

  6. Must adopt all ONNX automation (like LGTM) (check)

  7. Must have CI or other automation in place for repos containing code to ensure quality (needs CI pipelines with good code coverage)

  8. All OWNERS must be members of standing as defined by ability to vote in Steering Committee elections. (check)

Requirements for new, contributed repos

We are happy to accept contributions as repos under the ONNX organization of new projects that meet the following requirements:

  1. Project is closely related to ONNX ((Question proposed by Tangri, Saurabh from Intel) Why not remove support for non-ONNX models in Neural Compressor?)

  2. Adds value to the ONNX ecosystem (check)

  3. Determined to need a new repo rather than a folder in an existing repo (Is it possible to move into ONNX Optimizer?)

  4. All contributors must have signed the ONNX CLA (check)

  5. Licenses of dependencies must be acceptable (check)

  6. Commitment to maintain the repo (Who will be actively maintaining the repo?)

  7. Approval of the SIG that will own the repo

  8. Approval of the Steering Committee

bkaruman commented 2 years ago

Questions

  1. How would Neural Compressor scale to non-Intel hardware? Neural Compressor produces ONNX-compliant models with standard ONNX ops. Compliant models scale across diverse hardware; however, if a hardware architecture has affinity to a specific technique (e.g. a specific sparsity pattern), the tool should be able to accommodate such platform-specific extensibility knobs. We are open to receiving feedback and welcome community contributions to ensure the abstractions are extensible to different platform architectures.

  2. I'm personally thinking that it may be better to keep the Neural Compressor in ONNX focusing on ONNX only. What do you think please? Interoperability has been a strength of the ONNX standard since its inception, and a quantization tool that supports other frameworks should be seen as an expression of that same openness. Furthermore, allowing non-ONNX formats will encourage collaborators from other AI ecosystems to contribute their state-of-the-art quantization algorithms, bringing additional value to the ONNX ecosystem. We also plan to provide a model export feature to support conversion from non-ONNX formats to ONNX. We can probably remove/move the non-ONNX perf data on the landing page so we don’t appear to be promoting non-ONNX frameworks.

  3. How are model pruning and distillation related to ONNX? Quantization, pruning, and distillation are popular model compression techniques used to produce compressed models that preserve accuracy. Neural Compressor supports these techniques to produce pruned/distilled models using standard ONNX ops.

  4. Some of the Intel code is redundant with some of what we have in microsoft/onnxruntime and microsoft/onnxconverter-common. I think it would be better to collaborate on one set of tools. How will the tool be used by converters and ONNX Runtime? The ONNX Runtime quantization tool is related and offers an excellent opportunity for collaboration (see the sketch after this list). We should be open to such an opportunity and look forward to receiving feedback. That said, converging to a single tool (compressor + converter) would make maintainability overly complex and may not yield sufficient ROI. We would recommend keeping converters separate.

  5. Must be actively maintained (Who will be actively maintaining the repo?) Intel team can commit to active maintenance. We welcome contributors from other vendors.

  6. Is it possible to move Neural Compressor into ONNX Optimizer? Neural Compressor is meant to compress models and has little in common with ONNX Optimizer, which does graph transformations.

  7. Must adopt the ONNX DCO bot. The Intel team is committing to using the DCO bot.

  8. Must adopt all ONNX automation (like LGTM). The Intel team is committing to enabling the LGTM scanning bot.

  9. Must have CI or other automation in place for repos containing code to ensure quality (needs CI pipelines with good code coverage). The Intel team is committing to using CI pipelines similar to other ONNX projects.

  10. All OWNERS must be members of standing as defined by ability to vote in Steering Committee elections. Yes, they are.

  11. Do you plan to move the entire repo from https://github.com/intel/neural-compressor? Intel is committing to move active development to the ONNX repo, while keeping the Intel repo to maintain any non-ONNX support that might be specific to Intel platforms.

  12. Which SIG do you think this would be under? I was assuming arch/infra, but open to others. This has been discussed with both the Infra and Converters SIGs. We are seeking support from the Infra SIG to own the project. Intel will continue managing the code (PRs/RFCs) and issues as the starting maintainer, while we are open to having more maintainers from the community per the contribution guidelines.
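For context on the overlap mentioned in item 4, ONNX Runtime already exposes its own quantization entry points. A minimal sketch, assuming the `onnxruntime.quantization` module and a placeholder model path:

```python
# Sketch of the overlapping ONNX Runtime tooling referenced in item 4.
# Assumes onnxruntime is installed; "model.onnx" is a placeholder path.
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic (weight-only) int8 quantization of an existing ONNX model.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```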

AlexandreEichenberger commented 2 years ago

Link to current project (correct if wrong please) https://github.com/intel/neural-compressor

bkaruman commented 2 years ago

Link to current project (correct if wrong please) https://github.com/intel/neural-compressor

Thank you. That's the right link.

AlexandreEichenberger commented 2 years ago

Additional questions (from the steering committee meeting on July 20th):

bkaruman commented 2 years ago

Additional questions (from the steering committee meeting on July 20th):

  • This tool takes models from TF, PyTorch, and ONNX and emits TF, PyTorch, and ONNX models. Does it have the ability to do the cross product, namely take a model in one format and emit it in a different format? We don't see a strong need for model conversion across frameworks, but we are open to further discussion.

  • What support does the tool provide for different architectures, and does the Intel team plan to provide the framework needed to support different architectures while promoting reuse of the parts that may be common to some architectures? The tool has a modular design that supports different architectures with shared code; this flexibility would also benefit other architectures (a rough illustration of such an adaptor layer follows below).
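To illustrate what such a modular design implies (purely a hypothetical sketch, not INC's actual code; INC's real adaptor layer lives under `neural_compressor/adaptor/`), backend-specific quantization logic can sit behind a small shared interface so that common driver code is reused across architectures:

```python
# Hypothetical sketch of an adaptor registry; not taken from INC's code base.
from typing import Callable, Dict, Type

ADAPTORS: Dict[str, Type] = {}

def register_adaptor(name: str) -> Callable:
    """Register a backend-specific quantization adaptor under a string key."""
    def decorator(cls):
        ADAPTORS[name] = cls
        return cls
    return decorator

@register_adaptor("onnxrt_qdq")
class ONNXRuntimeQDQAdaptor:
    def quantize(self, model, calib_data):
        # Backend-specific logic: e.g. insert QuantizeLinear/DequantizeLinear
        # pairs around supported ops using ranges collected from calib_data.
        ...

@register_adaptor("hypothetical_npu")
class HypotheticalNPUAdaptor:
    def quantize(self, model, calib_data):
        # A non-Intel backend would plug in here, reusing the shared driver below.
        ...

def quantize(model, calib_data, backend: str):
    """Shared driver: look up the adaptor for the requested backend and run it."""
    return ADAPTORS[backend]().quantize(model, calib_data)
```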

liqunfu commented 2 years ago

I have heard 2 major concerns:

  1. Model distillation and pruning are not for ONNX models. My thinking is that both functionalities depend on the framework being able to train a model, which is lacking in a pure ONNX workflow. On the other hand, one can distill or prune a model in the original framework and simply convert the compressed model to ONNX afterwards (as sketched below). In this sense, the concern is not significant. This may also partially answer @AlexandreEichenberger's question on the "cross product" ability.

  2. There is no CI pipeline set up to run tests. The Intel team promised to add a CI pipeline with good test coverage. I think we no longer have a concern on this item.
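As a concrete illustration of point 1 (a minimal sketch assuming PyTorch; the model and layer choices are purely illustrative), one can prune in the training framework and then export the result to ONNX:

```python
# Minimal sketch of the flow in point 1: prune in the training framework,
# then export to ONNX. Assumes PyTorch; model/layer names are illustrative.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Prune 50% of the weights in the first Linear layer by L1 magnitude, then make
# the pruning permanent so the exported graph carries no pruning re-parametrization.
prune.l1_unstructured(model[0], name="weight", amount=0.5)
prune.remove(model[0], "weight")

# Export the (now sparse) model; the resulting ONNX graph contains only standard ops.
dummy_input = torch.randn(1, 128)
torch.onnx.export(model, dummy_input, "pruned_model.onnx", opset_version=13)
```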

Given that, I would recommend migrating the Neural Compressor repo to the ONNX organization.

AlexandreEichenberger commented 2 years ago

Rajeev

With respect to point 1, Liqun's point highlighted something that may have been glossed over during the presentation. All I remember was a graph where the inputs were either ONNX/TF/PyTorch.

Liqun seems to imply that to distill and prune, the software cannot take ONNX models; it has to take a TF/Keras/PyTorch model, do its magic, and then rely on the framework's ONNX export capabilities to generate the final ONNX model?

Not sure if that is correct or not; it would be great to get a slide that shows the logical steps that happen in the distill/prune process and which frameworks/tools are relied upon to do this process.

Tx

Alexandre


Alexandre Eichenberger, Principal RSM, Advanced Compiler Technologies

rajeevnalawadi commented 2 years ago

Hi Alex, yes, we will work towards providing the flow for the QAT-style scenario as applicable in the context of distillation/pruning scenarios. Thanks, Rajeev

rajeevsrao commented 2 years ago

We have the following primary objections to the inclusion of the Neural Compressor tool into ONNX:

We hope to continue this discussion in the ONNX SIG/SC meetings. Thanks.

rajeevnalawadi commented 2 years ago

Regarding the concerns above:

gramalingam commented 1 year ago

I am curious about @rajeevsrao's comment "the differences across IHVs on preferred approaches for Quantization". Can you clarify what differences? Specifically, I think there are two aspects here.

The first is: How to represent a quantized model in ONNX? It would be great if we could all agree on this, as this is, effectively, part of the ONNX spec.

The second is: When multiple quantizations are possible (within the ONNX standard), the efficiency of these different quantizations may vary across different hardware platforms. Here, I can understand the variability across HW. I assume Rajeev Rao's comment is about this?

Just trying to find out if there is any difference in opinion about the first point/question (namely, the use of QDQ, Quantize and DeQuantize ops, etc.)

rajeevsrao commented 1 year ago

@gramalingam I agree there are two parts to the question as you said.

The first is: How to represent a quantized model in ONNX? It would be great if we could all agree on this, as this is, effectively, part of the ONNX spec.

Our preference is QDQ based representation over quantized operators - the former provides more flexibility and limits the explosion of operators in the spec.

The second is: When multiple quantizations are possible (within the ONNX standard), the efficiency of these different quantizations may vary across different hardware platforms. Here, I can understand the variability across HW. I assume Rajeev Rao's comment is about this?

Correct, this relates to tooling and not the ONNX spec, and was the focus of my earlier comment. The performance characteristics of valid ONNX graphs using Q-DQ nodes could vary significantly based on choices made for Q-DQ node placement and the hardware used. In this respect there is no standardization of the techniques or heuristics used for node placement, and developers are expected to rely on IHV provided tools for optimizations targeting specific hardware.
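To make the representation under discussion concrete, here is a minimal sketch (not taken from either tool) of a QDQ pattern built with `onnx.helper`; the shapes and quantization parameters are illustrative:

```python
# Illustrative QDQ pattern: the weight is stored as an int8 initializer wrapped in
# DequantizeLinear, and the activation passes through a Quantize/Dequantize pair,
# while the MatMul itself remains a standard float op that backends may fuse.
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

x_scale = numpy_helper.from_array(np.array(0.02, dtype=np.float32), "x_scale")
x_zp = numpy_helper.from_array(np.array(0, dtype=np.int8), "x_zp")
w_scale = numpy_helper.from_array(np.array(0.01, dtype=np.float32), "w_scale")
w_zp = numpy_helper.from_array(np.array(0, dtype=np.int8), "w_zp")
w_q = numpy_helper.from_array(np.zeros((4, 4), dtype=np.int8), "w_q")  # pre-quantized weight

nodes = [
    helper.make_node("QuantizeLinear", ["x", "x_scale", "x_zp"], ["x_q"]),
    helper.make_node("DequantizeLinear", ["x_q", "x_scale", "x_zp"], ["x_dq"]),
    helper.make_node("DequantizeLinear", ["w_q", "w_scale", "w_zp"], ["w_dq"]),
    helper.make_node("MatMul", ["x_dq", "w_dq"], ["y"]),
]

graph = helper.make_graph(
    nodes,
    "qdq_example",
    [helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])],
    [helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 4])],
    initializer=[x_scale, x_zp, w_scale, w_zp, w_q],
)
onnx.checker.check_model(helper.make_model(graph))
```

Where exactly the Q/DQ pairs are placed, and around which ops, is precisely the choice that varies by hardware, as noted above.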

liqunfu commented 1 year ago

@rajeevnalawadi , thank you for quick reply! From what I see, neural compressor inserts QDQ for individual ONNX node: https://github.com/intel/neural-compressor/blob/8c2330d0b467f64f6114323f7fc133ff61025cfc/neural_compressor/adaptor/ox_utils/quantizer.py#L213

I may have missed something because I am still new to the code base. It seems that its implementation is driven by other frameworks (mxnet/tf/pt/onnxrt) and is configurable.

bkaruman commented 1 year ago

@rajeevnalawadi , thank you for quick reply! From what I see, neural compressor inserts QDQ for individual ONNX node: https://github.com/intel/neural-compressor/blob/8c2330d0b467f64f6114323f7fc133ff61025cfc/neural_compressor/adaptor/ox_utils/quantizer.py#L213

I may have missed something because I am still new to the code base. It seems that its implementation is driven by other frameworks (mxnet/tf/pt/onnxrt) and is configurable. Yes. The tool offers the flexibility to support QDQ/QLinear for ONNX nodes and can be configured based on HW. You can find the list of validated models (both QDQ and QLinear) under ONNX Runtime section here - https://github.com/intel/neural-compressor/blob/master/docs/validated_model_list.md

hshen14 commented 1 year ago

@gramalingam I agree there are two parts to the question as you said.

The first is: How to represent a quantized model in ONNX? It would be great if we could all agree on this, as this is, effectively, part of the ONNX spec.

Our preference is QDQ based representation over quantized operators - the former provides more flexibility and limits the explosion of operators in the spec.

The second is: When multiple quantizations are possible (within the ONNX standard), the efficiency of these different quantizations may vary across different hardware platforms. Here, I can understand the variability across HW. I assume Rajeev Rao's comment is about this?

Correct, this relates to tooling and not the ONNX spec, and was the focus of my earlier comment. The performance characteristics of valid ONNX graphs using Q-DQ nodes could vary significantly based on choices made for Q-DQ node placement and the hardware used. In this respect there is no standardization of the techniques or heuristics used for node placement, and developers are expected to rely on IHV provided tools for optimizations targeting specific hardware.

Right. What the tool provides is: 1) an infrastructure to quantize models in a Q-DQ manner; 2) built-in Q-DQ quantization recipes for most HW vendors; and 3) the capability to add custom Q-DQ recipes. We are looking for collaboration with HW vendors to improve the infrastructure and custom recipes as well. Here is one example: https://github.com/intel/neural-compressor/blob/master/neural_compressor/utils/options.py#L22

andife commented 1 year ago

The request was approved by the Steering Committee. The tool will be part of the newly formed Optimizer SIG.