neuralmagic / sparseml

Libraries for applying sparsification recipes to neural networks with a few lines of code, enabling faster and smaller models
Apache License 2.0

[Prototyping] Roberta Demo #2275

Closed dbogunowicz closed 1 month ago

dbogunowicz commented 1 month ago

Scrapbook branch for this task: https://app.asana.com/0/1207078450218847/1207253984013681/f

Introduction

I am presenting my findings in a Jupyter notebook, so that everyone can take a look at the stdout outputs of the dev flow without rerunning the code locally. The only exception is running the model in DeepSparse; it looks like the DeepSparse engine and the notebook environment do not like each other very much.

The only codebase change required is modifying SparseAutoModel so that it works with the new (as of 1.7) recipe structure. The new recipes have so far only been tested against LLMs, but this investigation shows that they are also compatible with the "old" transformer models after a few small tweaks.
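For illustration, here is a minimal sketch of the reload path this tweak unblocks (step 4 of the dev flow below). The task-specific loader and its keyword arguments are assumptions based on the pre-1.7 API, and the checkpoint directory name is hypothetical:

```python
# Minimal sketch (assumed API): reload a one-shot quantized checkpoint
# through SparseAutoModel so that the new (1.7) recipe stored alongside
# the model weights is picked up during loading.
from sparseml.transformers import SparseAutoModel

model = SparseAutoModel.text_classification_from_pretrained(
    model_name_or_path="./one_shot_output",  # hypothetical checkpoint dir
    model_type="model",
)
```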

Dev Flow

  1. Fetch the model, cardiffnlp/twitter-roberta-base-sentiment-latest, and the dataset it was trained on, tweet_eval:sentiment. The calibration set used for quantization consists of 512 samples from the training set (see the fetching sketch after the log below).
  2. Create the W8A8 quantization recipe for the model. Kudos to Sara for helping with the details of the recipe, and to Alex for proofreading it from the research perspective (a recipe sketch also follows the log below).
  3. Apply the recipe to the model in a one-shot manner.
  4. Reload the quantized model and evaluate it in SparseML -> the performance is much worse than the baseline, which we may need to worry about later; the focus of this goal is to run inference in DeepSparse.
  5. Export the model.
  6. Finally, attempt to run the exported model in the DeepSparse engine (an export-and-inference sketch follows the log below). Sadly, the inference crashes:
2024-05-10 09:34:22 deepsparse.pipeline WARNING  Could not create v2 'text-classification' pipeline, trying legacy
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.1 COMMUNITY | (3904e8ec) (release) (optimized) (system=avx2, binary=avx2)
[7f24c7cc3000 >WARN<  operator() ./src/include/wand/utility/warnings.hpp:14] Generating emulated code for quantized (INT8) operations since no VNNI instructions were detected. Set NM_FAST_VNNI_EMULATION=1 to increase performance at the expense of accuracy.
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.7.1 (3904e8ec) (release) (optimized) (system=avx2, binary=avx2)
OS: Linux workstation-deployment-57c9d55774-jwr4k 5.15.0-105-generic #115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024
Arch: x86_64
CPU: AuthenticAMD
Vendor: AMD
Cores/sockets/threads: [24, 1, 48]
Available cores/sockets/threads: [10, 1, 20]
L1 cache size data/instruction: 32k/32k
L2 cache size: 0.5Mb
L3 cache size: 16Mb
...
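To make the steps above concrete, here is a minimal sketch of step 1. The APIs are the standard Hugging Face ones; the shuffling seed and sampling strategy are assumptions, not necessarily what the notebook does:

```python
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "cardiffnlp/twitter-roberta-base-sentiment-latest"

# Fetch the baseline model and its tokenizer from the Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Fetch the dataset the model was trained on and carve out 512 training
# samples as the calibration set (seed and sampling are assumptions)
train_split = load_dataset("tweet_eval", "sentiment", split="train")
calibration_set = train_split.shuffle(seed=42).select(range(512))
```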
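Steps 2 and 3 could look roughly like the sketch below. The modifier name, the config-group fields, and the oneshot keyword arguments are assumptions extrapolated from the LLM flows; the authoritative recipe lives in the notebook:

```python
from sparseml.transformers import oneshot

# Assumed W8A8 recipe in the new (1.7) recipe structure; the modifier
# name and scheme fields mirror the LLM recipes and may differ from
# the recipe actually used in the notebook
recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["classifier"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: int
            symmetric: true
            strategy: tensor
          input_activations:
            num_bits: 8
            type: int
            symmetric: true
            strategy: tensor
"""

# Apply the recipe in a one-shot manner; the kwargs (and passing a
# pre-built calibration set directly) are assumptions
oneshot(
    model=model,
    dataset=calibration_set,
    recipe=recipe,
    output_dir="./one_shot_output",
    num_calibration_samples=512,
)
```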
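Finally, a sketch of steps 5 and 6. The export call follows the unified sparseml export entrypoint (exact signature assumed); the Pipeline usage is the standard DeepSparse text-classification flow and is the call that surfaces the crash above:

```python
from sparseml import export
from deepsparse import Pipeline

# Export the quantized checkpoint to an ONNX deployment directory
# (source path, task kwarg, and output layout are assumptions)
export("./one_shot_output", task="text-classification")

# Run the exported model in the DeepSparse engine; with this
# checkpoint, pipeline creation/inference is what crashes
pipeline = Pipeline.create(
    task="text-classification",
    model_path="./one_shot_output/deployment",
)
print(pipeline(["deepsparse and notebooks do not like each other"]))
```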
review-notebook-app[bot] commented 1 month ago

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

