Hi,
I wanted to try out the Llama2 model that you recently published, but I can't get it working. I'm using Docker on a Mac M1. I downloaded the deployment.tar.gz file from this page - https://sparsezoo.neuralmagic.com/models/llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized?hardware=deepsparse-c6i.12xlarge&comparison=llama2-7b-gsm8k_llama2_pretrain-base&tab=4 - and put it into the downloads/llama2/deployment directory. I then built a Docker image from a Dockerfile and requirements.txt, started the container, and tried to load the model in a sentiment-analysis pipeline, which failed with an error.
Hi @mneedham, LLMs for text generation like Llama only support running in a "text-generation" pipeline, so please use that task name instead of sentiment-analysis. You can also use a TextGeneration object directly; see the documentation here: https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md
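A minimal sketch of both options, with a placeholder model path (substitute your own deployment directory or zoo stub):

from deepsparse import Pipeline, TextGeneration

model_path = "/tmp/llama2/deployment"  # placeholder path

# Option 1: create the pipeline by task name ("text_generation", not sentiment-analysis).
pipeline = Pipeline.create(task="text_generation", model_path=model_path)

# Option 2: use the TextGeneration helper directly.
pipeline = TextGeneration(model=model_path)

print(pipeline(prompt="Hello").generations[0].text)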
Ah ok. I tried it like this:
docker run -it -v $PWD/downloads:/tmp deepsparse:0.0.3
This is what's in the downloads/llama2 directory:
$ tree downloads/llama2
downloads/llama2
├── deployment
│   ├── config.json
│   ├── model.data
│   ├── model.onnx
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── deployment.tar.gz
└── downloads
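(For reference, a minimal sketch of how this layout can be produced from the downloaded archive; this is an assumption about the steps, not the exact commands used:)

import tarfile

# Extract deployment.tar.gz in place, producing the deployment/ directory above.
with tarfile.open("downloads/llama2/deployment.tar.gz") as tar:
    tar.extractall("downloads/llama2")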
And then I run it:
from deepsparse import TextGeneration
zoo_stub = "/tmp/llama2/deployment"
pipeline = TextGeneration(model=zoo_stub)
2023-11-25 17:49:18 deepsparse.transformers.pipelines.text_generation INFO Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
[nm_ort ffff987fed40 >ERROR< init src/libdeepsparse/ort_engine/ort_engine.cpp:538] std exception Node (concat.past_key_values.0.value_transposed) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=32 Target=128 Dimension=2
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[4], line 4
1 from deepsparse import TextGeneration
3 zoo_stub = "/tmp/llama2/deployment"
----> 4 pipeline = TextGeneration(model=zoo_stub)
File /usr/local/lib/python3.11/site-packages/deepsparse/pipeline.py:814, in text_generation_pipeline(model, *args, **kwargs)
809 """
810 :return: text generation pipeline with the given args and
811 kwargs passed to Pipeline.create
812 """
813 kwargs = _parse_model_arg(model, **kwargs)
--> 814 return Pipeline.create("text_generation", *args, **kwargs)
File /usr/local/lib/python3.11/site-packages/deepsparse/base_pipeline.py:210, in BasePipeline.create(task, **kwargs)
204 buckets = pipeline_constructor.create_pipeline_buckets(
205 task=task,
206 **kwargs,
207 )
208 return BucketingPipeline(pipelines=buckets)
--> 210 return pipeline_constructor(**kwargs)
File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/pipelines/text_generation.py:281, in TextGenerationPipeline.__init__(self, sequence_length, prompt_sequence_length, force_max_tokens, internal_kv_cache, generation_config, **kwargs)
278 if not self.tokenizer.pad_token:
279 self.tokenizer.pad_token = self.tokenizer.eos_token
--> 281 self.engine, self.multitoken_engine = self.initialize_engines()
283 # auxiliary flag for devs to enable debug mode for the pipeline
284 self._debug = False
File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/pipelines/text_generation.py:361, in TextGenerationPipeline.initialize_engines(self)
346 if (
347 self.cache_support_enabled and self.enable_multitoken_prefill
348 ) or not self.cache_support_enabled:
(...)
353 # (the prompt is processed in a single pass, prompts length is fixed at
354 # sequence_length)
355 input_ids_length = (
356 self.prompt_sequence_length
357 if self.cache_support_enabled
358 else self.sequence_length
359 )
--> 361 multitoken_engine = NLDecoderEngine(
362 onnx_file_path=self.onnx_file_path,
363 engine_type=self.engine_type,
364 engine_args=self.engine_args,
365 engine_context=self.context,
366 sequence_length=self.sequence_length,
367 input_ids_length=input_ids_length,
368 internal_kv_cache=self.internal_kv_cache,
369 timer_manager=self.timer_manager,
370 )
372 if self.cache_support_enabled:
373 engine = NLDecoderEngine(
374 onnx_file_path=self.onnx_file_path,
375 engine_type=self.engine_type,
(...)
381 timer_manager=self.timer_manager,
382 )
File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/engines/nl_decoder_engine.py:82, in NLDecoderEngine.__init__(self, onnx_file_path, engine_type, engine_args, sequence_length, input_ids_length, engine_context, internal_kv_cache, timer_manager)
78 if internal_kv_cache and engine_type == DEEPSPARSE_ENGINE:
79 # inform the engine, that are using the kv cache
80 engine_args["cached_outputs"] = output_indices_to_be_cached
---> 82 self.engine = create_engine(
83 onnx_file_path=onnx_file_path,
84 engine_type=engine_type,
85 engine_args=engine_args,
86 context=engine_context,
87 )
88 self.timer_manager = timer_manager or TimerManager()
89 self.sequence_length = sequence_length
File /usr/local/lib/python3.11/site-packages/deepsparse/pipeline.py:759, in create_engine(onnx_file_path, engine_type, engine_args, context)
754 return MultiModelEngine(
755 model=onnx_file_path,
756 **engine_args,
757 )
758 engine_args.pop("cache_output_bools", None)
--> 759 return Engine(onnx_file_path, **engine_args)
761 if engine_type == ORT_ENGINE:
762 return ORTEngine(onnx_file_path, **engine_args)
File /usr/local/lib/python3.11/site-packages/deepsparse/engine.py:327, in Engine.__init__(self, model, batch_size, num_cores, num_streams, scheduler, input_shapes, cached_outputs)
317 self._eng_net = LIB.deepsparse_engine(
318 model_path,
319 engine_batch_size,
(...)
324 cached_outputs,
325 )
326 else:
--> 327 self._eng_net = LIB.deepsparse_engine(
328 self._model_path,
329 engine_batch_size,
330 self._num_cores,
331 self._num_streams,
332 self._scheduler.value,
333 None,
334 cached_outputs,
335 )
337 if self._batch_size is None:
338 os.environ.pop("NM_DISABLE_BATCH_OVERRIDE", None)
RuntimeError: NM: error: Node (concat.past_key_values.0.value_transposed) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=32 Target=128 Dimension=2
Hey @mneedham
I'm having difficulty reproducing your error, so let's try to get a minimal working example running. I'm afraid that once the llama2 model was downloaded to your disk, you may have unintentionally modified it by running earlier, incorrect commands.
Spin up your Docker container as you did before; as far as container initialization goes, I think you are doing everything correctly. Make sure that ROOT/.cache/sparsezoo/ is empty, so there is no lingering, potentially corrupted llama2 model in your cache.
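A sketch of clearing that cache, assuming the default location under the home directory:

import os
import shutil

# Remove any cached (potentially corrupted) SparseZoo models.
shutil.rmtree(os.path.expanduser("~/.cache/sparsezoo"), ignore_errors=True)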
For completeness, my setup is:
- ubuntu-20.04
- deepsparse-nightly (fresh pip install -U deepsparse-nightly[llm])
- python 3.10
Now enter your docker container and execute:
from deepsparse import TextGeneration
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"
pipeline = TextGeneration(model=model_path)
generations = pipeline(prompt="Who is the president of the United States?")
print(generations)
You should see output like this:
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Downloading (…)ed/deployment.tar.gz: 100%|██████████| 3.92G/3.92G [05:44<00:00, 12.2MB/s]
2023-11-30 12:22:55 deepsparse.transformers.pipelines.text_generation INFO Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231128 COMMUNITY | (46baca65) (release) (optimized) (system=avx2, binary=avx2)
[7fbcf691a700 >WARN< operator() ./src/include/wand/utility/warnings.hpp:14] Generating emulated code for quantized (INT8) operations since no VNNI instructions were detected. Set NM_FAST_VNNI_EMULATION=1 to increase performance at the expense of accuracy.
created=datetime.datetime(2023, 11, 30, 12, 23, 50, 707904) prompts='Who is the president of the United States?' generations=[GeneratedText(text='The president of the United States is the person who is the most senior in the chain of command.\nThe chain of command is the set of people who are in charge of the different parts of the government.\nThe president is the most senior in the chain of command, so he is the 1st in the chain of command.\n#### 1', score=None, finished=True, finished_reason='stop')] input_tokens=None
Could you try following these instructions?
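As a follow-up, once the download above succeeds, the same call should also accept a local copy of the deployment directory; a sketch, with a hypothetical local path:

from deepsparse import TextGeneration

local_path = "/tmp/llama2/deployment"  # hypothetical: wherever the deployment dir lives
pipeline = TextGeneration(model=local_path)
print(pipeline(prompt="Who is the president of the United States?"))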
Hey @dbogunowicz,
Sorry for the delayed reply - I only just saw your message! The example that you provided works great, thanks!
In [8]: generations = pipeline(prompt="Who is the president of the United States?", streaming=True)
In [9]: %%time
...: for it in generations:
...: print(it.generations[0].text, end=" ")
...:
<s> The president of the United States is the head of the executive branch of the government .
The president is also the head of the government .
The president is the head of the government and the head of the executive branch , so the president is also the head of the whole government .
#### 1 </s> CPU times: user 48.1 s, sys: 17.5 ms, total: 48.1 s
Wall time: 8.19 s
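A rough way to turn those timings into throughput, sketched against the pipeline object above (it counts streamed chunks, which only approximates tokens):

import time

start = time.perf_counter()
chunks = 0
for it in pipeline(prompt="Who is the president of the United States?", streaming=True):
    chunks += 1  # each iteration yields one generation chunk
elapsed = time.perf_counter() - start
print(f"{chunks} chunks in {elapsed:.2f}s (~{chunks / elapsed:.1f} chunks/s)")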
Great to hear that @mneedham!
I will close this issue, as it is resolved. I hope you have fun working with NM products. If you come across any problems, feel free to reach out to us!