Hi,
I wanted to try out the Llama2 model that you recently published, but I can't get it working. I'm using Docker on a Mac M1. I downloaded the deployment.tar.gz file from this page - https://sparsezoo.neuralmagic.com/models/llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized?hardware=deepsparse-c6i.12xlarge&comparison=llama2-7b-gsm8k_llama2_pretrain-base&tab=4 - and put it into the downloads/llama2/deployment directory. I then built a Docker image from a Dockerfile and requirements.txt, started the container, and tried to load the model in a sentiment-analysis pipeline, which failed with an error.
Hi @mneedham, LLMs for text generation like Llama only support running in a "text-generation" pipeline, so please use that task name instead of sentiment-analysis. You can also use a TextGeneration object directly; see the documentation here: https://github.com/neuralmagic/deepsparse/blob/main/docs/llms/text-generation-pipeline.md
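A minimal sketch of both options, with a placeholder model path (substitute your own deployment directory or zoo stub):

from deepsparse import Pipeline, TextGeneration

model_path = "/tmp/llama2/deployment"  # placeholder path

# Option 1: create the pipeline by task name ("text_generation", not sentiment-analysis).
pipeline = Pipeline.create(task="text_generation", model_path=model_path)

# Option 2: use the TextGeneration helper directly.
pipeline = TextGeneration(model=model_path)

print(pipeline(prompt="Hello").generations[0].text)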
Ah ok. I tried it like this:
docker run -it -v $PWD/downloads:/tmp deepsparse:0.0.3
This is what's in the downloads/llama2 directory:
$ tree downloads/llama2
downloads/llama2
├── deployment
│   ├── config.json
│   ├── model.data
│   ├── model.onnx
│   ├── special_tokens_map.json
│   ├── tokenizer.json
│   └── tokenizer_config.json
├── deployment.tar.gz
└── downloads
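(For reference, a minimal sketch of how this layout can be produced from the downloaded archive; this is an assumption about the steps, not the exact commands used:)

import tarfile

# Extract deployment.tar.gz in place, producing the deployment/ directory above.
with tarfile.open("downloads/llama2/deployment.tar.gz") as tar:
    tar.extractall("downloads/llama2")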
And then I run it:
from deepsparse import TextGeneration
zoo_stub = "/tmp/llama2/deployment"
pipeline = TextGeneration(model=zoo_stub)
2023-11-25 17:49:18 deepsparse.transformers.pipelines.text_generation INFO Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
[nm_ort ffff987fed40 >ERROR< init src/libdeepsparse/ort_engine/ort_engine.cpp:538] std exception Node (concat.past_key_values.0.value_transposed) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=32 Target=128 Dimension=2
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[4], line 4
1 from deepsparse import TextGeneration
3 zoo_stub = "/tmp/llama2/deployment"
----> 4 pipeline = TextGeneration(model=zoo_stub)
File /usr/local/lib/python3.11/site-packages/deepsparse/pipeline.py:814, in text_generation_pipeline(model, *args, **kwargs)
809 """
810 :return: text generation pipeline with the given args and
811 kwargs passed to Pipeline.create
812 """
813 kwargs = _parse_model_arg(model, **kwargs)
--> 814 return Pipeline.create("text_generation", *args, **kwargs)
File /usr/local/lib/python3.11/site-packages/deepsparse/base_pipeline.py:210, in BasePipeline.create(task, **kwargs)
204 buckets = pipeline_constructor.create_pipeline_buckets(
205 task=task,
206 **kwargs,
207 )
208 return BucketingPipeline(pipelines=buckets)
--> 210 return pipeline_constructor(**kwargs)
File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/pipelines/text_generation.py:281, in TextGenerationPipeline.__init__(self, sequence_length, prompt_sequence_length, force_max_tokens, internal_kv_cache, generation_config, **kwargs)
278 if not self.tokenizer.pad_token:
279 self.tokenizer.pad_token = self.tokenizer.eos_token
--> 281 self.engine, self.multitoken_engine = self.initialize_engines()
283 # auxiliary flag for devs to enable debug mode for the pipeline
284 self._debug = False
File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/pipelines/text_generation.py:361, in TextGenerationPipeline.initialize_engines(self)
346 if (
347 self.cache_support_enabled and self.enable_multitoken_prefill
348 ) or not self.cache_support_enabled:
(...)
353 # (the prompt is processed in a single pass, prompts length is fixed at
354 # sequence_length)
355 input_ids_length = (
356 self.prompt_sequence_length
357 if self.cache_support_enabled
358 else self.sequence_length
359 )
--> 361 multitoken_engine = NLDecoderEngine(
362 onnx_file_path=self.onnx_file_path,
363 engine_type=self.engine_type,
364 engine_args=self.engine_args,
365 engine_context=self.context,
366 sequence_length=self.sequence_length,
367 input_ids_length=input_ids_length,
368 internal_kv_cache=self.internal_kv_cache,
369 timer_manager=self.timer_manager,
370 )
372 if self.cache_support_enabled:
373 engine = NLDecoderEngine(
374 onnx_file_path=self.onnx_file_path,
375 engine_type=self.engine_type,
(...)
381 timer_manager=self.timer_manager,
382 )
File /usr/local/lib/python3.11/site-packages/deepsparse/transformers/engines/nl_decoder_engine.py:82, in NLDecoderEngine.__init__(self, onnx_file_path, engine_type, engine_args, sequence_length, input_ids_length, engine_context, internal_kv_cache, timer_manager)
78 if internal_kv_cache and engine_type == DEEPSPARSE_ENGINE:
79 # inform the engine, that are using the kv cache
80 engine_args["cached_outputs"] = output_indices_to_be_cached
---> 82 self.engine = create_engine(
83 onnx_file_path=onnx_file_path,
84 engine_type=engine_type,
85 engine_args=engine_args,
86 context=engine_context,
87 )
88 self.timer_manager = timer_manager or TimerManager()
89 self.sequence_length = sequence_length
File /usr/local/lib/python3.11/site-packages/deepsparse/pipeline.py:759, in create_engine(onnx_file_path, engine_type, engine_args, context)
754 return MultiModelEngine(
755 model=onnx_file_path,
756 **engine_args,
757 )
758 engine_args.pop("cache_output_bools", None)
--> 759 return Engine(onnx_file_path, **engine_args)
761 if engine_type == ORT_ENGINE:
762 return ORTEngine(onnx_file_path, **engine_args)
File /usr/local/lib/python3.11/site-packages/deepsparse/engine.py:327, in Engine.__init__(self, model, batch_size, num_cores, num_streams, scheduler, input_shapes, cached_outputs)
317 self._eng_net = LIB.deepsparse_engine(
318 model_path,
319 engine_batch_size,
(...)
324 cached_outputs,
325 )
326 else:
--> 327 self._eng_net = LIB.deepsparse_engine(
328 self._model_path,
329 engine_batch_size,
330 self._num_cores,
331 self._num_streams,
332 self._scheduler.value,
333 None,
334 cached_outputs,
335 )
337 if self._batch_size is None:
338 os.environ.pop("NM_DISABLE_BATCH_OVERRIDE", None)
RuntimeError: NM: error: Node (concat.past_key_values.0.value_transposed) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=32 Target=128 Dimension=2
Hey @mneedham
I'm having difficulty reproducing your error, so let's try to get a minimal working example running. I'm afraid that once the llama2 model was downloaded to your disk, you may have unintentionally modified it by running earlier, incorrect commands.
Spin up your Docker container as you did before; as far as container initialization goes, I think you are doing everything correctly. Make sure that ROOT/.cache/sparsezoo/ is empty, so there is no lingering, potentially corrupted llama2 model in your cache.
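A sketch of clearing that cache, assuming the default location under the home directory:

import os
import shutil

# Remove any cached (potentially corrupted) SparseZoo models.
shutil.rmtree(os.path.expanduser("~/.cache/sparsezoo"), ignore_errors=True)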
For completeness, my setup is:
- ubuntu-20.04
- deepsparse-nightly (fresh pip install -U deepsparse-nightly[llm])
- python 3.10
Now enter your docker container and execute:
from deepsparse import TextGeneration
model_path = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"
pipeline = TextGeneration(model=model_path)
generations = pipeline(prompt="Who is the president of the United States?")
print(generations)
You should see output like this:
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Downloading (…)ed/deployment.tar.gz: 100%|██████████| 3.92G/3.92G [05:44<00:00, 12.2MB/s]
2023-11-30 12:22:55 deepsparse.transformers.pipelines.text_generation INFO Compiling an auxiliary engine to process a prompt with a larger processing length. This improves performance, but may result in additional memory consumption.
DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.6.0.20231128 COMMUNITY | (46baca65) (release) (optimized) (system=avx2, binary=avx2)
[7fbcf691a700 >WARN< operator() ./src/include/wand/utility/warnings.hpp:14] Generating emulated code for quantized (INT8) operations since no VNNI instructions were detected. Set NM_FAST_VNNI_EMULATION=1 to increase performance at the expense of accuracy.
created=datetime.datetime(2023, 11, 30, 12, 23, 50, 707904) prompts='Who is the president of the United States?' generations=[GeneratedText(text='The president of the United States is the person who is the most senior in the chain of command.\nThe chain of command is the set of people who are in charge of the different parts of the government.\nThe president is the most senior in the chain of command, so he is the 1st in the chain of command.\n#### 1', score=None, finished=True, finished_reason='stop')] input_tokens=None
Could you try following these instructions?
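As a follow-up, once the download above succeeds, the same call should also accept a local copy of the deployment directory; a sketch, with a hypothetical local path:

from deepsparse import TextGeneration

local_path = "/tmp/llama2/deployment"  # hypothetical: wherever the deployment dir lives
pipeline = TextGeneration(model=local_path)
print(pipeline(prompt="Who is the president of the United States?"))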
Hey @dbogunowicz,
Sorry for the delayed reply - I only just saw your message! The example that you provided works great, thanks!
In [8]: generations = pipeline(prompt="Who is the president of the United States?", streaming=True)
In [9]: %%time
...: for it in generations:
...: print(it.generations[0].text, end=" ")
...:
<s> The president of the United States is the head of the executive branch of the government .
The president is also the head of the government .
The president is the head of the government and the head of the executive branch , so the president is also the head of the whole government .
#### 1 </s> CPU times: user 48.1 s, sys: 17.5 ms, total: 48.1 s
Wall time: 8.19 s
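A rough way to turn those timings into throughput, sketched against the pipeline object above (it counts streamed chunks, which only approximates tokens):

import time

start = time.perf_counter()
chunks = 0
for it in pipeline(prompt="Who is the president of the United States?", streaming=True):
    chunks += 1  # each iteration yields one generation chunk
elapsed = time.perf_counter() - start
print(f"{chunks} chunks in {elapsed:.2f}s (~{chunks / elapsed:.1f} chunks/s)")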
Great to hear that @mneedham!
I will close this issue, as it is resolved. I hope you have fun working with NM products. If you come across any problems, feel free to reach out to us!