huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

nrt_load_collectives: Unsupported topology. Supported number of Neuron Cores is 1, 2, 8, 16 or a multiple of 32. For other configurations, such as multiples of 2, consider using inf2 instances #619

Closed oemd001 closed 3 months ago

oemd001 commented 4 months ago

System Info

absl-py==2.1.0
accelerate==0.29.2
aiohttp==3.9.5
aiosignal==1.3.1
anyio==4.4.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
aws-neuronx-runtime-discovery==2.9
awscli==1.32.117
Babel==2.15.0
beautifulsoup4==4.12.3
bleach==6.1.0
boto3==1.34.117
botocore==1.34.117
build==1.2.1
cachetools==5.3.3
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
cloud-tpu-client==0.10
colorama==0.4.6
coloredlogs==15.0.1
comm==0.2.2
datasets==2.19.1
debugpy==1.8.1
decorator==5.1.1
defusedxml==0.7.1
dill==0.3.8
docutils==0.16
ec2-metadata==2.10.0
environment-kernels==1.2.0
exceptiongroup==1.2.1
executing==2.0.1
fastjsonschema==2.19.1
filelock==3.14.0
fqdn==1.5.1
frozenlist==1.4.1
fsspec==2024.3.1
google-api-core==1.34.1
google-api-python-client==1.8.0
google-auth==2.29.0
google-auth-httplib2==0.2.0
googleapis-common-protos==1.63.0
h11==0.14.0
httpcore==1.0.5
httplib2==0.22.0
httpx==0.27.0
huggingface-hub==0.23.2
humanfriendly==10.0
idna==3.7
ipykernel==6.29.4
ipython==8.25.0
ipywidgets==8.1.3
islpy==2023.1
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.4
jmespath==1.0.1
json5==0.9.25
jsonpointer==2.4
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
jupyter==1.0.0
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.2
jupyter_core==5.7.2
jupyter_server==2.14.1
jupyter_server_terminals==0.5.3
jupyterlab==4.2.1
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.2
jupyterlab_widgets==3.0.11
libneuronxla==2.0.965
lockfile==0.12.2
MarkupSafe==2.1.5
matplotlib-inline==0.1.7
mistune==3.0.2
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==2.6.3
neuronx-cc==2.13.72.0+78a426937
neuronx-distributed==0.7.0
notebook==7.2.0
notebook_shim==0.2.4
numpy==1.25.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.5.40
nvidia-nvtx-cu12==12.1.105
oauth2client==4.1.3
optimum==1.20.0
optimum-neuron==0.0.23
overrides==7.7.0
packaging==24.0
pandas==2.2.2
pandocfilters==1.5.1
parso==0.8.4
pexpect==4.9.0
pgzip==0.3.5
pillow==10.3.0
platformdirs==4.2.2
prometheus_client==0.20.0
prompt_toolkit==3.0.45
protobuf==3.19.6
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
pyarrow==16.1.0
pyarrow-hotfix==0.6
pyasn1==0.6.0
pyasn1_modules==0.4.0
pycparser==2.22
Pygments==2.18.0
pyparsing==3.1.2
pyproject_hooks==1.1.0
python-daemon==3.0.1
python-dateutil==2.9.0.post0
python-json-logger==2.0.7
pytz==2024.1
PyYAML==6.0.1
pyzmq==26.0.3
qtconsole==5.5.2
QtPy==2.4.1
referencing==0.35.1
regex==2024.5.15
requests==2.32.3
requests-unixsocket==0.3.0
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rpds-py==0.18.1
rsa==4.7.2
s3transfer==0.10.1
safetensors==0.4.3
scipy==1.11.2
Send2Trash==1.8.3
sentencepiece==0.2.0
six==1.16.0
sniffio==1.3.1
soupsieve==2.5
stack-data==0.6.3
sympy==1.12.1
terminado==0.18.1
tinycss2==1.3.0
tokenizers==0.19.1
tomli==2.0.1
torch==2.1.2
torch-neuronx==2.1.2.2.1.0
torch-xla==2.1.2
torchvision==0.16.2
tornado==6.4
tqdm==4.66.4
traitlets==5.14.3
transformers==4.41.1
transformers-neuronx==0.10.0.21
triton==2.1.0
types-python-dateutil==2.9.0.20240316
typing_extensions==4.12.0
tzdata==2024.1
uri-template==1.3.0
uritemplate==3.0.1
urllib3==2.2.1
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.8.0
wget==3.2
widgetsnbextension==4.0.11
xxhash==3.4.1
yarl==1.9.4

Who can help?

@dacorvo


Reproduction (minimal, reproducible, runnable)

Export the meta-llama/Meta-Llama-3-70B model using the following command:

optimum-cli export neuron --model meta-llama/Meta-Llama-3-70B --batch_size 1 --sequence_length 8192 --auto_cast_type fp16 --num_cores 16 llama3_70b_neuron/  # auto_cast_type fp16 casts operations from BF16 to FP16

Attempt to run it with docker

docker run -p 8080:80 -v /home/ubuntu/llama3_70b_neuron:/data --privileged ghcr.io/huggingface/neuronx-tgi:latest --model-id /data/checkpoint

Errors:

2024-Jun-01 13:16:27.705954   273:487   ERROR   NRT:nrt_load_collectives   Unsupported topology. Supported number of Neuron Cores is 1, 2, 8, 16 or a multiple of 32. For other configurations, such as multiples of 2, consider using inf2 instances
2024-Jun-01 13:16:27.705593   273:489   ERROR   NRT:nrt_load_collectives   Unsupported topology. Supported number of Neuron Cores is 1, 2, 8, 16 or a multiple of 32. For other configurations, such as multiples of 2, consider using inf2 instances

Full error report:

Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 60, in serve
    serve(model_id, revision, uds_path)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 89, in serve
    asyncio.run(serve_inner(model_id, revision))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 63, in serve_inner
    generator = NeuronGenerator.from_pretrained(model_id, revision)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 589, in from_pretrained
    model = NeuronModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/modeling_base.py", line 402, in from_pretrained
    return from_pretrained_method(
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 324, in _from_transformers
    return cls._export(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 372, in _export
    return cls(new_config, checkpoint_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling.py", line 671, in __init__
    super().__init__(config, checkpoint_dir, compiled_dir=compiled_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 215, in __init__
    neuronx_model.to_neuron()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 73, in to_neuron
    self.setup()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 64, in setup
    nbs.setup()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 384, in setup
    self.program.setup(self.layers, self.pre_layer_parameters, self.ln_lm_head_params)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 1613, in setup
    super().setup(layers, pre_layer_params, ln_lm_head_params)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 1472, in setup
    kernel.load(io_ring_cache_size)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 476, in load
    self.model.load()
RuntimeError: nrt_load_collectives status=2 message="Invalid"

2024-06-01T13:16:28.721458Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-01T13:16:38.729466Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-01T13:16:40.431236Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py:154: UserWarning: KV head replication will be enabled since the number of KV heads (8) is not evenly divisible by the tensor parallel degree (24)
  warnings.warn(
2024-Jun-01 13:16:27.705412   273:467   ERROR   NRT:nrt_load_collectives   Unsupported topology. Supported number of Neuron Cores is 1, 2, 8, 16 or a multiple of 32. For other configurations, such as multiples of 2, consider using inf2 instances
[the same NRT:nrt_load_collectives error is repeated by the other worker threads]
Traceback (most recent call last):

  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 60, in serve
    serve(model_id, revision, uds_path)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 89, in serve
    asyncio.run(serve_inner(model_id, revision))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 63, in serve_inner
    generator = NeuronGenerator.from_pretrained(model_id, revision)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 589, in from_pretrained
    model = NeuronModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/modeling_base.py", line 402, in from_pretrained
    return from_pretrained_method(
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 324, in _from_transformers
    return cls._export(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 372, in _export
    return cls(new_config, checkpoint_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling.py", line 671, in __init__
    super().__init__(config, checkpoint_dir, compiled_dir=compiled_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 215, in __init__
    neuronx_model.to_neuron()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 73, in to_neuron
    self.setup()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 64, in setup
    nbs.setup()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 384, in setup
    self.program.setup(self.layers, self.pre_layer_parameters, self.ln_lm_head_params)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 1613, in setup
    super().setup(layers, pre_layer_params, ln_lm_head_params)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 1472, in setup
    kernel.load(io_ring_cache_size)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 476, in load
    self.model.load()
RuntimeError: nrt_load_collectives status=2 message="Invalid" rank=0
2024-06-01T13:16:40.475715Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-01T13:16:40.475724Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart

Expected behavior

As with TGI on Hugging Face, the expected behavior is a running inference endpoint.

oemd001 commented 4 months ago

Sorry for the code/error/log dump, but this is definitely a head-scratcher. I'm unsure whether I compiled the Neuron model incorrectly, or with incorrect parameters, but I'm at a loss as to how to proceed with the error above.

I am running a trn1.32xlarge instance.

jimburtoft commented 4 months ago

@oemd001 on a trn1.32xlarge, you want num_cores=32. There are 16 Trainium devices on the instance, and each of them has two Neuron cores.
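The constraint from the runtime error can be written down directly. A minimal sketch (the helper name is hypothetical, not part of optimum-neuron or the Neuron runtime) encoding the topologies the NRT error message lists as supported:

```python
def is_supported_trn1_topology(num_cores: int) -> bool:
    """Check a core count against the topologies named in the NRT error:
    1, 2, 8, 16, or a multiple of 32."""
    return num_cores in (1, 2, 8, 16) or (num_cores > 0 and num_cores % 32 == 0)

# A trn1.32xlarge exposes 16 Trainium devices x 2 Neuron cores = 32 cores,
# so 32 is a supported topology. A tensor parallel degree like 24 (seen in
# the KV head replication warning earlier in this thread) is not.
print(is_supported_trn1_topology(32))  # True
print(is_supported_trn1_topology(24))  # False
```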

oemd001 commented 4 months ago

Hey @jimburtoft

I modified the command, and changed num_cores to be 32.

optimum-cli export neuron --model meta-llama/Meta-Llama-3-70B --batch_size 1 --sequence_length 8192 --auto_cast_type fp16 --num_cores 32 llama3_70b_neuron/
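Before launching the container, it can help to confirm what the artifact was actually compiled for. A hedged sketch, assuming the export records its settings under a `neuron` entry in the artifact's `config.json` (the key names are an assumption and may differ across optimum-neuron versions; inspect your own file):

```python
import json
from pathlib import Path

def compiled_num_cores(config_path: str) -> int:
    """Read the core count recorded in an exported Neuron config.json.
    Assumes a "neuron" section with a "num_cores" key (verify locally)."""
    config = json.loads(Path(config_path).read_text())
    return int(config["neuron"]["num_cores"])

# Example with a fabricated config mirroring the export flags above.
example = {"neuron": {"num_cores": 32, "batch_size": 1, "sequence_length": 8192}}
Path("/tmp/example_neuron_config.json").write_text(json.dumps(example))
print(compiled_num_cores("/tmp/example_neuron_config.json"))  # 32
```

If the recorded value is not 32, the container would be loading an artifact compiled for the wrong topology regardless of the instance type.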

After attempting to re-run the docker container, it seems like I'm still getting the same problem:

2024-06-02T04:05:36.987656Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 778, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 60, in serve
    serve(model_id, revision, uds_path)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 89, in serve
    asyncio.run(serve_inner(model_id, revision))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/usr/lib/python3.10/asyncio/base_events.py", line 1909, in _run_once
    handle._run()
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
> File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 63, in serve_inner
    generator = NeuronGenerator.from_pretrained(model_id, revision)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 589, in from_pretrained
    model = NeuronModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/modeling_base.py", line 402, in from_pretrained
    return from_pretrained_method(
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 324, in _from_transformers
    return cls._export(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 372, in _export
    return cls(new_config, checkpoint_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling.py", line 671, in __init__
    super().__init__(config, checkpoint_dir, compiled_dir=compiled_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 215, in __init__
    neuronx_model.to_neuron()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 73, in to_neuron
    self.setup()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 64, in setup
    nbs.setup()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 384, in setup
    self.program.setup(self.layers, self.pre_layer_parameters, self.ln_lm_head_params)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 1613, in setup
    super().setup(layers, pre_layer_params, ln_lm_head_params)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 1472, in setup
    kernel.load(io_ring_cache_size)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 476, in load
    self.model.load()
RuntimeError: nrt_load_collectives status=2 message="Invalid"

2024-06-02T04:05:42.363198Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
2024-06-02T04:05:52.172344Z ERROR shard-manager: text_generation_launcher: Shard complete standard error output:

/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py:154: UserWarning: KV head replication will be enabled since the number of KV heads (8) is not evenly divisible by the tensor parallel degree (24)
  warnings.warn(
2024-Jun-02 04:05:36.420077   273:478   ERROR   NRT:nrt_load_collectives                    Unsupported topology. Supported number of Neuron Cores is 1, 2, 8, 16 or a multiple of 32.  For other configurations, such as multiples of 2, consider using inf2 instances
2024-Jun-02 04:05:36.428977   273:478   ERROR   NRT:nrt_load_collectives                    Unsupported topology. Supported number of Neuron Cores is 1, 2, 8, 16 or a multiple of 32.  For other configurations, such as multiples of 2, consider using inf2 instances
[the same NRT:nrt_load_collectives error is repeated by the same thread ~20 more times]
2024-Jun-02 04:05:36.428966   273:488   ERROR   NRT:nrt_load_collectives                    Unsupported topology. Supported number of Neuron Cores is 1, 2, 8, 16 or a multiple of 32.  For other configurations, such as multiples of 2, consider using inf2 instances
2024-Jun-02 04:05:36.437742   273:486   ERROR   NRT:nrt_load_collectives                    Unsupported topology. Supported number of Neuron Cores is 1, 2, 8, 16 or a multiple of 32.  For other configurations, such as multiples of 2, consider using inf2 instances
Traceback (most recent call last):
  File "/usr/local/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/cli.py", line 60, in serve
    serve(model_id, revision, uds_path)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 89, in serve
    asyncio.run(serve_inner(model_id, revision))
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/server.py", line 63, in serve_inner
    generator = NeuronGenerator.from_pretrained(model_id, revision)
  File "/usr/local/lib/python3.10/dist-packages/text_generation_server/generator.py", line 589, in from_pretrained
    model = NeuronModelForCausalLM.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/optimum/modeling_base.py", line 402, in from_pretrained
    return from_pretrained_method(
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 324, in _from_transformers
    return cls._export(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 372, in _export
    return cls(new_config, checkpoint_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling.py", line 671, in __init__
    super().__init__(config, checkpoint_dir, compiled_dir=compiled_dir, generation_config=generation_config)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/utils/require_utils.py", line 50, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/neuron/modeling_decoder.py", line 215, in __init__
    neuronx_model.to_neuron()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 73, in to_neuron
    self.setup()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/base.py", line 64, in setup
    nbs.setup()
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 384, in setup
    self.program.setup(self.layers, self.pre_layer_parameters, self.ln_lm_head_params)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 1613, in setup
    super().setup(layers, pre_layer_params, ln_lm_head_params)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/decoder.py", line 1472, in setup
    kernel.load(io_ring_cache_size)
  File "/usr/local/lib/python3.10/dist-packages/transformers_neuronx/compiler.py", line 476, in load
    self.model.load()
RuntimeError: nrt_load_collectives status=2 message="Invalid"
 rank=0
2024-06-02T04:05:52.241345Z ERROR text_generation_launcher: Shard 0 failed to start
2024-06-02T04:05:52.241369Z  INFO text_generation_launcher: Shutting down shards
Error: ShardCannotStart
oemd001 commented 4 months ago

For additional context, this is my config.json:

{
  "_name_or_path": "meta-llama/Meta-Llama-3-70B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 8192,
  "initializer_range": 0.02,
  "intermediate_size": 28672,
  "max_position_embeddings": 8192,
  "mlp_bias": false,
  "model_type": "llama",
  "neuron": {
    "auto_cast_type": "fp16",
    "batch_size": 1,
    "checkpoint_id": "meta-llama/Meta-Llama-3-70B",
    "checkpoint_revision": "b4d08b7db49d488da3ac49adf25a6b9ac01ae338",
    "compiler_type": "neuronx-cc",
    "compiler_version": "2.13.72.0+78a426937",
    "num_cores": 32,
    "sequence_length": 8192,
    "task": "text-generation"
  },
  "num_attention_heads": 64,
  "num_hidden_layers": 80,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.1",
  "use_cache": true,
  "vocab_size": 128256
}
dacorvo commented 3 months ago

@oemd001 your invocation is incorrect: you should pass the path to the neuron model directory itself, not the path to the checkpoint directory nested inside it:

optimum-cli export neuron --model meta-llama/Meta-Llama-3-70B \
    --batch_size 1 \
    --sequence_length 8192 \
    --auto_cast_type fp16 \
    --num_cores 16 \
    ./data/llama3_70b_neuron

Then

docker run -p 8080:80 \
    -v /home/ubuntu/data:/data \
    --privileged \
    ghcr.io/huggingface/neuronx-tgi:latest \
    --model-id /data/llama3_70b_neuron
oemd001 commented 3 months ago

Let me give that a try! Will keep you updated :)

oemd001 commented 3 months ago

I hit one last issue (maybe I need an update of some sort); this is the error message:

RuntimeError: Pretrained model is compiled with neuronx-cc(2.13.72.0+78a426937) newer than current compiler (2.13.66.0+6dfecc895), which may cause runtime incompatibilities.

I purged all existing images using this command:

sudo docker rmi -f $(docker images -aq)

I'm going to downgrade my neuronx-cc version to the one reported, recompile, and let you know if there are any other issues.
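Before recompiling, a quick comparison along these lines could confirm the host compiler matches the one bundled in the serving image (a sketch, not from the original thread; it assumes a pip-installed `neuronx-cc` on the host and a local Docker daemon, and that `neuronx-cc` accepts `--version`):

```shell
# Sketch: compare the host compiler with the one inside the TGI image,
# to catch version mismatches like the RuntimeError above before exporting.
pip show neuronx-cc | grep '^Version' || echo "neuronx-cc not installed on host"
docker run --rm --entrypoint neuronx-cc \
    ghcr.io/huggingface/neuronx-tgi:latest --version \
    || echo "docker or image not available"
```

If the two versions differ, either pin the host compiler to the image's version or (as suggested below in the thread) export inside the image itself.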

dacorvo commented 3 months ago

Your environment has a newer SDK version (2.18.2) than the one in TGI (2.18.0). You need to export the model using the TGI image (almost the same command, but wrapped in Docker). See the documentation here: https://huggingface.co/docs/optimum-neuron/guides/export_model#exporting-neuron-models-using-neuronx-tgi
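For reference, the docker-wrapped export would look roughly like the following (a sketch based on the linked documentation; the `--entrypoint optimum-cli` override and the host path `/home/ubuntu/data` are assumptions, so adapt them to your setup):

```shell
# Sketch: run the export with the compiler bundled in the TGI image,
# so the compiled artifacts match the runtime that will serve them.
docker run --rm --privileged \
    -v /home/ubuntu/data:/data \
    --entrypoint optimum-cli \
    ghcr.io/huggingface/neuronx-tgi:latest \
    export neuron --model meta-llama/Meta-Llama-3-70B \
        --batch_size 1 \
        --sequence_length 8192 \
        --auto_cast_type fp16 \
        --num_cores 16 \
        /data/llama3_70b_neuron
```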

oemd001 commented 3 months ago

@dacorvo, gotcha, makes sense. I actually just downgraded and it worked. It's not ideal, but it was a band-aid fix to get things working.

It does work now! It's worth mentioning that the time to first token is on the slower end. Is there anything that can be done to improve the time to first token, or does that require an update from AWS?

dacorvo commented 3 months ago

You can reduce the TTFT by adjusting the model's static size (i.e. sequence_length), but be aware that responses might be truncated if you reduce it too much. As a compromise we usually use 4096. The next release of the AWS Neuron SDK should provide a speedup: stay tuned!
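As a concrete sketch of that compromise, the export command from earlier in the thread with the shorter static sequence length would be (output path is illustrative, and compiling a 70B model still requires a Trainium/Inferentia host):

```shell
# Sketch: re-export with sequence_length 4096 to cut time-to-first-token,
# at the cost of truncating responses longer than 4096 tokens.
optimum-cli export neuron --model meta-llama/Meta-Llama-3-70B \
    --batch_size 1 \
    --sequence_length 4096 \
    --auto_cast_type fp16 \
    --num_cores 16 \
    ./data/llama3_70b_neuron_4k
```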

oemd001 commented 3 months ago

Gotcha, sounds good! Thank you + @jimburtoft for the help, really appreciate it!

Closing this issue