huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Support Phi-3.5 MoE #2457

Open maziyarpanahi opened 3 weeks ago

maziyarpanahi commented 3 weeks ago

Feature request

Add support for microsoft/Phi-3.5-MoE-instruct, which uses the PhiMoEForCausalLM architecture.

Motivation

It fails with the following error:

2024-08-25 21:25:51.891 | INFO     | text_generation_server.utils.import_utils:<module>:75 - Detected system cuda
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Traceback (most recent call last):
  File "/opt/conda/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/cli.py", line 118, in serve
    server.serve(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 297, in serve
    asyncio.run(
  File "/opt/conda/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/server.py", line 231, in serve_inner
    model = get_model(
  File "/opt/conda/lib/python3.10/site-packages/text_generation_server/models/__init__.py", line 1064, in get_model
    raise NotImplementedError("sharded is not supported for AutoModel")
NotImplementedError: sharded is not supported for AutoModel
 rank=3
2024-08-25T21:25:56.550031Z ERROR text_generation_launcher: Shard 3 failed to start
2024-08-25T21:25:56.550058Z  INFO text_generation_launcher: Shutting down shards
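For context, the error comes from TGI's model dispatch: when an architecture has no native implementation, the server falls back to a plain transformers AutoModel, which cannot be sharded across GPUs. A minimal sketch of that dispatch logic (the function and the supported set here are illustrative stand-ins, not TGI's actual code):

```python
# Illustrative sketch only: the supported set and return values are
# hypothetical stand-ins, not TGI's real get_model implementation.
def get_model(architecture: str, sharded: bool) -> str:
    # Architectures with a native (shardable) implementation.
    natively_supported = {"LlamaForCausalLM", "MixtralForCausalLM"}
    if architecture in natively_supported:
        return f"native sharded implementation for {architecture}"
    # Anything else (e.g. PhiMoEForCausalLM) falls back to AutoModel,
    # which has no tensor-parallel sharding support.
    if sharded:
        raise NotImplementedError("sharded is not supported for AutoModel")
    return f"AutoModel fallback for {architecture}"

print(get_model("PhiMoEForCausalLM", sharded=False))
```

Launching with `--num-shard 1` would sidestep the sharded path, though a MoE model of this size likely needs multiple GPUs anyway, so native support is the real fix.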

Your contribution

I can test any PR.

ErikKaum commented 3 weeks ago

Thanks for reporting this @maziyarpanahi 👍

We don't have a lot of extra bandwidth at the moment, but we might prioritize adding this model.

Also, as a note: thumbs-up or similar reactions on an issue indicate demand for a model and are a signal for us to prioritize it. :)