huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

No module named moe_kernel in Flash Attention Installation while compiling TGI2.3.1 #2621

Closed. abhasin14 closed this issue 3 weeks ago.

abhasin14 commented 1 month ago

Task: Flash Attention installation from source. [Completed]
Run: TGI 2.3.1 with models that have Flash Attention enabled. [Issue does not occur in TGI 2.2.0]

Error -

2024-10-08T09:26:27.562016Z INFO text_generation_launcher: Using prefix caching = True
2024-10-08T09:26:27.562042Z INFO text_generation_launcher: Using Attention = flashinfer
2024-10-08T09:26:28.345909Z WARN text_generation_launcher: Could not import Flash Attention enabled models: No module named 'moe_kernels'
2024-10-08T09:26:28.555288Z WARN text_generation_launcher: Could not import Mamba: No module named 'causal_conv1d'
2024-10-08T09:26:29.238808Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "tgi_new_env/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "tgi_new_env/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "tgi_new_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "tgi_new_env/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "tgi_new_env/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "tgi_new_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/remote/vg_llm/anmolb/tgi/tgi_new_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "tgi_new_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "tgi_new_env/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "text-generation-inference-2.3.1/server/text_generation_server/cli.py", line 109, in serve
    server.serve(
  File "/text-generation-inference-2.3.1/server/text_generation_server/server.py", line 280, in serve
    asyncio.run(
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)

File "text-generation-inference-2.3.1/server/text_generation_server/server.py", line 235, in serve_inner model = get_model_with_lora_adapters( File "text-generation-inference-2.3.1/server/text_generation_server/models/init.py", line 1277, in get_model_with_lora_adapters model = get_model( File "text-generation-inference-2.3.1/server/text_generation_server/models/init.py", line 843, in get_model raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Llama")) NotImplementedError: Sharded Llama requires Flash Attention enabled models.

The output of pip show flash-attn, confirming that Flash Attention itself is installed:

Name: flash_attn
Version: 2.6.3
Summary: Flash Attention: Fast and Memory-Efficient Exact Attention
Home-page: https://github.com/Dao-AILab/flash-attention
Author: Tri Dao
Author-email: tri@tridao.me
License:
Location: tgi_new_env/lib/python3.9/site-packages/flash_attn-2.6.3-py3.9-linux-x86_64.egg
Requires: einops, torch
Required-by:
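So flash-attn is present; the WARN line in the log shows that the flash-attention model classes only fail to import because moe_kernels is missing. TGI guards that import, and once it fails, Flash Attention is treated as unavailable and get_model refuses to build a sharded Llama. A minimal sketch of that failure path, with simplified names (not the exact text-generation-inference source):

```python
import logging

# Sketch of the guarded import seen in the log: one missing optional
# dependency (moe_kernels) disables every flash-attention model class.
FLASH_ATT_ERROR_MESSAGE = "{} requires Flash Attention enabled models."

try:
    # In TGI 2.3.1 this import chain pulls in moe_kernels, so the except
    # branch runs even though flash-attn itself is installed.
    from text_generation_server.models.flash_causal_lm import FlashCausalLM

    FLASH_ATTENTION = True
except ImportError as e:
    logging.warning("Could not import Flash Attention enabled models: %s", e)
    FLASH_ATTENTION = False


def get_model(model_type: str, sharded: bool):
    """Simplified stand-in for the model selection in models/__init__.py."""
    if model_type == "llama" and FLASH_ATTENTION:
        return FlashCausalLM
    if model_type == "llama" and sharded:
        # This is the NotImplementedError from the traceback above.
        raise NotImplementedError(FLASH_ATT_ERROR_MESSAGE.format("Sharded Llama"))
    raise ValueError(f"Unsupported configuration: {model_type}")
```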

danieldk commented 1 month ago

moe-kernels is an optional install, so we should indeed import the module conditionally. Will make a PR to fix this. Thanks for reporting!
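A minimal sketch of that conditional import, assuming moe_kernels exposes a fused_moe entry point (illustrative only; the actual moe-kernels API and the eventual PR may differ):

```python
# Guard the optional dependency so that only MoE code paths require it;
# dense flash-attention models keep importing even when moe_kernels is absent.
try:
    from moe_kernels.fused_moe import fused_moe  # assumed entry point

    HAS_MOE_KERNELS = True
except ImportError:
    fused_moe = None
    HAS_MOE_KERNELS = False


class SparseMoELayer:
    """Hypothetical MoE layer: fail at construction time, not at import time."""

    def __init__(self, num_experts: int, top_k: int):
        if not HAS_MOE_KERNELS:
            raise ImportError(
                "moe_kernels is required for MoE models; install it to use this layer."
            )
        self.num_experts = num_experts
        self.top_k = top_k
```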

danieldk commented 3 weeks ago

Made mandatory and installed through make install in #2632, so this should be fixed in the next release. Feel free to reopen if the issue still occurs after the next release.