Closed: abhasin14 closed this issue 3 weeks ago
`moe-kernels` is an optional install, so we should indeed import the module conditionally. Will make a PR to fix this. Thanks for reporting!
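For context, guarding an optional dependency behind a conditional import usually looks like the minimal sketch below. This is an illustration only, not the actual text-generation-inference code; the `HAS_MOE_KERNELS` flag and `require_moe_kernels` helper are assumptions made up for the example.

```python
# Minimal sketch of conditionally importing an optional kernel package.
# Illustrative only; not the actual text-generation-inference code.
try:
    import moe_kernels  # optional extra; may not be installed

    HAS_MOE_KERNELS = True
except ImportError:
    moe_kernels = None
    HAS_MOE_KERNELS = False


def require_moe_kernels() -> None:
    """Raise a clear error only when the optional code path is actually used."""
    if not HAS_MOE_KERNELS:
        raise ImportError(
            "moe_kernels is not installed; MoE models are unavailable. "
            "Install the optional moe-kernels package to enable them."
        )
```

The point of the pattern is that a missing optional package degrades to a warning at import time, and only raises if the MoE code path is actually exercised.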
Made mandatory and installed through `make install` in #2632, so this should be fixed in the next release. Feel free to reopen if the issue occurs after the next release.
Task: Flash Attention installation from source. [Completed]
Run: TGI 2.3.1 with models that have Flash Attention support. [The issue does not occur in TGI 2.2.0.]
Error:

```
2024-10-08T09:26:27.562016Z  INFO text_generation_launcher: Using prefix caching = True
2024-10-08T09:26:27.562042Z  INFO text_generation_launcher: Using Attention = flashinfer
2024-10-08T09:26:28.345909Z  WARN text_generation_launcher: Could not import Flash Attention enabled models: No module named 'moe_kernels'
2024-10-08T09:26:28.555288Z  WARN text_generation_launcher: Could not import Mamba: No module named 'causal_conv1d'
2024-10-08T09:26:29.238808Z ERROR text_generation_launcher: Error when initializing model
Traceback (most recent call last):
  File "tgi_new_env/bin/text-generation-server", line 8, in <module>
    sys.exit(app())
  File "tgi_new_env/lib/python3.9/site-packages/typer/main.py", line 311, in __call__
    return get_command(self)(*args, **kwargs)
  File "tgi_new_env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "tgi_new_env/lib/python3.9/site-packages/typer/core.py", line 778, in main
    return _main(
  File "tgi_new_env/lib/python3.9/site-packages/typer/core.py", line 216, in _main
    rv = self.invoke(ctx)
  File "tgi_new_env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/remote/vg_llm/anmolb/tgi/tgi_new_env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "tgi_new_env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "tgi_new_env/lib/python3.9/site-packages/typer/main.py", line 683, in wrapper
    return callback(**use_params)  # type: ignore
  File "text-generation-inference-2.3.1/server/text_generation_server/cli.py", line 109, in serve
    server.serve(
  File "/text-generation-inference-2.3.1/server/text_generation_server/server.py", line 280, in serve
    asyncio.run(
  File "/usr/lib64/python3.9/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 634, in run_until_complete
    self.run_forever()
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 601, in run_forever
    self._run_once()
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 1905, in _run_once
    handle._run()
  File "/usr/lib64/python3.9/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
```
Output of `pip show flash-attn`, confirming Flash Attention is installed:
```
Name: flash_attn
Version: 2.6.3
Summary: Flash Attention: Fast and Memory-Efficient Exact Attention
Home-page: https://github.com/Dao-AILab/flash-attention
Author: Tri Dao
Author-email: tri@tridao.me
License:
Location: tgi_new_env/lib/python3.9/site-packages/flash_attn-2.6.3-py3.9-linux-x86_64.egg
Requires: einops, torch
Required-by:
```
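Until the fix lands, one way to see at a glance which of the optional kernel packages are importable in the environment is a short probe like the sketch below; the module names are taken from the launcher warnings above, and the snippet is a diagnostic aid, not part of TGI itself.

```python
# Quick diagnostic: probe the optional modules mentioned in the launcher
# warnings above. Illustrative sketch, not part of text-generation-inference.
import importlib.util

for name in ("flash_attn", "moe_kernels", "causal_conv1d"):
    # find_spec returns None when the top-level module cannot be located
    status = "found" if importlib.util.find_spec(name) else "MISSING"
    print(f"{name}: {status}")
```

In the environment above this would report `flash_attn: found` but `moe_kernels: MISSING` and `causal_conv1d: MISSING`, matching the two warnings in the log.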