opea-project / GenAIComps

GenAI components at micro-service level; GenAI service composer to create mega-service
Apache License 2.0

Llama 3.1 405B FP8 support #383

Open endomorphosis opened 1 month ago

endomorphosis commented 1 month ago

I have been staging some updates testing the tgi-gaudi software with Llama 3.1 405B FP8. I am waiting for optimum-habana to approve the PR; after that I will submit a PR for huggingface/tgi_gaudi and a PR for the TGI microservice here.

I got it running on Xeon with llama.cpp (which is what Ollama is based on) at 1 tok/s on Sapphire Rapids. I am going to test speculative decoding with Llama 3.1 8B as the draft model, which should improve performance 10-20x depending on how many tokens the draft model can complete. However, Ollama is broken, and that will need to be investigated further.
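
For reference, a minimal sketch of draft-model speculative decoding via the transformers assisted-generation API (`assistant_model`); the model IDs, dtype, and token budget here are illustrative assumptions, not the exact setup described above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model IDs; the 405B target would not actually fit on a single host like this.
target_id = "meta-llama/Meta-Llama-3.1-405B-Instruct"
draft_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.bfloat16, device_map="auto")
# The draft model shares the Llama 3.1 tokenizer, which assisted generation requires.
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt").to(target.device)
# assistant_model enables assisted (speculative) generation: the draft proposes a short
# block of tokens and the target verifies them in a single forward pass.
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```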

endomorphosis commented 1 month ago

[screenshot attached in the original comment]

endomorphosis commented 1 month ago

I have thoroughly gone through all of the examples and interfaces for:

- optimum-habana
- Intel Neural Compressor (3.0 and 2.4)
- tgi-gaudi

Currently, these libraries are in a state of disrepair for everything but bf16, as a result of a lack of unit testing, integration testing, and regression testing. The examples do not work because they were written for previous versions of the libraries that are no longer compatible with the current versions of the others. The only quantization path that does work is compile-time quantization, which does not reduce the number of devices needed to run a model, although it does increase inference speed. As a result, it is currently impossible to run Llama 3.1 405B on a single node, purely because the software packages are not being maintained in a functioning state.
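
To be concrete about the part that does work, here is a minimal sketch of post-training quantization with the Intel Neural Compressor 2.x PyTorch API, using a placeholder model and calibration loader as stand-ins (the Gaudi FP8 flow is a separate path, and that is the part that is broken):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from neural_compressor import PostTrainingQuantConfig, quantization

# Placeholder FP32 model and calibration data; in practice these would be the real
# model and a representative calibration set.
fp32_model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
calib_loader = DataLoader(TensorDataset(torch.randn(64, 16), torch.zeros(64, dtype=torch.long)), batch_size=8)

conf = PostTrainingQuantConfig(approach="static")  # static post-training quantization with calibration
q_model = quantization.fit(model=fp32_model, conf=conf, calib_dataloader=calib_loader)
q_model.save("./quantized_model")
```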

I have spent three days on this endeavor so far, and I am unwilling to take the time needed to become a maintainer of those libraries, even though I do want to reduce hallucinations in my language modeling tasks. I am also being asked by @jaanli to finish my AGPL edge-oriented MLOps infrastructure package more quickly so that he can migrate away from Google's TPU cloud.

https://github.com/endomorphosis/ipfs_transformers/issues/1#issuecomment-2282238714

jaanli commented 1 month ago

Thanks so much @endomorphosis, on behalf of @onefact! I'm giving a talk on Thursday at https://duckdb.org/2024/08/15/duckcon5, so it would be great if it's possible to demo any edge models.

Even just a small encoder-only transformer like what I did before: https://arxiv.org/abs/1904.05342 (let me know if you need HF links :)
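
Something along these lines would be plenty; a minimal sketch with transformers, assuming a stand-in small encoder checkpoint (`prajjwal1/bert-tiny`) rather than the actual model for the demo:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "prajjwal1/bert-tiny"  # stand-in small encoder; swap in the real demo checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id).eval()

texts = ["Patient admitted with chest pain.", "Follow-up visit in two weeks."]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_dim)

# Mean-pool over non-padding tokens to get one embedding vector per sentence.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)
```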

endomorphosis commented 1 month ago

> Thanks so much @endomorphosis, on behalf of @onefact! I'm giving a talk on Thursday at https://duckdb.org/2024/08/15/duckcon5, so it would be great if it's possible to demo any edge models.
>
> Even just a small encoder-only transformer like what I did before: https://arxiv.org/abs/1904.05342 (let me know if you need HF links :)

I have no idea what hardware you are running it on.

jaanli commented 1 month ago

> Thanks so much @endomorphosis, on behalf of @onefact! I'm giving a talk on Thursday at https://duckdb.org/2024/08/15/duckcon5, so it would be great if it's possible to demo any edge models. Even just a small encoder-only transformer like what I did before: https://arxiv.org/abs/1904.05342 (let me know if you need HF links :)
>
> I have no idea what hardware you are running it on.

Ah yes, sorry: an iPhone 15 Pro Max with the latest firmware.

endomorphosis commented 2 weeks ago

See https://github.com/HabanaAI/vllm-fork/pull/144. I am going to continue attempting to get Llama 405B working with speculative decoding via Llama 8B, and to process some Wikipedia datasets and embeddings.
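
As a rough sketch of the target configuration, this is what draft-model speculative decoding looks like through the vLLM offline API; the model IDs, tensor_parallel_size, and num_speculative_tokens are illustrative assumptions, and the exact arguments depend on the vLLM / vllm-fork version in use:

```python
from vllm import LLM, SamplingParams

# Illustrative settings, not a tested configuration for Gaudi.
llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8",
    speculative_model="meta-llama/Meta-Llama-3.1-8B-Instruct",  # draft model
    num_speculative_tokens=5,
    tensor_parallel_size=8,
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize the first paragraph of the Wikipedia article on DNA."], params)
print(outputs[0].outputs[0].text)
```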