Open · Deeksha-20-99 opened this issue 2 months ago
:octocat: cibot: Thank you for posting issue #2561. The person in charge will reply soon.
We also wanted to ask whether we can run NNTrainer on a commercial off-the-shelf GPU. We currently have an NVIDIA A6000.
GPU support in NNTrainer is a work in progress; I expect LLMs to run on GPU around May~June (e.g., https://github.com/nnstreamer/nntrainer/pull/2535 / https://github.com/nnstreamer/nntrainer/pulls?q=is%3Apr+author%3As-debadri ). @s-debadri has been actively contributing GPU-related code.
This is based on OpenCL because we target GPUs in embedded devices (mobile, TV, home appliances, ...), not servers with powerful A100/H100/B100 GPUs.
As long as a GPU supports OpenCL, it should work, though not as efficiently as CUDA does on NVIDIA GPUs.
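A quick way to confirm the OpenCL path is usable on your machine is to enumerate GPU devices. This is a minimal standalone sketch, not NNTrainer code; it assumes the OpenCL headers and ICD loader are installed (compile with `g++ check_opencl.cpp -lOpenCL`):

```cpp
#include <CL/cl.h>
#include <cstdio>

int main() {
  // Ask how many OpenCL platforms the ICD loader can see.
  cl_uint num_platforms = 0;
  if (clGetPlatformIDs(0, nullptr, &num_platforms) != CL_SUCCESS ||
      num_platforms == 0) {
    std::printf("No OpenCL platforms found.\n");
    return 1;
  }
  cl_platform_id platform;
  clGetPlatformIDs(1, &platform, nullptr);

  // Look specifically for a GPU device on the first platform.
  cl_device_id device;
  if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr) !=
      CL_SUCCESS) {
    std::printf("Platform has no OpenCL GPU device.\n");
    return 1;
  }
  char name[256];
  clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, nullptr);
  std::printf("OpenCL GPU found: %s\n", name);
  return 0;
}
```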
Do you have any recommendations for benchmarks we could run to evaluate the results of LLaMA execution with NNTrainer?
Must-have metrics:
- peak memory consumption
- first-token latency
- per-token latency after the first token is emitted (i.e., "throughput")

Good-to-have metrics:
- energy consumption (J) for a given number of input tokens
- throughput under given power (W) and thermal budgets
- compute resource (CPU, GPU) utilization statistics
- average and peak memory (DRAM) traffic

These additional metrics give an idea of how the model would behave on actual user devices: battery consumption, performance when throttled by temperature, performance while other apps are running, and so on. A minimal timing sketch is shown below.
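For the must-have numbers, wrapping the generation loop with timers is enough to get started. This is a hedged sketch, not NNTrainer code: `generate_next_token()` is a hypothetical stand-in for your actual decode call, and peak RSS is read via `getrusage()` (Linux):

```cpp
#include <chrono>
#include <cstdio>
#include <sys/resource.h>

// Hypothetical stand-in for one decode step; replace with the actual
// NNTrainer inference call for your model.
static void generate_next_token() { /* model forward pass */ }

int main() {
  using clock = std::chrono::steady_clock;
  const int n_tokens = 128;

  auto t0 = clock::now();
  generate_next_token();              // prefill + first token
  auto t1 = clock::now();
  for (int i = 1; i < n_tokens; ++i)  // remaining decode steps
    generate_next_token();
  auto t2 = clock::now();

  double first_ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
  double rest_ms  = std::chrono::duration<double, std::milli>(t2 - t1).count();

  struct rusage ru;
  getrusage(RUSAGE_SELF, &ru);        // ru_maxrss is in KB on Linux

  std::printf("first-token latency: %.1f ms\n", first_ms);
  std::printf("per-token latency  : %.2f ms (%.1f tokens/s)\n",
              rest_ms / (n_tokens - 1), (n_tokens - 1) * 1000.0 / rest_ms);
  std::printf("peak memory (RSS)  : %ld KB\n", ru.ru_maxrss);
}
```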
Here we are not able to find a correlation between the input and the output sequence, so we wanted to check how we should interpret the results. Also, when setting the locale we encounter a segmentation fault and would like to know how to resolve it.
@lhs8928 @baek2sm ?
Thank you to the team for fixing the issue in that commit. We were able to get past the segmentation fault and run the LLaMA model. We got the output shown in the images, but we are still unable to interpret what is printed.
I wonder whether you changed the configuration for the 7B model in HuggingFace; the current implementation is for the 1B. Do you want to use Applications/LLaMA as a kind of chatbot? Then I think it needs some fixes as well. As you can see in the code, it just takes the prefill context and generates the output once. For a chatbot-style task, we need some iteration (it is not difficult, though) to keep the KV cache alive; a sketch of such a loop follows below.
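To illustrate the iteration mentioned above: instead of one prefill-and-generate pass, the session object (and with it the KV cache) stays alive across turns, and each user message is appended as an incremental prefill. This is a minimal sketch under assumed names; `Session`, `prefill()`, `decode_token()`, and `eos()` are hypothetical, not NNTrainer API:

```cpp
#include <iostream>
#include <string>

struct Session {
  // Holds model state plus the KV cache; never reset between turns.
  void prefill(const std::string &text) { /* tokenize turn, fill KV cache */ }
  std::string decode_token() { /* one decode step reusing cached KV */ return ""; }
  bool eos() const { return true; /* stop condition for this turn */ }
};

int main() {
  Session session;                  // KV cache lives as long as this object
  std::string user_input;
  while (std::getline(std::cin, user_input)) {
    session.prefill(user_input);    // incremental prefill of the new turn
    do {
      std::cout << session.decode_token();
    } while (!session.eos());       // generate until end-of-sequence
    std::cout << "\n";
  }
  return 0;
}
```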
> Here we are not able to find a correlation between the input and the output sequence, so we wanted to check how we should interpret the results. Also, when setting the locale we encounter a segmentation fault and would like to know how to resolve it.
We will check and let you know.
Thank you for the clarification. We have been using "meta-llama/Llama-2-7b-chat-hf", which is 7B. We plan to switch to "TinyLlama/TinyLlama-1.1B-Chat-v1.0"; is this the recommended one? If not, is there a recommended model for the LLaMA application?
We will check the models, including TinyLlama. The current implementation targets tasks like summarization, tone conversion, etc. However, TinyLlama does not seem to be tokenizer-compatible with our implementation. Let us check, and we will let you know.
> Here we are not able to find a correlation between the input and the output sequence, so we wanted to check how we should interpret the results. Also, when setting the locale we encounter a segmentation fault and would like to know how to resolve it.
> Do you have any recommendations for benchmarks we could run to evaluate the results of LLaMA execution with NNTrainer?
> We also wanted to ask whether we can run NNTrainer on a commercial off-the-shelf GPU. We currently have an NVIDIA A6000.
Progress update by Professor Hokeun Kim (https://github.com/hokeun) and his student Deeksha Prahlad (https://github.com/Deeksha-20-99).