mgonzs13 / llama_ros

llama.cpp (GGUF LLMs) and llava.cpp (GGUF VLMs) for ROS 2
MIT License

Issue building with CUDA enabled inside a docker container on Jetson Orin NX #8

Open kyle-redyeti opened 2 months ago

kyle-redyeti commented 2 months ago

I have been trying to compile the llama_ros package with these two lines of the CMake file uncommented:

```cmake
option(LLAMA_CUDA "llama: use CUDA" ON)
add_compile_definitions(GGML_USE_CUDA)
```
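
As a side note, the same switch can often be passed at configure time instead of editing the CMake file; a minimal sketch, assuming a standard colcon workspace and that the vendored llama.cpp honours the option (newer revisions rename it to GGML_CUDA):

```bash
# Sketch: enable the CUDA backend at configure time instead of editing CMakeLists.txt.
# Depending on the vendored llama.cpp revision, the option is LLAMA_CUDA or GGML_CUDA.
cd ~/ros2_ws
colcon build --cmake-args -DLLAMA_CUDA=ON
```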

I receive errors referencing:

```
/usr/bin/ld: CMakeFiles/llava.dir/clip.cpp.o: in function `clip_model_load':
clip.cpp:(.text+0x214f8): undefined reference to `ggml_backend_cuda_init'
undefined reference to `ggml_backend_cuda_host_buffer_type'
undefined reference to `ggml_backend_cuda_get_device_count'
undefined reference to `ggml_backend_cuda_reg_devices'
```
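
The ggml_backend_cuda_* symbols are defined in ggml's CUDA backend, which is only compiled when llama.cpp itself is configured with the CUDA option; defining GGML_USE_CUDA on its own only makes clip.cpp call into that backend. A rough sketch of the relationship (illustrative, not the package's exact CMakeLists.txt; the subdirectory name is hypothetical):

```cmake
# Illustrative only: the define makes clip.cpp *reference* the CUDA backend,
# but the symbols only exist if ggml/llama.cpp is actually built with CUDA.
option(LLAMA_CUDA "llama: use CUDA" ON)   # must be ON when the vendored llama.cpp is configured
add_compile_definitions(GGML_USE_CUDA)    # enables the #ifdef GGML_USE_CUDA paths in clip.cpp

add_subdirectory(llama_cpp)               # hypothetical path to the vendored llama.cpp
```

A stale CMake cache from an earlier non-CUDA build can cause the same errors, so removing the workspace's build/ and install/ directories before rebuilding is worth trying.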

Trying to find similar errors I came across this from the Llama.cpp repo: https://github.com/ggerganov/llama.cpp/issues/5389

Also, it looks like there was some active development on that repository today, so I am not sure if that could be related as well.

It looks like CUDA is available (I can't find the exact command I ran, but it showed CUDA devices available inside the Docker container).

Also, I am starting the container with:

```bash
sudo docker run -it --net=host --runtime nvidia --gpus all ros2-mgonzs-llama
```

which should be giving CUDA access to the container...
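
For a quick sanity check that the container really sees the GPU and the CUDA toolkit, something like the following can help (a sketch; nvidia-smi ships with JetPack 6 / L4T r36 images but not with older JetPack releases):

```bash
# Run inside the container started above
nvcc --version                        # CUDA toolkit visible?
nvidia-smi                            # reports the Orin GPU on JetPack 6 / r36
ls /usr/local/cuda/lib64/libcudart*   # CUDA runtime libraries mounted by the nvidia runtime?
```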

Any ideas?

mgonzs13 commented 1 month ago

Hey @kyle-redyeti, can you check your nvcc version (nvcc --version)? I was able to compile with CUDA on a Jetson Orin some time ago; let me check whether it is still possible for me.

kyle-redyeti commented 1 month ago

```
$ nvcc --version
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
```

kyle-redyeti commented 1 month ago

I am using a Jetson Orin NX running JetPack 6.

```
$ apt show nvidia-jetpack -a
Description: NVIDIA Jetpack Meta Package
Package: nvidia-jetpack
Version: 6.0+b87
```

I am also starting from a ros-llm Docker container:

```dockerfile
FROM dustynv/ros:humble-llm-r36.3.0
```

I figured the llm container might have more of the things I would need (I was also going to try to wrap NanoLLM in a ROS 2 node), but I will try the regular ros:humble-ros-base image instead to see if anything is different...

kyle-redyeti commented 1 month ago

I was able to get it to build without major errors (there is a warning that LLAMA_CUDA is deprecated and to switch to GGML_CUDA, but I think that is going to take more than just changing that line...). I started with dustynv/ros:humble-ros-base-l4t-r36.2.0 as my base image and then installed from a fork of your repo so I could have those two lines uncommented. However, I am not seeing any speed improvement when running the marcoroni.launch.
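
A minimal sketch of that setup (the image tag is the one mentioned above; the fork URL, workspace path, and ROS setup script location are assumptions and may need adjusting for dustynv images):

```dockerfile
# Sketch: build llama_ros with CUDA enabled on a dustynv ROS base image (Jetson, L4T r36)
FROM dustynv/ros:humble-ros-base-l4t-r36.2.0

WORKDIR /ros2_ws/src
# Fork of llama_ros with the LLAMA_CUDA / GGML_USE_CUDA lines uncommented
RUN git clone https://github.com/<your-fork>/llama_ros.git

WORKDIR /ros2_ws
# The setup script path can differ between dustynv images (ROS built from source vs. packages)
RUN . /opt/ros/humble/setup.sh && colcon build
```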

Hopefully you can verify that everything is still working correctly for you!
I am now on to trying to make your whole chatbot pipeline work!!!

Thanks!

Kyle

mgonzs13 commented 1 month ago

Hey @kyle-redyeti, I have updated llama_ros to change LLAMA_CUDA to GGML_CUDA. I still have to test that Docker image on my Jetson Orin, but the last time I tried llama.cpp on a Jetson, the CUDA version didn't really improve performance. There may be more parameters to tune to optimize llama.cpp for the Jetson.
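
For anyone following along, the rename is essentially a one-for-one swap of the CMake option quoted at the top of this issue (a sketch; the compile definition shown simply mirrors those original lines):

```cmake
# Sketch of the rename: upstream llama.cpp deprecated LLAMA_CUDA in favour of GGML_CUDA
option(GGML_CUDA "ggml: use CUDA" ON)
add_compile_definitions(GGML_USE_CUDA)
```

The corresponding colcon invocation then becomes colcon build --cmake-args -DGGML_CUDA=ON.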

kyle-redyeti commented 1 month ago

@mgonzs13 I will give it a try... I am now trying to get the colcon build to come back with no warnings (and with warnings treated as errors) so that my Docker build will work. I know that when I was testing llama_cpp_python on my old Xavier AGX I saw a fairly large improvement when enabling CUDA (if I remember correctly), so I expected the same from the Orin... I will keep trying. Thanks!
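
If the goal is a build that fails on any warning, one common approach is to forward the flags through CMake; a sketch, keeping in mind that individual packages may override these flags:

```bash
# Sketch: treat compiler warnings as errors across the workspace
colcon build --cmake-args -DCMAKE_CXX_FLAGS="-Wall -Wextra -Werror"
```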

mgonzs13 commented 1 month ago

Hey @kyle-redyeti, I have tested the new llama_ros version on my Jetson Orin using JetPack 6.0 and nvcc 12.2, and, though it is not as fast as in my other tests with an RTX 4070 and a GTX 1060, the token generation does speed up. I am also thinking about modifying chatbot_ros to use the TTS as soon as the LLM streams tokens, but for now I am waiting for the whisper.cpp update.
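
On the tuning point, one thing usually worth verifying on Jetson is that all model layers are actually offloaded to the GPU. A sketch using llama.cpp's own tools for comparison (binary and flag names vary across llama.cpp versions, and the matching llama_ros launch parameter is not shown here):

```bash
# Sketch: benchmark raw llama.cpp on the Jetson with full GPU offload
./llama-bench -m model.gguf -ngl 99   # -ngl / --n-gpu-layers: layers offloaded to CUDA
```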