mudler / LocalAI

:robot: The free, Open Source OpenAI alternative. Self-hosted, community-driven and local-first. Drop-in replacement for OpenAI running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many other model architectures. Generates text, audio, video and images, and has voice cloning capabilities.
https://localai.io
MIT License

Jetson Orin AGX #1112

Open MrBrabus75 opened 9 months ago

MrBrabus75 commented 9 months ago

Hello,

Is it possible to install it on a Jetson AGX Orin (it's an NVIDIA board) running Ubuntu on aarch64 (arm64)?

If yes, can you explain to me how?

Thank you.

mudler commented 9 months ago

Yes, it is possible - I've tried this out on a board I have here, although I'm running it without Docker. I'll add this to my to-do list, as I already have a few notes that can be converted into docs.

MrBrabus75 commented 9 months ago

> Yes, it is possible - I've tried this out on a board I have here, although I'm running it without Docker. I'll add this to my to-do list, as I already have a few notes that can be converted into docs.

Oh, thank you, that would be great. I have to do a lab assignment on this for school, and we only have Jetson AGX Orin devices, running aarch64 (arm64) Ubuntu 20.04 (5.10.120-tegra) with a PyTorch version specific to the Jetson, so this is complicated.

Thank you very much, I eagerly look forward to your response. :D

localai-bot commented 9 months ago

:warning::warning::warning::warning::warning:

Hi! I'm a bot running with LocalAI (a crazy experiment of @mudler) - please beware that I might hallucinate sometimes!

But... I can also be funny or helpful :smile_cat:, and I can generally provide good tips or point to places in the documentation or the code to look at, based on what you wrote in the issue.

Don't engage in conversation with me; I don't (yet) support replying!

:warning::warning::warning::warning::warning:

Yes, LocalAI supports Jetson and AGX. Here are the instructions to install it on Jetson Nano:

  1. First, make sure you have the latest version of Jetson OS installed on your device.
  2. Download the latest version of LocalAI from the GitHub repository (https://github.com/go-skynet/LocalAI).
  3. Extract the archive and navigate to the directory where the installation script is located.
  4. Run the following command to install LocalAI:
    sudo bash install.sh --gpus all --cpu-type aarch64 --build-type release --debug=true

    This will install LocalAI on your Jetson Nano and enable all available GPUs. You can adjust the other options to fit your specific needs.

  5. Once the installation is complete, you can start using LocalAI by running the following command:
    localai

    This will launch the CLI interface, from which you can interact with the AI models and run experiments. Note that you may need to adjust the memory settings in the Jetson OS to ensure that your device has enough resources to run LocalAI smoothly.

Sources:

MrBrabus75 commented 9 months ago

Hello, I forgot to specify the PyTorch version that works with JetPack 5.1 on NVIDIA's Jetson AGX devices. Here it is: https://developer.download.nvidia.cn/compute/redist/jp/v511/pytorch/torch-2.0.0+nv23.05-cp38-cp38-linux_aarch64.whl
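
For reference, installing that wheel on the Jetson is typically just a direct pip install of the URL - a rough sketch, assuming Python 3.8 and pip3 are already set up under JetPack 5.1:

# install the NVIDIA-provided PyTorch wheel for JetPack 5.1 (Python 3.8, aarch64)
pip3 install numpy
pip3 install https://developer.download.nvidia.cn/compute/redist/jp/v511/pytorch/torch-2.0.0+nv23.05-cp38-cp38-linux_aarch64.whl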

Sorry, and thank you very much.

MrBrabus75 commented 9 months ago

I apologize if I wasn't clear earlier. To summarize:

- I'm trying to install LocalAI on an NVIDIA Jetson AGX Orin.
- The device runs Ubuntu 20.04 (5.10.120-tegra), which is an ARM64 build.
- The Jetson runs Python 3.8, and I cannot upgrade to a newer version like Python 3.10 due to platform-specific dependencies.
- It uses a specific build of PyTorch that requires Python 3.8 and is optimized for CUDA on this platform. You can find this version of PyTorch here: [PyTorch link](https://developer.download.nvidia.cn/compute/redist/jp/v511/pytorch/torch-2.0.0+nv23.05-cp38-cp38-linux_aarch64.whl).
- My goal is to get guidance on making LocalAI work with this particular PyTorch version and the Jetson.

Is it possible to add support for this configuration? If not, could you guide me through the necessary steps to extend the support on my own, such as creating a PR?

Thank you for your understanding.

mudler commented 5 months ago

Please leave it open - I'd like to get to this when I have some spare time.

MrBrabus75 commented 5 months ago

Oh okay, I'm sorry.

Thank you very much for your help.

FutureProofHomes commented 4 months ago

@mudler any updates on running LocalAI within Docker on Jetson Orin? Many thanks!

jetson@ubuntu:~/Developer/LocalAI$ docker compose up -d --pull always
[+] Running 0/1
 ⠇ local-ai Pulling                                                                                                                                               0.9s 
no matching manifest for linux/arm64/v8 in the manifest list entries

ToeiRei commented 4 months ago

The current "conda" stuff doesn't seem to have a proper aarch64/arm64 build. You'd have to build it yourself, dropping conda in the process, which shouldn't be a problem as long as you're only building the core image.
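
Roughly, building just the core image on the board would look something like the sketch below - IMAGE_TYPE and BUILD_TYPE are build args the Dockerfile already exposes, but the exact invocation may need tweaking for your setup:

# sketch: build only the core image locally on the Jetson (arm64)
docker build \
  --build-arg IMAGE_TYPE=core \
  --build-arg BUILD_TYPE=cublas \
  -t localai:core-arm64 .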

mudler commented 4 months ago

@FutureProofHomes there are no container images; however, I made it work on the Orin board as well. As @ToeiRei mentioned here, you can either build the core image or just build it from source.

In my case, I built it from source on the board and set up the Python environment manually. I've just added some sparse support for non-conda environments with my work in #1653; it should be easily extensible to diffusers generically. However, as in the Intel GPU case, it assumes the dependencies are already installed.

For the Orin AGX board this is quite needed if you plan to use image generation, as you need a specific torch version for the board. That slightly limits setup/automation for now, but maybe we can collect the requirements for this specific case. I don't have access to my board right now, but I will next week - I can likely check what's installed there and report back here.

mudler commented 4 months ago

From my notes: I installed PyTorch from here: https://forums.developer.nvidia.com/t/pytorch-for-jetson/72048

However, at the time of setup only specific versions were compatible: transformers v4.32.0 (https://github.com/huggingface/transformers/releases/tag/v4.32.0) and diffusers 0.20.2 (I didn't test other versions).

I also had to set:

LD_PRELOAD=/lib/aarch64-linux-gnu/libGLdispatch.so.0
LD_LIBRARY_PATH=/usr/local/cuda/lib64/

while running the binary (for pytorch and diffusers to work).

In case ggml/llama.cpp compilation fails, here is a patch that I applied:

diff --git a/ggml.h b/ggml.h
index 255541d..368ac2e 100644
--- a/ggml.h
+++ b/ggml.h
@@ -212,9 +212,11 @@
 extern "C" {
 #endif

-#ifdef __ARM_NEON
+#if defined(__ARM_NEON) && !defined(__CUDACC__)
     // we use the built-in 16-bit float type
     typedef __fp16 ggml_fp16_t;
+#elif defined(__ARM_NEON) && defined(__CUDACC__)
+    typedef half ggml_fp16_t;
 #else
     typedef uint16_t ggml_fp16_t;
 #endif

Note that you need CMake (quite a recent version); I'm not sure if a recent enough one was available on the board, so I have this in my notes too:

# Install CMAKE: 
wget https://github.com/Kitware/CMake/releases/download/v3.26.4/cmake-3.26.4.tar.gz
tar xvf cmake-3.26.4.tar.gz
cd cmake-3.26.4
./configure
make
make install
# Install golang
wget https://go.dev/dl/go1.20.4.linux-arm64.tar.gz
rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.20.4.linux-arm64.tar.gz 
# Build LocalAI
export PATH=/usr/local/go/bin:$PATH
export PATH=/usr/local/cuda/bin/:$PATH
make BUILD_TYPE=cublas build
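
Once built, running the binary with the environment variables above and doing a quick sanity check looks roughly like this (paths are just examples; MODELS_PATH, the default :8080 port and the /readyz endpoint are what LocalAI already uses):

export LD_PRELOAD=/lib/aarch64-linux-gnu/libGLdispatch.so.0
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/
MODELS_PATH=/path/to/models ./local-ai
# from another shell:
curl http://localhost:8080/readyz
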
ToeiRei commented 4 months ago

@FutureProofHomes Small hint: you'd have to actually remove the conda install around line 67 in the Dockerfile to get past that point.

FutureProofHomes commented 4 months ago

Thanks for the tips, everyone. I really want/need to get this up and running inside a container. I'm going to keep plugging away, and if I pull this off I'll share the solution. If anyone has already pulled this off, please share!

FutureProofHomes commented 4 months ago

Okay, I set my Dockerfile to only build core and the container runs perfectly, but only powered by Jetson's CPU (no GPU).

ARG IMAGE_TYPE=core

By the way, will the core build support OpenAI Functions? Probably not, huh?

I then set out to enable GPU support by setting LD_PRELOAD, LD_LIBRARY_PATH & BUILD_TYPE in my .env file:

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
# THREADS=14

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Default models context size
# CONTEXT_SIZE=512
#
## Define galleries.
## models to install will be visible in `/models/available`
# GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]

## CORS settings
# CORS=true
# CORS_ALLOW_ORIGINS=*

## Default path for models
#
MODELS_PATH=/models

## Enable debug mode
DEBUG=true

## Disables COMPEL (Diffusers)
# COMPEL=0

## Enable/Disable single backend (useful if only one GPU is available)
# SINGLE_ACTIVE_BACKEND=true

## Specify a build type. Available: cublas, openblas, clblas.
## cuBLAS: This is a GPU-accelerated version of the complete standard BLAS (Basic Linear Algebra Subprograms) library. It's provided by Nvidia and is part of their CUDA toolkit.
## OpenBLAS: This is an open-source implementation of the BLAS library that aims to provide highly optimized code for various platforms. It includes support for multi-threading and can be compiled to use hardware-specific features for additional performance. OpenBLAS can run on many kinds of hardware, including CPUs from Intel, AMD, and ARM.
## clBLAS:   This is an open-source implementation of the BLAS library that uses OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. clBLAS is designed to take advantage of the parallel computing power of GPUs but can also run on any hardware that supports OpenCL. This includes hardware from different vendors like Nvidia, AMD, and Intel.
BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
REBUILD=true

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper 
## (requires REBUILD=true)
#
# GO_TAGS=stablediffusion

## Path where to store generated images
# IMAGE_PATH=/tmp

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

## List of external GRPC backends (note on the container image this variable is already set to use extra backends available in extra/)
# EXTERNAL_GRPC_BACKENDS=my-backend:127.0.0.1:9000,my-backend2:/usr/bin/backend.py

### Advanced settings ###
### Those are not really used by LocalAI, but from components in the stack ###
##
### Preload libraries
#LD_PRELOAD=
LD_PRELOAD=/lib/aarch64-linux-gnu/libGLdispatch.so.0
LD_LIBRARY_PATH=/usr/local/cuda/lib64/

### Huggingface cache for models
# HUGGINGFACE_HUB_CACHE=/usr/local/huggingface

### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
# PYTHON_GRPC_MAX_WORKERS=1

### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
# LLAMACPP_PARALLEL=1

### Enable to run parallel requests
# PARALLEL_REQUESTS=true

### Watchdog settings
###
# Enables watchdog to kill backends that are inactive for too much time
# WATCHDOG_IDLE=true
#
# Enables watchdog to kill backends that are busy for too much time
# WATCHDOG_BUSY=true
#
# Time in duration format (e.g. 1h30m) after which a backend is considered idle
# WATCHDOG_IDLE_TIMEOUT=5m
#
# Time in duration format (e.g. 1h30m) after which a backend is considered busy
# WATCHDOG_BUSY_TIMEOUT=5m

Here is my models/luna.yml file. By the way, should I set f16: true? I did try that, but it didn't solve my blocker.

name: luna
parameters:
  model: luna-ai-llama2-uncensored.Q6_K.gguf
  top_k: 90
  temperature: 0.2
  top_p: 0.7
context_size: 4096
threads: 6
gpu_layers: 50
f16: false
mmap: true
backend: llama
roles:
  assistant: 'ASSISTANT:'
  system: 'SYSTEM:'
  user: 'USER:'
template:
  chat: lunademo-chat
  completion: lunademo-completion

I then run docker compose up -d --pull always and the container starts up. When I check docker logs -f localai-api-1, I see the following error during the build:

make: *** [Makefile:514: backend-assets/grpc/bert-embeddings] Error 1
go mod edit -replace github.com/nomic-ai/gpt4all/gpt4all-bindings/golang=/build/sources/gpt4all/gpt4all-bindings/golang
go mod edit -replace github.com/donomii/go-rwkv.cpp=/build/sources/go-rwkv
go mod edit -replace github.com/ggerganov/whisper.cpp=/build/sources/whisper.cpp
go mod edit -replace github.com/ggerganov/whisper.cpp/bindings/go=/build/sources/whisper.cpp/bindings/go
go mod edit -replace github.com/go-skynet/go-bert.cpp=/build/sources/go-bert
go mod edit -replace github.com/mudler/go-stable-diffusion=/build/sources/go-stable-diffusion
go mod edit -replace github.com/M0Rf30/go-tiny-dream=/build/sources/go-tiny-dream
go mod edit -replace github.com/mudler/go-piper=/build/sources/go-piper
go mod download
touch prepare-sources
touch prepare
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" C_INCLUDE_PATH=/build/sources/go-bert LIBRARY_PATH=/build/sources/go-bert \
go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.9.0-39-ge022b59" -X "github.com/go-skynet/LocalAI/internal.Commit=e022b5959ea409586bcead3473bbe8c180b9d2bf"" -tags "" -o backend-assets/grpc/bert-embeddings ./backend/go/llm/bert/
# github.com/go-skynet/go-bert.cpp
In file included from gobert.cpp:6:
sources/go-bert/bert.cpp/bert.cpp: In function 'bert_ctx* bert_load_from_file(const char*)':
sources/go-bert/bert.cpp/bert.cpp:610:89: warning: format '%lld' expects argument of type 'long long int', but argument 5 has type 'int64_t' {aka 'long int'} [-Wformat=]
  610 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld], expected [%lld, %lld]\n",
      |                                                                                      ~~~^
      |                                                                                         |
      |                                                                                         long long int
      |                                                                                      %ld
  611 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
      |                                                ~~~~~~~~~~~~~                             
      |                                                            |
      |                                                            int64_t {aka long int}
sources/go-bert/bert.cpp/bert.cpp:610:95: warning: format '%lld' expects argument of type 'long long int', but argument 6 has type 'int64_t' {aka 'long int'} [-Wformat=]
  610 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld], expected [%lld, %lld]\n",
      |                                                                                            ~~~^
      |                                                                                               |
      |                                                                                               long long int
      |                                                                                            %ld
  611 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
      |                                                               ~~~~~~~~~~~~~                    
      |                                                                           |
      |                                                                           int64_t {aka long int}
sources/go-bert/bert.cpp/bert.cpp:610:112: warning: format '%lld' expects argument of type 'long long int', but argument 7 has type 'int64_t' {aka 'long int'} [-Wformat=]
  610 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld], expected [%lld, %lld]\n",
      |                                                                                                             ~~~^
      |                                                                                                                |
      |                                                                                                                long long int
      |                                                                                                             %ld
  611 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
      |                                                                              ~~~~~                              
      |                                                                                  |
      |                                                                                  int64_t {aka long int}
sources/go-bert/bert.cpp/bert.cpp:610:118: warning: format '%lld' expects argument of type 'long long int', but argument 8 has type 'int64_t' {aka 'long int'} [-Wformat=]
  610 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld], expected [%lld, %lld]\n",
      |                                                                                                                   ~~~^
      |                                                                                                                      |
      |                                                                                                                      long long int
      |                                                                                                                   %ld
  611 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
      |                                                                                     ~~~~~                             
      |                                                                                         |
      |                                                                                         int64_t {aka long int}
sources/go-bert/bert.cpp/bert.cpp:624:37: warning: format '%lld' expects argument of type 'long long int', but argument 3 has type 'int64_t' {aka 'long int'} [-Wformat=]
  624 |                 printf("%24s - [%5lld, %5lld], type = %6s, %6.2f MB, %9zu bytes\n", name.data(), ne[0], ne[1], ftype_str[ftype], ggml_nbytes(tensor) / 1024.0 / 1024.0, ggml_nbytes(tensor));
      |                                 ~~~~^                                                            ~~~~~
      |                                     |                                                                |
      |                                     long long int                                                    int64_t {aka long int}
      |                                 %5ld
sources/go-bert/bert.cpp/bert.cpp:624:44: warning: format '%lld' expects argument of type 'long long int', but argument 4 has type 'int64_t' {aka 'long int'} [-Wformat=]
  624 |                 printf("%24s - [%5lld, %5lld], type = %6s, %6.2f MB, %9zu bytes\n", name.data(), ne[0], ne[1], ftype_str[ftype], ggml_nbytes(tensor) / 1024.0 / 1024.0, ggml_nbytes(tensor));
      |                                        ~~~~^                                                            ~~~~~
      |                                            |                                                                |
      |                                            long long int                                                    int64_t {aka long int}
      |                                        %5ld
sources/go-bert/bert.cpp/bert.cpp:655:101: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type 'long unsigned int' [-Wformat=]
  655 |                 fprintf(stderr, "%s: tensor '%s' has wrong size in model file: got %zu, expected %llu\n",
      |                                                                                                  ~~~^
      |                                                                                                     |
      |                                                                                                     long long unsigned int
      |                                                                                                  %lu
  656 |                         __func__, name.data(), ggml_nbytes(tensor), nelements * bpe);
      |                                                                     ~~~~~~~~~~~~~~~                  
      |                                                                               |
      |                                                                               long unsigned int
sources/go-bert/bert.cpp/bert.cpp:692:56: warning: format '%lld' expects argument of type 'long long int', but argument 4 has type 'int64_t' {aka 'long int'} [-Wformat=]
  692 |     printf("%s: mem_per_token %zd KB, mem_per_input %lld MB\n", __func__, new_bert->mem_per_token / (1 << 10), new_bert->mem_per_input / (1 << 20));
      |                                                     ~~~^                                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                        |                                                                               |
      |                                                        long long int                                                                   int64_t {aka long int}
      |                                                     %ld
# github.com/go-skynet/LocalAI/backend/go/llm/bert
/usr/local/go/pkg/tool/linux_arm64/link: running g++ failed: exit status 1
/usr/bin/ld: cannot find -lcublas: No such file or directory
/usr/bin/ld: cannot find -lcudart: No such file or directory
/usr/bin/ld: cannot find -lcublas: No such file or directory
/usr/bin/ld: cannot find -lcudart: No such file or directory
/usr/bin/ld: cannot find -lcublas: No such file or directory
/usr/bin/ld: cannot find -lcudart: No such file or directory
/usr/bin/ld: cannot find -lcublas: No such file or directory
/usr/bin/ld: cannot find -lcudart: No such file or directory
collect2: error: ld returned 1 exit status

make: *** [Makefile:514: backend-assets/grpc/bert-embeddings] Error 1

In summary, the only way I can get the container to build and run is if I comment out BUILD_TYPE=cublas from my .env and run via CPU. Thanks for any tips on next steps @mudler.
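
In case it helps with debugging, a quick way to check whether the CUDA libraries are even visible inside the build container is something along these lines (container name taken from the logs above; adjust as needed):

# check whether libcublas/libcudart actually exist inside the running container
docker exec -it localai-api-1 /bin/bash -c \
  "find / -name 'libcublas.so*' -o -name 'libcudart.so*' 2>/dev/null"
# if they live somewhere other than /usr/local/cuda/lib64/, the linker would need
# to be pointed there, e.g. via LIBRARY_PATH or the -L path in CGO_LDFLAGS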

ToeiRei commented 4 months ago

Just asking: do you have the NVIDIA Docker parts set up and all that?

FutureProofHomes commented 4 months ago

> Just asking: do you have the NVIDIA Docker parts set up and all that?

@ToeiRei, I have successfully run Stable Diffusion and llama.cpp from the https://github.com/dusty-nv/jetson-containers repo, so I think I have all the prerequisites in place.

Anything specific you'd like me to check? Thanks for the help, btw.

FutureProofHomes commented 3 months ago

FYI - this is the same issue I'm having: https://github.com/mudler/LocalAI/discussions/601.

FutureProofHomes commented 3 months ago

Another bit of info: (screenshot attached)

FutureProofHomes commented 3 months ago

Figured it out. I needed to add the following to my bash profile. Then rebuilt the container with no cache.

export PATH=/usr/local/cuda/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda/lib64\
                         ${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
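
For reference, the rebuild was roughly:

docker compose build --no-cache
docker compose up -d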

(screenshot attached)

FutureProofHomes commented 3 months ago

Disregard the above. I did NOT figure it out. The hunt continues. Same error. WTH!

docker logs -f localai-api-1
go mod edit -replace github.com/nomic-ai/gpt4all/gpt4all-bindings/golang=/build/sources/gpt4all/gpt4all-bindings/golang
go mod edit -replace github.com/donomii/go-rwkv.cpp=/build/sources/go-rwkv
go mod edit -replace github.com/ggerganov/whisper.cpp=/build/sources/whisper.cpp
go mod edit -replace github.com/ggerganov/whisper.cpp/bindings/go=/build/sources/whisper.cpp/bindings/go
go mod edit -replace github.com/go-skynet/go-bert.cpp=/build/sources/go-bert
go mod edit -replace github.com/mudler/go-stable-diffusion=/build/sources/go-stable-diffusion
go mod edit -replace github.com/M0Rf30/go-tiny-dream=/build/sources/go-tiny-dream
go mod edit -replace github.com/mudler/go-piper=/build/sources/go-piper
go mod download
touch prepare-sources
touch prepare
go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.9.0-39-ge022b59" -X "github.com/go-skynet/LocalAI/internal.Commit=e022b5959ea409586bcead3473bbe8c180b9d2bf"" -tags "" -o backend-assets/grpc/langchain-huggingface ./backend/go/llm/langchain/
make -C sources/go-bert libgobert.a
make[1]: Entering directory '/build/sources/go-bert'
I go-gpt4all-j build info: 
I UNAME_S:  Linux
I UNAME_P:  aarch64
I UNAME_M:  aarch64
I CFLAGS:   -I. -I./bert.cpp/ggml/include/ggml/ -I./bert.cpp/ -I -O3 -DNDEBUG -std=c11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -Wno-unused-function -pthread -mcpu=native
I CXXFLAGS: -I. -I./bert.cpp/ggml/include/ggml/ -I./bert.cpp/ -O3 -DNDEBUG -std=c++17 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -mcpu=native
I LDFLAGS:  
I CMAKEFLAGS:  
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

make[1]: 'libgobert.a' is up to date.
make[1]: Leaving directory '/build/sources/go-bert'
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" C_INCLUDE_PATH=/build/sources/go-bert LIBRARY_PATH=/build/sources/go-bert \
go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.9.0-39-ge022b59" -X "github.com/go-skynet/LocalAI/internal.Commit=e022b5959ea409586bcead3473bbe8c180b9d2bf"" -tags "" -o backend-assets/grpc/bert-embeddings ./backend/go/llm/bert/
# github.com/go-skynet/go-bert.cpp
In file included from gobert.cpp:6:
sources/go-bert/bert.cpp/bert.cpp: In function 'bert_ctx* bert_load_from_file(const char*)':
sources/go-bert/bert.cpp/bert.cpp:610:89: warning: format '%lld' expects argument of type 'long long int', but argument 5 has type 'int64_t' {aka 'long int'} [-Wformat=]
  610 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld], expected [%lld, %lld]\n",
      |                                                                                      ~~~^
      |                                                                                         |
      |                                                                                         long long int
      |                                                                                      %ld
  611 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
      |                                                ~~~~~~~~~~~~~                             
      |                                                            |
      |                                                            int64_t {aka long int}
sources/go-bert/bert.cpp/bert.cpp:610:95: warning: format '%lld' expects argument of type 'long long int', but argument 6 has type 'int64_t' {aka 'long int'} [-Wformat=]
  610 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld], expected [%lld, %lld]\n",
      |                                                                                            ~~~^
      |                                                                                               |
      |                                                                                               long long int
      |                                                                                            %ld
  611 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
      |                                                               ~~~~~~~~~~~~~                    
      |                                                                           |
      |                                                                           int64_t {aka long int}
sources/go-bert/bert.cpp/bert.cpp:610:112: warning: format '%lld' expects argument of type 'long long int', but argument 7 has type 'int64_t' {aka 'long int'} [-Wformat=]
  610 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld], expected [%lld, %lld]\n",
      |                                                                                                             ~~~^
      |                                                                                                                |
      |                                                                                                                long long int
      |                                                                                                             %ld
  611 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
      |                                                                              ~~~~~                              
      |                                                                                  |
      |                                                                                  int64_t {aka long int}
sources/go-bert/bert.cpp/bert.cpp:610:118: warning: format '%lld' expects argument of type 'long long int', but argument 8 has type 'int64_t' {aka 'long int'} [-Wformat=]
  610 |                 fprintf(stderr, "%s: tensor '%s' has wrong shape in model file: got [%lld, %lld], expected [%lld, %lld]\n",
      |                                                                                                                   ~~~^
      |                                                                                                                      |
      |                                                                                                                      long long int
      |                                                                                                                   %ld
  611 |                         __func__, name.data(), tensor->ne[0], tensor->ne[1], ne[0], ne[1]);
      |                                                                                     ~~~~~                             
      |                                                                                         |
      |                                                                                         int64_t {aka long int}
sources/go-bert/bert.cpp/bert.cpp:624:37: warning: format '%lld' expects argument of type 'long long int', but argument 3 has type 'int64_t' {aka 'long int'} [-Wformat=]
  624 |                 printf("%24s - [%5lld, %5lld], type = %6s, %6.2f MB, %9zu bytes\n", name.data(), ne[0], ne[1], ftype_str[ftype], ggml_nbytes(tensor) / 1024.0 / 1024.0, ggml_nbytes(tensor));
      |                                 ~~~~^                                                            ~~~~~
      |                                     |                                                                |
      |                                     long long int                                                    int64_t {aka long int}
      |                                 %5ld
sources/go-bert/bert.cpp/bert.cpp:624:44: warning: format '%lld' expects argument of type 'long long int', but argument 4 has type 'int64_t' {aka 'long int'} [-Wformat=]
  624 |                 printf("%24s - [%5lld, %5lld], type = %6s, %6.2f MB, %9zu bytes\n", name.data(), ne[0], ne[1], ftype_str[ftype], ggml_nbytes(tensor) / 1024.0 / 1024.0, ggml_nbytes(tensor));
      |                                        ~~~~^                                                            ~~~~~
      |                                            |                                                                |
      |                                            long long int                                                    int64_t {aka long int}
      |                                        %5ld
sources/go-bert/bert.cpp/bert.cpp:655:101: warning: format '%llu' expects argument of type 'long long unsigned int', but argument 6 has type 'long unsigned int' [-Wformat=]
  655 |                 fprintf(stderr, "%s: tensor '%s' has wrong size in model file: got %zu, expected %llu\n",
      |                                                                                                  ~~~^
      |                                                                                                     |
      |                                                                                                     long long unsigned int
      |                                                                                                  %lu
  656 |                         __func__, name.data(), ggml_nbytes(tensor), nelements * bpe);
      |                                                                     ~~~~~~~~~~~~~~~                  
      |                                                                               |
      |                                                                               long unsigned int
sources/go-bert/bert.cpp/bert.cpp:692:56: warning: format '%lld' expects argument of type 'long long int', but argument 4 has type 'int64_t' {aka 'long int'} [-Wformat=]
  692 |     printf("%s: mem_per_token %zd KB, mem_per_input %lld MB\n", __func__, new_bert->mem_per_token / (1 << 10), new_bert->mem_per_input / (1 << 20));
      |                                                     ~~~^                                                       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                        |                                                                               |
      |                                                        long long int                                                                   int64_t {aka long int}
      |                                                     %ld
# github.com/go-skynet/LocalAI/backend/go/llm/bert
/usr/local/go/pkg/tool/linux_arm64/link: running g++ failed: exit status 1
/usr/bin/ld: cannot find -lcublas: No such file or directory
/usr/bin/ld: cannot find -lcudart: No such file or directory
/usr/bin/ld: cannot find -lcublas: No such file or directory
/usr/bin/ld: cannot find -lcudart: No such file or directory
/usr/bin/ld: cannot find -lcublas: No such file or directory
/usr/bin/ld: cannot find -lcudart: No such file or directory
/usr/bin/ld: cannot find -lcublas: No such file or directory
/usr/bin/ld: cannot find -lcudart: No such file or directory
collect2: error: ld returned 1 exit status

make: *** [Makefile:515: backend-assets/grpc/bert-embeddings] Error 1

FutureProofHomes commented 3 months ago

Okay, I've gotten further.

I updated my Dockerfile to use Nvidia's recommended CUDA base image for Jetson. I also updated ARG CUDA_MAJOR_VERSION to 12, since that is what comes built into Nvidia's base image. NOTE: I'm only building core.

Dockerfile:

# Set to core only.
ARG IMAGE_TYPE=core

#Use specific Nvidia base image instead of vanilla Ubuntu
#ARG BASE_IMAGE=ubuntu:22.04 
ARG BASE_IMAGE=nvcr.io/nvidia/l4t-cuda:12.2.2-devel-arm64-ubuntu22.04

# extras or core
FROM ${BASE_IMAGE} as requirements-core

USER root

ARG GO_VERSION=1.21.7
ARG BUILD_TYPE
#Set CUDA version to 12 since that is what comes with the Nvidia base image/Jetson OS.
ARG CUDA_MAJOR_VERSION=12
ARG CUDA_MINOR_VERSION=7
ARG TARGETARCH
ARG TARGETVARIANT

ENV BUILD_TYPE=${BUILD_TYPE}
ENV DEBIAN_FRONTEND=noninteractive
ENV EXTERNAL_GRPC_BACKENDS="coqui:/build/backend/python/coqui/run.sh,huggingface-embeddings:/build/backend/python/se>

ARG GO_TAGS="stablediffusion tinydream tts"

RUN apt-get update && \
    apt-get install -y ca-certificates curl patch pip cmake git && apt-get clean
.
.
.
.

Below is my docker-compose file:

version: '3.6'

services:
  api:
    image: localai:latest
    build:
      context: .
      dockerfile: Dockerfile
      platforms:
        - "linux/arm64"
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models:cached
      - ./images/:/tmp/generated/images/
    command: ["/usr/bin/local-ai" ]
    runtime: nvidia

Below is my .env file:

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
# THREADS=14

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Default models context size
# CONTEXT_SIZE=512
#
## Define galleries.
## models to install will be visible in `/models/available`
# GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]

## CORS settings
# CORS=true
# CORS_ALLOW_ORIGINS=*

## Default path for models
#
MODELS_PATH=/models

## Enable debug mode
DEBUG=true

## Disables COMPEL (Diffusers)
# COMPEL=0

## Enable/Disable single backend (useful if only one GPU is available)
# SINGLE_ACTIVE_BACKEND=true

## Specify a build type. Available: cublas, openblas, clblas.
## cuBLAS: This is a GPU-accelerated version of the complete standard BLAS (Basic Linear Algebra Subprograms) library. It's provided by Nvidia and is part of their CUDA toolkit.
## OpenBLAS: This is an open-source implementation of the BLAS library that aims to provide highly optimized code for various platforms. It includes support for multi-threading and can be compiled to use hardware-specific features for additional performance. OpenBLAS can run on many kinds of hardware, including CPUs from Intel, AMD, and ARM.
## clBLAS:   This is an open-source implementation of the BLAS library that uses OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. clBLAS is designed to take advantage of the parallel computing power of GPUs but can also run on any hardware that supports OpenCL. This includes hardware from different vendors like Nvidia, AMD, and Intel.
BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
REBUILD=true

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper 
## (requires REBUILD=true)
#
# GO_TAGS=stablediffusion

## Path where to store generated images
# IMAGE_PATH=/tmp

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

## List of external GRPC backends (note on the container image this variable is already set to use extra backends available in extra/)
# EXTERNAL_GRPC_BACKENDS=my-backend:127.0.0.1:9000,my-backend2:/usr/bin/backend.py

### Advanced settings ###
### Those are not really used by LocalAI, but from components in the stack ###
##
### Preload libraries
#LD_PRELOAD=

#LD_PRELOAD=/lib/aarch64-linux-gnu/libGLdispatch.so.0
LD_LIBRARY_PATH=/usr/local/cuda/lib64/
CUDA_LIBPATH=/usr/local/cuda/lib64/

### Huggingface cache for models
# HUGGINGFACE_HUB_CACHE=/usr/local/huggingface

### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
# PYTHON_GRPC_MAX_WORKERS=1

### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
# LLAMACPP_PARALLEL=1

### Enable to run parallel requests
# PARALLEL_REQUESTS=true

### Watchdog settings
###
# Enables watchdog to kill backends that are inactive for too much time
# WATCHDOG_IDLE=true
#
# Enables watchdog to kill backends that are busy for too much time
# WATCHDOG_BUSY=true
#
# Time in duration format (e.g. 1h30m) after which a backend is considered idle
# WATCHDOG_IDLE_TIMEOUT=5m
#
# Time in duration format (e.g. 1h30m) after which a backend is considered busy
# WATCHDOG_BUSY_TIMEOUT=5m

And below are my docker logs when the build fails after starting the container for the first time. Notice it fails while building whisper.cpp:

# github.com/go-skynet/go-llama.cpp
binding.cpp: In function 'void llama_binding_free_model(void*)':
binding.cpp:613:5: warning: possible problem detected in invocation of 'operator delete' [-Wdelete-incomplete]
  613 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
binding.cpp:613:17: warning: invalid use of incomplete type 'struct llama_model'
  613 |     delete ctx->model;
      |            ~~~~~^~~~~
In file included from sources/go-llama-ggml/llama.cpp/examples/common.h:5,
                 from binding.cpp:1:
sources/go-llama-ggml/llama.cpp/llama.h:70:12: note: forward declaration of 'struct llama_model'
   70 |     struct llama_model;
      |            ^~~~~~~~~~~
binding.cpp:613:5: note: neither the destructor nor the class-specific 'operator delete' will be called, even if they are declared when the class is defined
  613 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
mkdir -p backend-assets/gpt4all
cp: cannot stat 'sources/gpt4all/gpt4all-bindings/golang/buildllm/*.dylib': No such file or directory
cp: cannot stat 'sources/gpt4all/gpt4all-bindings/golang/buildllm/*.dll': No such file or directory
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" C_INCLUDE_PATH=/build/sources/gpt4all/gpt4all-bindings/golang/ LIBRARY_PATH=/build/sources/gpt4all/gpt4all-bindings/golang/ \
go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.9.0-39-ge022b59" -X "github.com/go-skynet/LocalAI/internal.Commit=e022b5959ea409586bcead3473bbe8c180b9d2bf"" -tags "" -o backend-assets/grpc/gpt4all ./backend/go/llm/gpt4all/
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" C_INCLUDE_PATH=/build/sources/go-rwkv LIBRARY_PATH=/build/sources/go-rwkv \
go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.9.0-39-ge022b59" -X "github.com/go-skynet/LocalAI/internal.Commit=e022b5959ea409586bcead3473bbe8c180b9d2bf"" -tags "" -o backend-assets/grpc/rwkv ./backend/go/llm/rwkv
cd sources/whisper.cpp && make libwhisper.a
make[1]: Entering directory '/build/sources/whisper.cpp'
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  aarch64
I UNAME_M:  aarch64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/aarch64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/aarch64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/aarch64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/aarch64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
CFLAGS   += -mcpu=native
make[1]: CFLAGS: No such file or directory
make[1]: *** [Makefile:216: ggml-cuda.o] Error 127
make[1]: Leaving directory '/build/sources/whisper.cpp'
make: *** [Makefile:237: sources/whisper.cpp/libwhisper.a] Error 

Full Docker Compose Failure Logs Attached: DockerComposeBuildFail.txt

FutureProofHomes commented 3 months ago

It's worth mentioning that if I just run docker compose up -d again, LocalAI starts:

I local-ai build info:
I BUILD_TYPE: cublas
I GO_TAGS: stablediffusion
I LD_FLAGS: -X "github.com/go-skynet/LocalAI/internal.Version=v2.9.0-39-ge022b59" -X "github.com/go-skynet/LocalAI/internal.Commit=e022b5959ea409586bcead3473bbe8c180b9d2bf"
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.9.0-39-ge022b59" -X "github.com/go-skynet/LocalAI/internal.Commit=e022b5959ea409586bcead3473bbe8c180b9d2bf"" -tags "stablediffusion" -o local-ai ./
7:07AM DBG no galleries to load
7:07AM INF Starting LocalAI using 6 threads, with models path: /models
7:07AM INF LocalAI version: v2.9.0-39-ge022b59 (e022b5959ea409586bcead3473bbe8c180b9d2bf)
7:07AM WRN [startup] failed resolving model '/usr/bin/local-ai'
7:07AM INF Preloading models from /models
7:07AM INF Model name: luna
7:07AM DBG Model: luna (config: {PredictionOptions:{Model:luna-ai-llama2-uncensored.Q6_K.gguf Language: N:0 TopP:0.7 TopK:90 Temperature:0.2 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:luna F16:false Threads:6 Debug:false Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:1 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:1024 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:})
7:07AM DBG Extracting backend assets files to /tmp/localai/backend_data
7:07AM INF core/startup process completed!
7:07AM DBG No uploadedFiles file found at /tmp/localai/upload/uploadedFiles.json

 ┌───────────────────────────────────────────────────┐ 
 │                   Fiber v2.50.0                   │ 
 │               http://127.0.0.1:8080               │ 
 │       (bound on host 0.0.0.0 and port 8080)       │ 
 │                                                   │ 
 │ Handlers ........... 105  Processes ........... 1 │ 
 │ Prefork ....... Disabled  PID ............... 167 │ 
 └───────────────────────────────────────────────────┘ 

[127.0.0.1]:54492 200 - GET /readyz

But then when I hit the completions endpoint with:

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "luna",
     "prompt": "How fast is light?",                                                                                    
     "temperature": 0.1 }'

I get the below error:

7:10AM DBG Request received: {"model":"luna","language":"","n":0,"top_p":0,"top_k":0,"temperature":0.1,"max_tokens":0,"echo":false,"batch":0,"f16":false,"ignore_eos":false,"repeat_penalty":0,"n_keep":0,"mirostat_eta":0,"mirostat_tau":0,"mirostat":0,"frequency_penalty":0,"tfz":0,"typical_p":0,"seed":0,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","response_format":{},"size":"","prompt":"How fast is light?","instruction":"","input":null,"stop":null,"messages":null,"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"backend":"","model_base_name":""}
7:10AM DBG `input`: &{PredictionOptions:{Model:luna Language: N:0 TopP:0 TopK:0 Temperature:0.1 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Context:context.Background.WithCancel Cancel:0xbaa10 File: ResponseFormat:{Type:} Size: Prompt:How fast is light? Instruction: Input:<nil> Stop:<nil> Messages:[] Functions:[] FunctionCall:<nil> Tools:[] ToolsChoice:<nil> Stream:false Mode:0 Step:0 Grammar: JSONFunctionGrammarObject:<nil> Backend: ModelBaseName:}
7:10AM DBG Parameter Config: &{PredictionOptions:{Model:luna-ai-llama2-uncensored.Q6_K.gguf Language: N:0 TopP:0.7 TopK:90 Temperature:0.1 Maxtokens:0 Echo:false Batch:0 F16:false IgnoreEOS:false RepeatPenalty:0 Keep:0 MirostatETA:0 MirostatTAU:0 Mirostat:0 FrequencyPenalty:0 TFZ:0 TypicalP:0 Seed:0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:luna F16:false Threads:6 Debug:true Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions:} PromptStrings:[How fast is light?] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0 MirostatTAU:0 Mirostat:0 NGPULayers:1 MMap:true MMlock:false LowVRAM:false Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:1024 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:}
7:10AM DBG Template found, input modified to: Complete the following sentence: How fast is light?

7:10AM INF Loading model 'luna-ai-llama2-uncensored.Q6_K.gguf' with backend llama
7:10AM DBG llama-cpp is an alias of llama-cpp
7:10AM DBG Loading model in memory from file: /models/luna-ai-llama2-uncensored.Q6_K.gguf
7:10AM DBG Loading Model luna-ai-llama2-uncensored.Q6_K.gguf with gRPC (file: /models/luna-ai-llama2-uncensored.Q6_K.gguf) (backend: llama-cpp): {backendString:llama model:luna-ai-llama2-uncensored.Q6_K.gguf threads:6 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0x4000234600 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
7:10AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/grpc/llama-cpp
7:10AM DBG GRPC Service for luna-ai-llama2-uncensored.Q6_K.gguf will be running at: '127.0.0.1:46509'
7:10AM DBG GRPC Service state dir: /tmp/go-processmanager1266993544
7:10AM DBG GRPC Service Started
7:10AM DBG GRPC(luna-ai-llama2-uncensored.Q6_K.gguf-127.0.0.1:46509): stdout Server listening on 127.0.0.1:46509
7:10AM DBG GRPC Service Ready
7:10AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:luna-ai-llama2-uncensored.Q6_K.gguf ContextSize:1024 Seed:0 NBatch:512 F16Memory:false MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:1 MainGPU: TensorSplit: Threads:6 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/luna-ai-llama2-uncensored.Q6_K.gguf Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:}
7:10AM DBG GRPC(luna-ai-llama2-uncensored.Q6_K.gguf-127.0.0.1:46509): stderr ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
7:10AM DBG GRPC(luna-ai-llama2-uncensored.Q6_K.gguf-127.0.0.1:46509): stderr ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
7:10AM DBG GRPC(luna-ai-llama2-uncensored.Q6_K.gguf-127.0.0.1:46509): stderr ggml_init_cublas: found 1 CUDA devices:
7:10AM DBG GRPC(luna-ai-llama2-uncensored.Q6_K.gguf-127.0.0.1:46509): stderr   Device 0: Orin, compute capability 8.7, VMM: yes
7:10AM DBG GRPC(luna-ai-llama2-uncensored.Q6_K.gguf-127.0.0.1:46509): stderr CUDA error: the resource allocation failed
7:10AM DBG GRPC(luna-ai-llama2-uncensored.Q6_K.gguf-127.0.0.1:46509): stderr   current device: 0, in function ggml_init_cublas at /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:8831
7:10AM DBG GRPC(luna-ai-llama2-uncensored.Q6_K.gguf-127.0.0.1:46509): stderr   cublasCreate_v2(&g_cublas_handles[id])
7:10AM DBG GRPC(luna-ai-llama2-uncensored.Q6_K.gguf-127.0.0.1:46509): stderr GGML_ASSERT: /build/backend/cpp/llama/llama.cpp/ggml-cuda.cu:255: !"CUDA error"
[172.19.0.1]:45940 500 - POST /v1/completions

FutureProofHomes commented 3 months ago

@mudler any tips to get this working? Thnx! ^^^^

anthonyjhicks commented 3 months ago

Watching this thread with interest, as I'm keen to pick up a Jetson Orin Nano 8GB for low-power, always-available local AI workloads - primarily a voice pipeline from Home Assistant. So hopefully whatever you figure out here @FutureProofHomes will apply to the Orin Nano as well. Keep at it!

mudler commented 3 months ago

Okay, I've gotten further.

I updated my Dockerfile to use Nvidia's recommended CUDA base image for Jetson. I also updated ARG CUDA_MAJOR_VERSION to 12 since that is what comes built-in Nvidia's base image. NOTE: I'm only buildling Core.

Dockerfile:

# Set to core only.
ARG IMAGE_TYPE=core

#Use specific Nvidia base image instead of vanilla Ubuntu
#ARG BASE_IMAGE=ubuntu:22.04 
ARG BASE_IMAGE=nvcr.io/nvidia/l4t-cuda:12.2.2-devel-arm64-ubuntu22.04

# extras or core
FROM ${BASE_IMAGE} as requirements-core

USER root

ARG GO_VERSION=1.21.7
ARG BUILD_TYPE
#Set CUDA version to 12 since that is what comes with Nvidia base imaage/Jetson OS.
ARG CUDA_MAJOR_VERSION=12
ARG CUDA_MINOR_VERSION=7
ARG TARGETARCH
ARG TARGETVARIANT

ENV BUILD_TYPE=${BUILD_TYPE}
ENV DEBIAN_FRONTEND=noninteractive
ENV EXTERNAL_GRPC_BACKENDS="coqui:/build/backend/python/coqui/run.sh,huggingface-embeddings:/build/backend/python/se>

ARG GO_TAGS="stablediffusion tinydream tts"

RUN apt-get update && \
    apt-get install -y ca-certificates curl patch pip cmake git && apt-get clean
.
.
.
.

Below is my docker-compose file:

version: '3.6'

services:
  api:
    image: localai:latest
    build:
      context: .
      dockerfile: Dockerfile
      platforms:
        - "linux/arm64"
    ports:
      - 8080:8080
    env_file:
      - .env
    volumes:
      - ./models:/models:cached
      - ./images/:/tmp/generated/images/
    command: ["/usr/bin/local-ai" ]
    runtime: nvidia
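
With the Dockerfile and compose file above, the stack is built and started the usual way; a minimal sketch (the service is named api, as in the compose file):

docker compose build
docker compose up -d
docker compose logs -f api       # follow the LocalAI logs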

Below is my .env file:

## Set number of threads.
## Note: prefer the number of physical cores. Overbooking the CPU degrades performance notably.
# THREADS=14

## Specify a different bind address (defaults to ":8080")
# ADDRESS=127.0.0.1:8080

## Default models context size
# CONTEXT_SIZE=512
#
## Define galleries.
## Models to install will be visible in `/models/available`
# GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}]

## CORS settings
# CORS=true
# CORS_ALLOW_ORIGINS=*

## Default path for models
#
MODELS_PATH=/models

## Enable debug mode
DEBUG=true

## Disables COMPEL (Diffusers)
# COMPEL=0

## Enable/Disable single backend (useful if only one GPU is available)
# SINGLE_ACTIVE_BACKEND=true

## Specify a build type. Available: cublas, openblas, clblas.
## cuBLAS: This is a GPU-accelerated version of the complete standard BLAS (Basic Linear Algebra Subprograms) library. It's provided by Nvidia and is part of their CUDA toolkit.
## OpenBLAS: This is an open-source implementation of the BLAS library that aims to provide highly optimized code for various platforms. It includes support for multi-threading and can be compiled to use hardware-specific features for additional performance. OpenBLAS can run on many kinds of hardware, including CPUs from Intel, AMD, and ARM.
## clBLAS:   This is an open-source implementation of the BLAS library that uses OpenCL, a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. clBLAS is designed to take advantage of the parallel computing power of GPUs but can also run on any hardware that supports OpenCL. This includes hardware from different vendors like Nvidia, AMD, and Intel.
BUILD_TYPE=cublas

## Uncomment and set to true to enable rebuilding from source
REBUILD=true

## Enable go tags, available: stablediffusion, tts
## stablediffusion: image generation with stablediffusion
## tts: enables text-to-speech with go-piper 
## (requires REBUILD=true)
#
# GO_TAGS=stablediffusion

## Path where to store generated images
# IMAGE_PATH=/tmp

## Specify a default upload limit in MB (whisper)
# UPLOAD_LIMIT

## List of external GRPC backends (note on the container image this variable is already set to use extra backends available in extra/)
# EXTERNAL_GRPC_BACKENDS=my-backend:127.0.0.1:9000,my-backend2:/usr/bin/backend.py

### Advanced settings ###
### These are not really used by LocalAI itself, but by components in the stack ###
##
### Preload libraries
#LD_PRELOAD=

#LD_PRELOAD=/lib/aarch64-linux-gnu/libGLdispatch.so.0
LD_LIBRARY_PATH=/usr/local/cuda/lib64/
CUDA_LIBPATH=/usr/local/cuda/lib64/

### Huggingface cache for models
# HUGGINGFACE_HUB_CACHE=/usr/local/huggingface

### Python backends GRPC max workers
### Default number of workers for GRPC Python backends.
### This actually controls whether a backend can process multiple requests or not.
# PYTHON_GRPC_MAX_WORKERS=1

### Define the number of parallel LLAMA.cpp workers (Defaults to 1)
# LLAMACPP_PARALLEL=1

### Enable to run parallel requests
# PARALLEL_REQUESTS=true

### Watchdog settings
###
# Enables watchdog to kill backends that have been inactive for too long
# WATCHDOG_IDLE=true
#
# Enables watchdog to kill backends that have been busy for too long
# WATCHDOG_BUSY=true
#
# Time in duration format (e.g. 1h30m) after which a backend is considered idle
# WATCHDOG_IDLE_TIMEOUT=5m
#
# Time in duration format (e.g. 1h30m) after which a backend is considered busy
# WATCHDOG_BUSY_TIMEOUT=5m

And below are my docker logs from when the build fails after starting the container for the first time. Notice it fails while building whisper.cpp:

# github.com/go-skynet/go-llama.cpp
binding.cpp: In function 'void llama_binding_free_model(void*)':
binding.cpp:613:5: warning: possible problem detected in invocation of 'operator delete' [-Wdelete-incomplete]
  613 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
binding.cpp:613:17: warning: invalid use of incomplete type 'struct llama_model'
  613 |     delete ctx->model;
      |            ~~~~~^~~~~
In file included from sources/go-llama-ggml/llama.cpp/examples/common.h:5,
                 from binding.cpp:1:
sources/go-llama-ggml/llama.cpp/llama.h:70:12: note: forward declaration of 'struct llama_model'
   70 |     struct llama_model;
      |            ^~~~~~~~~~~
binding.cpp:613:5: note: neither the destructor nor the class-specific 'operator delete' will be called, even if they are declared when the class is defined
  613 |     delete ctx->model;
      |     ^~~~~~~~~~~~~~~~~
mkdir -p backend-assets/gpt4all
cp: cannot stat 'sources/gpt4all/gpt4all-bindings/golang/buildllm/*.dylib': No such file or directory
cp: cannot stat 'sources/gpt4all/gpt4all-bindings/golang/buildllm/*.dll': No such file or directory
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" C_INCLUDE_PATH=/build/sources/gpt4all/gpt4all-bindings/golang/ LIBRARY_PATH=/build/sources/gpt4all/gpt4all-bindings/golang/ \
go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.9.0-39-ge022b59" -X "github.com/go-skynet/LocalAI/internal.Commit=e022b5959ea409586bcead3473bbe8c180b9d2bf"" -tags "" -o backend-assets/grpc/gpt4all ./backend/go/llm/gpt4all/
CGO_LDFLAGS="-lcublas -lcudart -L/usr/local/cuda/lib64/" C_INCLUDE_PATH=/build/sources/go-rwkv LIBRARY_PATH=/build/sources/go-rwkv \
go build -ldflags "-X "github.com/go-skynet/LocalAI/internal.Version=v2.9.0-39-ge022b59" -X "github.com/go-skynet/LocalAI/internal.Commit=e022b5959ea409586bcead3473bbe8c180b9d2bf"" -tags "" -o backend-assets/grpc/rwkv ./backend/go/llm/rwkv
cd sources/whisper.cpp && make libwhisper.a
make[1]: Entering directory '/build/sources/whisper.cpp'
I whisper.cpp build info: 
I UNAME_S:  Linux
I UNAME_P:  aarch64
I UNAME_M:  aarch64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/aarch64-linux/include
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/aarch64-linux/include
I LDFLAGS:  -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L/targets/aarch64-linux/lib
I CC:       cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
I CXX:      g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

nvcc --forward-unknown-to-host-compiler -arch=native -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread -DGGML_USE_CUBLAS -I/usr/local/cuda/include -I/opt/cuda/include -I/targets/aarch64-linux/include -Wno-pedantic -c ggml-cuda.cu -o ggml-cuda.o
CFLAGS   += -mcpu=native
make[1]: CFLAGS: No such file or directory
make[1]: *** [Makefile:216: ggml-cuda.o] Error 127
make[1]: Leaving directory '/build/sources/whisper.cpp'
make: *** [Makefile:237: sources/whisper.cpp/libwhisper.a] Error 

Full Docker Compose Failure Logs Attached: DockerComposeBuildFail.txt

This looks like no nvcc binary was found. Do you have the CUDA bin directory in your PATH?

export PATH=/usr/local/cuda/bin/:$PATH
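
A quick way to check (inside the build container/environment) before re-running the build:

which nvcc || echo "nvcc is not on PATH"
nvcc --version   # should print the CUDA toolkit release if it is found
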
supernerd76 commented 1 month ago

I've solved the build problem on Jetson Orin, at least for a core build. I'll get to the problems with extras - specifically vllm - at the end. First, the issue in the logs from FutureProofHomes is that when whisper.cpp builds, an ifneq block in its Makefile ends up treating "CFLAGS" as a command rather than a variable. All that section does is add "-mcpu=native" when it evaluates to true on aarch64. After a few weeks of beating my head against a brick wall trying to resolve it, I came up with a fix using WHISPER_CPP_VERSION=adee3f9c1faec890eb0c5f3f6f2f73597a8b3962. It's a horrible, ugly hack, but it builds.
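
For anyone curious, here's a minimal reproduction of that make gotcha. It is not the actual whisper.cpp Makefile, just an illustration of the same failure mode: a tab-indented line inside a conditional that follows a rule is parsed as a recipe line, so make hands "CFLAGS" to the shell as a command:

printf 'all:\n\t@echo compiling\nifneq (,1)\n\tCFLAGS += -mcpu=native\nendif\n' > /tmp/gotcha.mk
make -f /tmp/gotcha.mk
# -> make: CFLAGS: No such file or directory ... Error 127, the same failure as in the build log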

Here's the changes that I made to get it to build a docker image:

I ran into this problem even when trying to build whisper.cpp as a standalone Docker container, so it's something with how its Makefile is interpreted by 'docker build'. It's noteworthy that when building outside of Docker, it ran that clause just fine. More research here is needed.

Finally, once all those changes are made, here's the build command that worked:

docker build --no-cache --build-arg IMAGE_TYPE=core --build-arg BASE_IMAGE=nvidia/cuda:12.4.1-devel-ubuntu22.04 --build-arg GO_TAGS="stablediffusion tts" --build-arg BUILD_TYPE=cublas --build-arg CUDA_MAJOR_VERSION=12 --build-arg CUDA_MINOR_VERSION=4 --build-arg WHISPER_CPP_VERSION=adee3f9c1faec890eb0c5f3f6f2f73597a8b3962 -t localai_jetson_core .
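
Once that image exists, a minimal way to run it on the Jetson (a sketch only; port, model path, and startup command taken from the compose file above, image tag from the build command):

docker run --rm -it --runtime nvidia -p 8080:8080 -v $PWD/models:/models localai_jetson_core /usr/bin/local-ai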

Trying to build the "extras" image keeps failing at vllm, which insists that CUDA_HOME is not set, despite my setting it in .env, the Dockerfile, the Makefile, and install.sh - basically everywhere it can possibly be set. I'll likely return to that once I've done more testing on the core image and my brain is less puddinglike.

EDIT: Now to solve the next problem: Once built - even with only a llama.cpp backend - it fails to load any models with the below error:

WARNING: failed to read int from file: open /sys/class/drm/card0/device/numa_node: no such file or directory
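
That numa_node path is a sysfs entry that normally exists for PCI GPUs; on Jetson the GPU is integrated rather than on the PCI bus, so the entry being absent is plausible and this line may only be a warning, with the real failure reported further down in the log. A quick check on the host (an assumption on my part, not verified on an Orin):

ls -l /sys/class/drm/
cat /sys/class/drm/card*/device/numa_node 2>/dev/null || echo "no numa_node entry here"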