mudler / LocalAI

:robot: The free, Open Source alternative to OpenAI, Claude and others. Self-hosted and local-first. Drop-in replacement for OpenAI, running on consumer-grade hardware. No GPU required. Runs gguf, transformers, diffusers and many more model architectures. Features: Generate Text, Audio, Video, Images, Voice Cloning, Distributed inference
https://localai.io
MIT License

OpenVINO test on i3-N300 gives Method not implemented #2034

Open gericho opened 4 months ago

gericho commented 4 months ago

Hi! I'm trying to use the OpenVINO support on my machine. I was able to make the model download automatically, but I got the error shown in the log attached below.

LocalAI version: quay.io/go-skynet/local-ai:master-sycl-f16-ffmpeg

Environment, CPU architecture, OS, and Version: i3-N300, 32GB RAM (it is an N300, not an N305 as shown in the debug logs)

Describe the bug

local-ai-1  | 1:06PM DBG [OAIS GenerateTextFromRequest] Prompt: "A long time ago in a galaxy far, far away"
local-ai-1  | 1:06PM INF Loading model 'fakezeta/Starling-LM-7B-beta-openvino-int8' with backend transformers
local-ai-1  | 1:06PM DBG Model already loaded in memory: fakezeta/Starling-LM-7B-beta-openvino-int8
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr 2024-04-15 13:06:21,325 - grpc._cython.cygrpc - ERROR - Unexpected [NotImplementedError] raised by servicer method [/backend.Backend/TokenizeString]
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr Traceback (most recent call last):
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 689, in grpc._cython.cygrpc._handle_exceptions
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 821, in _handle_rpc
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 554, in _handle_unary_unary_rpc
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr   File "src/python/grpcio/grpc/_cython/_cygrpc/aio/server.pyx.pxi", line 415, in _finish_handler_with_unary_response
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr   File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr     result = self.fn(*self.args, **self.kwargs)
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr   File "/build/backend/python/transformers/backend_pb2_grpc.py", line 144, in TokenizeString
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr     raise NotImplementedError('Method not implemented!')
local-ai-1  | 1:06PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:41017): stderr NotImplementedError: Method not implemented!
local-ai-1  | 12:51PM DBG Grammar: root-0 ::= "{" space "\"arguments\"" space ":" space root-0-arguments "," space "\"function\"" space ":" space root-0-function "}" space
local-ai-1  | root-1-arguments ::= "{" space "\"message\"" space ":" space string "}" space
local-ai-1  | root-0-arguments-list-item-service-data ::= "{" space "\"entity_id\"" space ":" space string "}" space
local-ai-1  | root-0-arguments-list-item ::= "{" space "\"domain\"" space ":" space string "," space "\"service\"" space ":" space string "," space "\"service_data\"" space ":" space root-0-arguments-list-item-service-data "}" space
local-ai-1  | root-0-arguments-list ::= "[" space (root-0-arguments-list-item ("," space root-0-arguments-list-item)*)? "]" space
local-ai-1  | root-0-arguments ::= "{" space "\"list\"" space ":" space root-0-arguments-list "}" space
local-ai-1  | root-0-function ::= "\"execute_services\""
local-ai-1  | root-1-function ::= "\"answer\""
local-ai-1  | root-1 ::= "{" space "\"arguments\"" space ":" space root-1-arguments "," space "\"function\"" space ":" space root-1-function "}" space
local-ai-1  | root ::= root-0 | root-1
local-ai-1  | space ::= " "?
local-ai-1  | string ::= "\"" (
local-ai-1  |                   [^"\\] |
local-ai-1  |                   "\\" (["\\/bfnrt] | "u" [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F] [0-9a-fA-F])
local-ai-1  |             )* "\"" space
local-ai-1  | 12:51PM INF Loading model 'fakezeta/Starling-LM-7B-beta-openvino-int8' with backend transformers
local-ai-1  | 12:51PM DBG Loading model in memory from file: /models/fakezeta/Starling-LM-7B-beta-openvino-int8
local-ai-1  | 12:51PM DBG Loading Model fakezeta/Starling-LM-7B-beta-openvino-int8 with gRPC (file: /models/fakezeta/Starling-LM-7B-beta-openvino-int8) (backend: transformers): {backendString:transformers model:fakezeta/Starling-LM-7B-beta-openvino-int8 threads:6 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc000426600 externalBackends:map[autogptq:/build/backend/python/autogptq/run.sh bark:/build/backend/python/bark/run.sh coqui:/build/backend/python/coqui/run.sh diffusers:/build/backend/python/diffusers/run.sh exllama:/build/backend/python/exllama/run.sh exllama2:/build/backend/python/exllama2/run.sh huggingface-embeddings:/build/backend/python/sentencetransformers/run.sh mamba:/build/backend/python/mamba/run.sh parler-tts:/build/backend/python/parler-tts/run.sh petals:/build/backend/python/petals/run.sh sentencetransformers:/build/backend/python/sentencetransformers/run.sh transformers:/build/backend/python/transformers/run.sh transformers-musicgen:/build/backend/python/transformers-musicgen/run.sh vall-e-x:/build/backend/python/vall-e-x/run.sh vllm:/build/backend/python/vllm/run.sh] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
local-ai-1  | 12:51PM DBG Loading external backend: /build/backend/python/transformers/run.sh
local-ai-1  | 12:51PM DBG Loading GRPC Process: /build/backend/python/transformers/run.sh
local-ai-1  | 12:51PM DBG GRPC Service for fakezeta/Starling-LM-7B-beta-openvino-int8 will be running at: '127.0.0.1:42359'
local-ai-1  | 12:51PM DBG GRPC Service state dir: /tmp/go-processmanager655087151
local-ai-1  | 12:51PM DBG GRPC Service Started
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr /usr/local/lib/python3.10/dist-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr   warnings.warn(
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr Server started. Listening on: 127.0.0.1:42359
local-ai-1  | 12:51PM DBG GRPC Service Ready
local-ai-1  | 12:51PM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:<nil>} sizeCache:0 unknownFields:[] Model:fakezeta/Starling-LM-7B-beta-openvino-int8 ContextSize:8192 Seed:793163534 NBatch:512 F16Memory:true MLock:false MMap:true VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:6 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/models/fakezeta/Starling-LM-7B-beta-openvino-int8 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type:OVModelForCausalLM}
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stdout INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, onnx, openvino
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr /usr/local/lib/python3.10/dist-packages/transformers/utils/import_utils.py:519: FutureWarning: `is_torch_tpu_available` is deprecated and will be removed in 4.41.0. Please use the `is_torch_xla_available` instead.
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr   warnings.warn(
local-ai-1  | 12:51PM DBG [WatchDog] Watchdog checks for busy connections
local-ai-1  | 12:51PM DBG [WatchDog] 127.0.0.1:42359: active connection
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr Compiling the model to GPU ...
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr Setting OpenVINO CACHE_DIR to /models/models--fakezeta--Starling-LM-7B-beta-openvino-int8/snapshots/0bd9825dc1df4343667c29853745af9c6b7b0186/model_cache
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr /usr/local/lib/python3.10/dist-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
local-ai-1  | 12:51PM DBG GRPC(fakezeta/Starling-LM-7B-beta-openvino-int8-127.0.0.1:42359): stderr   warn(
local-ai-1  | 12:51PM DBG [WatchDog] Watchdog checks for busy connections
local-ai-1  | 12:51PM DBG [WatchDog] 127.0.0.1:42359: active connection
local-ai-1  | [127.0.0.1]:40840 200 - GET /readyz
local-ai-1  | 12:52PM DBG [WatchDog] Watchdog checks for busy connections
local-ai-1  | 12:52PM DBG [WatchDog] 127.0.0.1:42359: active connection
local-ai-1  | 12:52PM DBG ss[function] is not OK!
local-ai-1  | 12:52PM DBG [GenerateFromMultipleMessagesChatRequest] fnResultsBranch: []
local-ai-1  | 12:52PM DBG Chat Final Response jsonResult=null

To Reproduce

Docker Compose used

services:
    local-ai:
        tty: true
        stdin_open: true
        devices:
            - /dev/dri/renderD128:/dev/dri/renderD128
#            - /dev/dri/card1:/dev/dri/card1
        ports:
            - 8080:8080
        restart: always
        environment:
            - DEBUG=true
            - MODELS_PATH=/models
            - THREADS=8
            - SINGLE_ACTIVE_BACKEND=false
            - GGML_SYCL_DEVICE=0
            - WATCHDOG_BUSY=true
            - WATCHDOG_BUSY_TIMEOUT=15m
            - ZES_ENABLE_SYSMAN=1

        volumes:
            - ./models:/models
            - ./images:/tmp/generated/images/
        image: quay.io/go-skynet/local-ai:master-sycl-f16-ffmpeg

volumes:
    models:
    photos:

Model YAML defined in /root/models

name: starling-openvino
backend: transformers
parameters:
  model: fakezeta/Starling-LM-7B-beta-openvino-int8
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
stopwords:
- <|end_of_turn|>
- <|endoftext|>
prompt_cache_path: "cache"
prompt_cache_all: true
template:
  chat_message: |
    {{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}

  chat: |
    {{.Input}}<|end_of_turn|>GPT4 Correct Assistant:

  completion: |
    {{.Input}}

Expected behavior: the model loads successfully and it is possible to generate text.

Logs

[+] Running 1/0
 ✔ Container root-local-ai-1  Created                                                                                                                                                                                             0.0s
Attaching to local-ai-1
local-ai-1  | @@@@@
local-ai-1  | Skipping rebuild
local-ai-1  | @@@@@
local-ai-1  | If you are experiencing issues with the pre-compiled builds, try setting REBUILD=true
local-ai-1  | If you are still experiencing issues with the build, try setting CMAKE_ARGS and disable the instructions set as needed:
local-ai-1  | CMAKE_ARGS="-DLLAMA_F16C=OFF -DLLAMA_AVX512=OFF -DLLAMA_AVX2=OFF -DLLAMA_FMA=OFF"
local-ai-1  | see the documentation at: https://localai.io/basics/build/index.html
local-ai-1  | Note: See also https://github.com/go-skynet/LocalAI/issues/288
local-ai-1  | @@@@@
local-ai-1  | CPU info:
local-ai-1  | model name        : Intel(R) Core(TM) i3-N305
local-ai-1  | flags             : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l2 cdp_l2 ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
local-ai-1  | CPU:    AVX    found OK
local-ai-1  | CPU:    AVX2   found OK
local-ai-1  | CPU: no AVX512 found
local-ai-1  | @@@@@
local-ai-1  | 12:46PM INF loading environment variables from file envFile=.env
local-ai-1  | 12:46PM INF Setting logging to debug
local-ai-1  | 12:46PM INF Starting LocalAI using 8 threads, with models path: /models
local-ai-1  | 12:46PM INF LocalAI version: b739cbb (b739cbb86b9734bd62d4f63fad6583cf97059ea5)
local-ai-1  | 12:46PM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json
local-ai-1  | 12:46PM DBG No configuration file found at /tmp/localai/config/assistants.json
local-ai-1  | 12:46PM DBG No configuration file found at /tmp/localai/config/assistantsFile.json
local-ai-1  | 12:46PM INF Preloading models from /models
local-ai-1  |
local-ai-1  |   Model name: starling-openvino
local-ai-1  |
local-ai-1  |
local-ai-1  | 12:46PM DBG Model: starling-openvino (config: {PredictionOptions:{Model:fakezeta/Starling-LM-7B-beta-openvino-int8 Language: N:0 TopP:0xc00037f040 TopK:0xc00037f048 Temperature:0xc00037f050 Maxtokens:0xc00037f058 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc00037f080 TypicalP:0xc00037f078 Seed:0xc00037f0a8 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:starling-openvino F16:0xc00037efe0 Threads:0xc00037efc8 Debug:0xc00037f0a0 Roles:map[] Embeddings:false Backend:transformers TemplateConfig:{Chat:{{.Input}}<|end_of_turn|>GPT4 Correct Assistant:
local-ai-1  |  ChatMessage:{{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}
local-ai-1  |  Completion:{{.Input}}
local-ai-1  |  Edit: Functions: UseTokenizerTemplate:false} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: FunctionsConfig:{DisableNoAction:false NoActionFunctionName: NoActionDescriptionName: ParallelCalls:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath:cache PromptCacheAll:true PromptCacheRO:false MirostatETA:0xc00037f070 MirostatTAU:0xc00037f068 Mirostat:0xc00037f060 NGPULayers:0xc00037f088 MMap:0xc00037f0a0 MMlock:0xc00037f0a1 LowVRAM:0xc00037f0a1 Grammar: StopWords:[<|end_of_turn|> <|endoftext|>] Cutstrings:[] TrimSpace:[] TrimSuffix:[] ContextSize:0xc00037efa8 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 MMProj: RopeScaling: ModelType:OVModelForCausalLM YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} VallE:{AudioPath:} CUDA:false DownloadFiles:[] Description: Usage:})
local-ai-1  | 12:46PM DBG Extracting backend assets files to /tmp/localai/backend_data
local-ai-1  | 12:46PM INF core/startup process completed!
local-ai-1  | 12:46PM INF [WatchDog] starting watchdog
local-ai-1  |
local-ai-1  |  ┌───────────────────────────────────────────────────┐
local-ai-1  |  │                   Fiber v2.52.4                   │
local-ai-1  |  │               http://127.0.0.1:8080               │
local-ai-1  |  │       (bound on host 0.0.0.0 and port 8080)       │
local-ai-1  |  │                                                   │
local-ai-1  |  │ Handlers ........... 181  Processes ........... 1 │
local-ai-1  |  │ Prefork ....... Disabled  PID ................. 1 │
local-ai-1  |  └───────────────────────────────────────────────────┘
local-ai-1  |

Additional context: query used

root@localai:~# curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "model": "starling-openvino",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'
mudler commented 4 months ago

cc @fakezeta can you confirm with your setup? I haven't given OpenVINO a shot locally yet, as I'm using SYCL; otherwise I will have a look later in the weekend.

fakezeta commented 4 months ago

Hi all, I'm currently on vacation with my family this week and as such I have limited capacity to support (I only have my mobile phone).

From the last log I saw no error. @gericho can you kindly do a test adding "stream": true to the JSON?

In the code, streaming and non-streaming generation are handled differently.
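
For reference, a sketch of the request from the original report with the streaming flag added (same endpoint, model name and prompt; nothing else changed):

curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{
     "stream": true,
     "model": "starling-openvino",
     "prompt": "A long time ago in a galaxy far, far away",
     "temperature": 0.7
   }'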

localai-bot commented 4 months ago

Certainly, I can try that. However, to provide a more comprehensive solution, could you please provide the specific code or configuration file where the "stream" option should be added? I don't want to make unnecessary changes or modify the wrong file. Thank you.

gericho commented 4 months ago

So I tried with the stream option and got letter-by-letter inference as expected. The CPU usage is around 10-15%; the performance is way worse than what @fakezeta showed here on Reddit.

Here is the command with the "stream" option I just used: curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{ "stream": true, "model": "starling-openvino", "prompt": "A long time ago in a galaxy far, far away", "temperature": 0.7 }'

See here: fakezeta's CPU specs. See here: my CPU specs.

@fakezeta said his LocalAI is running on a VM; mine is running on Proxmox, passing the iGPU to the LXC and then to the Docker container. So theoretically, given the CPU specs, it should run at almost the same speed, if not faster, on the i3-N300.

Here is my LXC status while working; the RAM usage is way lower than the model size itself:

[Screenshot 2024-04-15 195247]


root@localai:~# vainfo
error: can't connect to X server!
libva info: VA-API version 1.17.0
libva info: Trying to open /usr/lib/x86_64-linux-gnu/dri/iHD_drv_video.so
libva info: Found init function __vaDriverInit_1_17
libva info: va_openDriver() returns 0
vainfo: VA-API version: 1.17 (libva 2.12.0)
vainfo: Driver version: Intel iHD driver for Intel(R) Gen Graphics - 23.1.1 ()
vainfo: Supported profile and entrypoints
      VAProfileNone                   : VAEntrypointVideoProc
      VAProfileNone                   : VAEntrypointStats
      VAProfileMPEG2Simple            : VAEntrypointVLD
      VAProfileMPEG2Main              : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointVLD
      VAProfileH264Main               : VAEntrypointEncSliceLP
      VAProfileH264High               : VAEntrypointVLD
      VAProfileH264High               : VAEntrypointEncSliceLP
      VAProfileJPEGBaseline           : VAEntrypointVLD
      VAProfileJPEGBaseline           : VAEntrypointEncPicture
      VAProfileH264ConstrainedBaseline: VAEntrypointVLD
      VAProfileH264ConstrainedBaseline: VAEntrypointEncSliceLP
      VAProfileVP8Version0_3          : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointVLD
      VAProfileHEVCMain               : VAEntrypointEncSliceLP
      VAProfileHEVCMain10             : VAEntrypointVLD
      VAProfileHEVCMain10             : VAEntrypointEncSliceLP
      VAProfileVP9Profile0            : VAEntrypointVLD
      VAProfileVP9Profile0            : VAEntrypointEncSliceLP
      VAProfileVP9Profile1            : VAEntrypointVLD
      VAProfileVP9Profile1            : VAEntrypointEncSliceLP
      VAProfileVP9Profile2            : VAEntrypointVLD
      VAProfileVP9Profile2            : VAEntrypointEncSliceLP
      VAProfileVP9Profile3            : VAEntrypointVLD
      VAProfileVP9Profile3            : VAEntrypointEncSliceLP
      VAProfileHEVCMain12             : VAEntrypointVLD
      VAProfileHEVCMain422_10         : VAEntrypointVLD
      VAProfileHEVCMain422_12         : VAEntrypointVLD
      VAProfileHEVCMain444            : VAEntrypointVLD
      VAProfileHEVCMain444            : VAEntrypointEncSliceLP
      VAProfileHEVCMain444_10         : VAEntrypointVLD
      VAProfileHEVCMain444_10         : VAEntrypointEncSliceLP
      VAProfileHEVCMain444_12         : VAEntrypointVLD
      VAProfileHEVCSccMain            : VAEntrypointVLD
      VAProfileHEVCSccMain            : VAEntrypointEncSliceLP
      VAProfileHEVCSccMain10          : VAEntrypointVLD
      VAProfileHEVCSccMain10          : VAEntrypointEncSliceLP
      VAProfileHEVCSccMain444         : VAEntrypointVLD
      VAProfileHEVCSccMain444         : VAEntrypointEncSliceLP
      VAProfileAV1Profile0            : VAEntrypointVLD
      VAProfileHEVCSccMain444_10      : VAEntrypointVLD
      VAProfileHEVCSccMain444_10      : VAEntrypointEncSliceLP

Please note, it has generated over 6000 tokens and is still counting... how can I limit that?

EDIT: Just finished; here is the log

local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"J"}],"usage":{"prompt_tokens":0,"completion_tokens":7061,"total_tokens":7061}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"o"}],"usage":{"prompt_tokens":0,"completion_tokens":7062,"total_tokens":7062}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"h"}],"usage":{"prompt_tokens":0,"completion_tokens":7063,"total_tokens":7063}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"n"}],"usage":{"prompt_tokens":0,"completion_tokens":7064,"total_tokens":7064}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":" "}],"usage":{"prompt_tokens":0,"completion_tokens":7065,"total_tokens":7065}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"W"}],"usage":{"prompt_tokens":0,"completion_tokens":7066,"total_tokens":7066}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"i"}],"usage":{"prompt_tokens":0,"completion_tokens":7067,"total_tokens":7067}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"l"}],"usage":{"prompt_tokens":0,"completion_tokens":7068,"total_tokens":7068}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"l"}],"usage":{"prompt_tokens":0,"completion_tokens":7069,"total_tokens":7069}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"i"}],"usage":{"prompt_tokens":0,"completion_tokens":7070,"total_tokens":7070}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"a"}],"usage":{"prompt_tokens":0,"completion_tokens":7071,"total_tokens":7071}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"m"}],"usage":{"prompt_tokens":0,"completion_tokens":7072,"total_tokens":7072}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":"s"}],"usage":{"prompt_tokens":0,"completion_tokens":7073,"total_tokens":7073}}
local-ai-1  |
local-ai-1  | 6:09PM DBG completion streaming sending chunk: {"choices":[{"index":0,"finish_reason":"","text":":"}],"usage":{"prompt_tokens":0,"completion_tokens":7074,"total_tokens":7074}}
gericho commented 4 months ago

UPDATE: As per @fakezeta's advice, since the int8 model does not give usable performance on the smaller i3-N300, I tried the Starling int4 model and got much faster results! I'm still not able to limit the max token output, BTW.

Here is the link to the model used: fakezeta/Starling-LM-7B-beta-openvino-int4

@fakezeta also advised trying an Ollama or llama.cpp GGUF model, which could be faster. I will update this post eventually.

fakezeta commented 4 months ago

@mudler there could be something in the non-streaming generation since streaming is working.

I'll look at it when I get home in the weekend.

@gericho Ollama uses the same llama.cpp that LocalAI uses for GGUF, but on your machine I expect OpenVINO 4-bit quantization to be faster.

The i3-N300 has a 32 EU GPU like mine but could be power limited, since it has a TDP of only 7 watts vs 65 for mine.

https://ark.intel.com/content/www/us/en/ark/products/231806/intel-core-i3-n300-processor-6m-cache-up-to-3-80-ghz.html

RAM speed is also important; what kind of RAM do you have? I use DDR5-5600.

gericho commented 4 months ago

> RAM speed is also important; what kind of RAM do you have? I use DDR5-5600.

32GB DDR5 4800MHz CL40

@fakezeta thank you for pointing out some more OpenVINO GGUF models from https://huggingface.co/helenai

EDIT: just to be clear, OpenVINO does not support GGUF. It is llama.cpp that uses GGUF models and has SYCL acceleration for the iGPU. So, if I'm correct, only LocalAI Docker images like quay.io/go-skynet/local-ai:master-sycl-f16-ffmpeg have accelerated (SYCL) support for GGUF models.

gericho commented 4 months ago

Just an update with some performance data on the i3-N300 (8 cores, 32 GB single-channel 4800 MHz RAM, 7 W TDP). It may also be worth checking whether the BIOS can be tweaked, since this machine was born fanless but can actually accept a 70 mm fan:

Docker Image used: quay.io/go-skynet/local-ai:master-sycl-f16-ffmpeg

Model fakezeta/Starling-LM-7B-beta-openvino-int4

trial 1: 611 tokens / 33 s = 18.5 tk/s
trial 2: 517 / 30 = 17.23 tk/s
trial 3: 584 / 32 = 18.25 tk/s

around 18 tk/s

Model fakezeta/Starling-LM-7B-beta-openvino-int8

trial 1: 583 tokens / 40 s = 14.5 tk/s
trial 2: 539 / 38 = 14.18 tk/s
trial 3: 618 / 41 = 15.07 tk/s

around 14.5 tk/s

Note: these trials were timed manually, so the whole chain is evaluated: prompt evaluation, token generation, and token decoding. Note also that this machine is a Proxmox 8 host, and it simultaneously runs a Home Assistant VM and some LXCs; 3 LXCs are sharing the OpenVINO and VAAPI backends too!

The idea is to give Home Assistant a conversation agent, so it's important that the model can handle function calls, not necessarily (or primarily) chitchat responses. So the models mentioned above are way overkill for this, IMHO.

fakezeta commented 4 months ago

Back at home and working on this thanks to @gericho's availability on direct chat. Happy to see that the performance is consistent with my test hardware; it confirms my idea of OpenVINO as a viable solution for edge use cases.

I tried the curl command above: curl http://localhost:8080/v1/completions -H "Authorization: Bearer REDACTED" -H "Content-Type: application/json" -d '{ "stream": false, "model": "gpt-3.5-turbo", "max_tokens": 15, "prompt": "Why the grass is green?", "temperature": 0.7 }'

output is {"created":1713541361,"object":"text_completion","id":"bbea4f56-de56-495c-b696-60467772558d","model":"gpt-3.5-turbo","choices":[{"index":0,"finish_reason":"stop","text":"\n\nThe grass appears green due to the presence of a pigment called"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

with this model definition file:

name: gpt-3.5-turbo
backend: transformers
parameters:
  model: fakezeta/Starling-LM-7B-beta-openvino-int8
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
stopwords:
- <|end_of_turn|>
- <|endoftext|>
prompt_cache_path: "cache"
prompt_cache_all: true
template:
  chat_message: |
    {{if eq .RoleName "system"}}{{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "assistant"}}<|end_of_turn|>GPT4 Correct Assistant: {{.Content}}<|end_of_turn|>{{end}}{{if eq .RoleName "user"}}GPT4 Correct User: {{.Content}}{{end}}

  chat: |
    {{.Input}}<|end_of_turn|>GPT4 Correct Assistant:

  completion: |
    {{.Input}}

The Authorization: Bearer header can be removed; I use it because I have LocalAI exposed to the internet. So both non-streaming generation and max_tokens seem to work.

I asked @gericho to try this command and report feedback.

nickp27 commented 4 months ago

Just a comment to say that progress on OpenVINO has brought me back to LocalAI, so thank you for that. I had experimented with other Intel OpenAI-API solutions like BigDL-LLM with FastChat and IPEX/NeuralChat, but usually abandoned them for performance or support reasons.

Got it working today with Starling-LM-7B-beta-openvino-int4, with solid performance on an i5-1135G7.

Currently testing with aless2212/Mistral-7B-v0.2-openvino-int4-cpu - performance seems as good, but I cannot for the life of me make the chat template work - I keep getting nonsense responses. Will keep trying.

fakezeta commented 4 months ago

Thank you for your kind words, which truly warm my heart, especially considering the late-night hours I’ve spent on the transformer backend.

Regarding the chat template issues, the pull request #2090 should provide a solution. You will now be able to utilize the chat template included in the tokenizer_config.json file.

The model definition has been streamlined to the following:

name: mistral
backend: transformers
parameters:
  model: aless2212/Mistral-7B-v0.2-openvino-int4-cpu
context_size: 8192
threads: 6
f16: true
type: OVModelForCausalLM
template:
  use_tokenizer_template: true
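
As a usage sketch (assuming LocalAI's OpenAI-compatible /v1/chat/completions endpoint and the model name from the definition above; max_tokens is optional), a chat request against this configuration could look like:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
     "model": "mistral",
     "messages": [{"role": "user", "content": "Why is the grass green?"}],
     "max_tokens": 100
   }'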


nickp27 commented 4 months ago

Just updated to the latest docker image. Suspect it might be an issue with the model itself.

Config:

name: mistral-openvino
backend: transformers
parameters:
  model: aless2212/Mistral-7B-v0.2-openvino-int4-cpu
context_size: 8192
threads: 8
f16: true
type: OVModelForCausalLM
prompt_cache_path: "cache"
prompt_cache_all: true
template:
  use_tokenizer_template: true

Curl test:


curl http://localhost:8080/v1/completions -H "Content-Type: application/json" -d '{ "stream": false, "model": "mistral-openvino", "max_tokens": 20, "prompt": "Tell me a story", "temperature": 0.7 }'

Response:

{"created":1713694939,"object":"text_completion","id":"a53748bc-c52a-48c8-8f50-a3115fd8fd01","model":"mistral-openvino","choices":[{"index":0,"finish_reason":"stop","text":"The as in redet #A.W. The more… Universityof1 # is a."}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}
fakezeta commented 4 months ago

@nickp27 the PR is on its journey to be merged: I just opened it yesterday 😄

I tested this model for you with my local build and I think there is something not working with it: curl http://localhost:8080/v1/completions -H "Authorization: Bearer c969ac29" -H "Content-Type: application/json" -d '{ "stream": false, "model": "mistral-openvino", "max_tokens": 100, "prompt": "Tell me a story", "temperature": 0.7 }'

{"created":1713696507,"object":"text_completion","id":"b7929f05-2012-4300-b2ba-6aee6e7df81e","model":"mistral-openvino","choices":[{"index":0,"finish_reason":"stop","text":"in #101 ( Question 391 ( User Question : L3, #17390020. The Goddubiquest_120,)\n, 2010riding for you\n Q: Q: #1 Question: #w by by bytown_ Question marks of Question of course Question Marker ( Questionnai T1 # Q3 # QEar- Question. Question"}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

Also using the mistral template explicitly

template:
  chat: &chat |
    [INST] {{.Input}} [/INST]    
  completion: *chat

gives garbage: {"created":1713697087,"object":"text_completion","id":"34a6c693-635b-4c29-a22b-c3c0109485c4","model":"mistral-openvino","choices":[{"index":0,"finish_reason":"stop","text":"\nOnce orheresaid: weit, eranotherwiseashe mostimmediate. rodznaive, sostra in a new!\n QEP: a20- Question: mice,* Question: I's Question: the numbers Question—\n QEs more siendo_ Question: Q miemage^{ Question Question of Question: Q8 #dan a Question: Q Fuß est- Q: Q\\ Question: Q6 \" Question: Question "}],"usage":{"prompt_tokens":0,"completion_tokens":0,"total_tokens":0}}

nickp27 commented 4 months ago

Good shout - I'll go back to your Starling model for now.


fakezeta commented 4 months ago

PR merged; it's in the latest Docker build local-ai:master-sycl-f16-ffmpeg.

On my HF profile you can find some other models like WizardLM2 and also Llama3. Beware that I couldn't get Llama3 to work properly: I still have to understand whether it's a configuration issue (like stopwords and chat template) or a transformers library issue fixed in 4.40, which I cannot use since optimum-intel supports up to 4.39. If you have time to investigate, you're welcome 😄

Since this issue is becoming more of a general discussion on OpenVINO, why don't we close it and continue in a discussion thread?