gokayfem / ComfyUI_VLM_nodes

Custom ComfyUI nodes for Vision Language Models, Large Language Models, Image to Music, Text to Music, Consistent and Random Creative Prompt Generation
Apache License 2.0

Llava 34b issues #37

Closed: dicksensei69 closed this issue 6 months ago

dicksensei69 commented 7 months ago

It seems that Llava 34b doesn't work with the current prompt formats. I'm not certain of this, but here are some of my outputs showing what's going on.

Screenshot_2024-03-07_15-18-03

It looks like the 34b may require slightly different prompting than the other models, as described here:

https://github.com/ggerganov/llama.cpp/pull/5267

Using the simple loader, it started to spit out Chinese. Kind of a bummer, but maybe someone can guide me to a solution. If 34b is just a no-go, that's OK as well :)

gokayfem commented 7 months ago

<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n[img-10]\nDescribe the image<|im_end|><|im_start|>assistant\n

Can you try this in the prompt?
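
For reference, a minimal sketch (not code from ComfyUI_VLM_nodes) of assembling that ChatML-style prompt string in Python; "[img-10]" is llama.cpp's llava image-embedding placeholder and the helper name is just illustrative:

# Build the ChatML prompt suggested above from its parts.
def build_chatml_prompt(system_msg, user_msg, image_tag="[img-10]"):
    return (
        f"<|im_start|>system\n{system_msg}<|im_end|>"
        f"<|im_start|>user\n{image_tag}\n{user_msg}<|im_end|>"
        f"<|im_start|>assistant\n"
    )

print(build_chatml_prompt("Answer the questions.", "Describe the image"))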

dicksensei69 commented 7 months ago

I'm still getting funny output. Thanks for your help.

Screenshot_2024-03-07_18-57-00

Here is a link to the model I've been using; it could be nonfunctional. It looks like it is about 10 days older than the ones posted by cjpais on Hugging Face. https://huggingface.co/cmp-nct/llava-1.6-gguf https://huggingface.co/cjpais/llava-v1.6-34B-gguf/

Finally, here is the terminal output. I'm running this on Linux Mint, if it matters, and as you can see from the output I have 2x RTX 3090s. I don't think that messes anything up.

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 2 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
  Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 23 key-value pairs and 543 tensors from /home/dick/proj/ComfyUI/models/LLavacheckpoints/ggml-yi-34b-f16-q_5_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 7168
llama_model_loader: - kv   4:                          llama.block_count u32              = 60
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 20480
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 56
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 5000000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,64000]   = ["<unk>", "<|startoftext|>", "<|endof...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,64000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,64000]   = [2, 3, 3, 3, 3, 3, 1, 1, 1, 3, 3, 3, ...
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32              = 7
llama_model_loader: - kv  18:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  19:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  20:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% for message in messages %}{{'<|im_...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  121 tensors
llama_model_loader: - type q5_K:  361 tensors
llama_model_loader: - type q6_K:   61 tensors
llm_load_vocab: mismatch in special tokens definition ( 498/64000 vs 267/64000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 64000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 7168
llm_load_print_meta: n_head           = 56
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 60
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 20480
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 5000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 30B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 34.39 B
llm_load_print_meta: model size       = 22.65 GiB (5.66 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<|startoftext|>'
llm_load_print_meta: EOS token        = 7 '<|im_end|>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 315 '<0x0A>'
llm_load_tensors: ggml ctx size       =    0.21 MiB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: system memory used  = 13024.31 MiB
llm_load_tensors: VRAM used           = 10169.58 MiB
llm_load_tensors: offloading 27 repeating layers to GPU
llm_load_tensors: offloaded 27/61 layers to GPU
...................................................................................................
llama_new_context_with_model: n_ctx      = 320
llama_new_context_with_model: freq_base  = 5000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 33.75 MB
llama_new_context_with_model: KV self size  =   75.00 MiB, K (f16):   37.50 MiB, V (f16):   37.50 MiB
llama_build_graph: non-view tensors processed: 1264/1264
llama_new_context_with_model: compute buffer total size = 90.06 MiB
llama_new_context_with_model: VRAM scratch buffer: 86.88 MiB
llama_new_context_with_model: total VRAM used: 10290.20 MiB (model: 10169.58 MiB, context: 120.62 MiB)
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
Llama.generate: prefix-match hit

llama_print_timings:        load time =    9620.25 ms
llama_print_timings:      sample time =      22.11 ms /    46 runs   (    0.48 ms per token,  2080.98 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   47613.29 ms /    46 runs   ( 1035.07 ms per token,     0.97 tokens per second)
llama_print_timings:       total time =   47778.26 ms
Prompt executed in 64.51 seconds
gokayfem commented 7 months ago

I think this model is instruction-tuned somewhat differently than the other models. Unfortunately I can't try it; my VRAM is not enough to iterate on this issue.
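
For anyone who wants to poke at this outside ComfyUI, here is a minimal sketch using llama-cpp-python's Llava15ChatHandler. This is not how ComfyUI_VLM_nodes loads models, and the file paths are hypothetical; also note that this handler applies llava-1.5-style formatting rather than the ChatML template embedded in the Yi-34B GGUF, which may be exactly the mismatch discussed above:

from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

# Hypothetical paths: point these at the 34b GGUF and its matching mmproj/CLIP file.
chat_handler = Llava15ChatHandler(clip_model_path="/path/to/mmproj-model-f16.gguf")
llm = Llama(
    model_path="/path/to/ggml-yi-34b-f16-q_5_k.gguf",
    chat_handler=chat_handler,
    n_ctx=2048,       # room for the image embedding plus the reply
    n_gpu_layers=-1,  # offload everything if VRAM allows
)

result = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Answer the questions."},
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "file:///path/to/test.png"}},
            {"type": "text", "text": "Describe the image"},
        ]},
    ],
)
print(result["choices"][0]["message"]["content"])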