I have some concerns about this. In my experience, GGUF with llama.cpp behaves differently from transformers, whereas GGML with chatglm.cpp matches transformers' behavior. I haven't yet pinned down the exact differences, so an optimization for long-context handling on the transformers side would be very helpful.
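
In case it helps others reproduce, here is a minimal comparison sketch, assuming `transformers` and `llama-cpp-python` are installed; the model paths are placeholders for a checkpoint and its GGUF conversion. Both sides use greedy decoding so that any divergence comes from the backend rather than sampling.

```python
# Compare greedy outputs between a transformers checkpoint and its GGUF
# conversion served by llama-cpp-python. Paths are placeholders.
from llama_cpp import Llama
from transformers import AutoModelForCausalLM, AutoTokenizer

PROMPT = "Explain the difference between GGUF and GGML in one sentence."
HF_MODEL = "path/to/hf-checkpoint"  # placeholder
GGUF_FILE = "path/to/model.gguf"    # placeholder

# transformers side: greedy decoding, no sampling noise
tok = AutoTokenizer.from_pretrained(HF_MODEL)
model = AutoModelForCausalLM.from_pretrained(HF_MODEL)
inputs = tok(PROMPT, return_tensors="pt")
out_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
hf_text = tok.decode(
    out_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# llama.cpp side: temperature 0.0 approximates greedy decoding
llm = Llama(model_path=GGUF_FILE, n_ctx=4096, verbose=False)
cpp_text = llm(PROMPT, max_tokens=64, temperature=0.0)["choices"][0]["text"]

print("transformers:", hf_text)
print("llama.cpp:  ", cpp_text)
```

If the two outputs diverge on short prompts as well as long ones, the difference is probably in the conversion or quantization rather than context handling.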