Now that I've implemented GGUF support in clip.cpp, it's time to combine clip.cpp + llama.cpp = llava.cpp (the first model to be supported in this repo).
For now, I'm copying the CLIP conversion, model-loading, and inference code from clip.cpp and making the necessary changes. In the future, these changes may be merged upstream, and clip.cpp may become a submodule in this repo.
- [x] LLaVA surgery: merge the base and LoRA weights, strip out the multimodal projector (see the surgery sketch after this list).
- [x] Convert the LLaMA part with llama.cpp.
- [x] Update the CLIP conversion script to save a LLaVA encoder model in GGUF (a writer sketch follows below).
- [x] Load the CLIP vision model with the LLaVA projector in clip.cpp.
- [ ] Update the `clip_image_encode` function to get image hidden states from `layers[-2]` (see the hidden-states sketch below).
- [ ] Write a simple example for end-to-end LLaVA inference.
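For the surgery step, the idea is to split the projector tensors out of the combined checkpoint so that the remaining LLaMA weights can be converted with llama.cpp as usual. Here is a minimal sketch, assuming a plain PyTorch state dict whose projector tensors carry an `mm_projector` prefix; the prefix and the output file names are illustrative, not necessarily what the actual script uses:

```python
import torch

# Load the merged LLaVA checkpoint (base + LoRA already folded together).
# If LoRA pairs are still present, they can be merged first:
#   W_merged = W + (lora_alpha / r) * (lora_B @ lora_A)
ckpt = torch.load("llava-13b/pytorch_model.bin", map_location="cpu")

# Split the multimodal projector tensors from the LLaMA tensors.
# "mm_projector" is the prefix used by the LLaVA reference code; treat it
# as an assumption if your checkpoint differs.
projector = {k: v for k, v in ckpt.items() if "mm_projector" in k}
llama_only = {k: v for k, v in ckpt.items() if "mm_projector" not in k}

torch.save(projector, "llava.projector")   # picked up by the CLIP converter
torch.save(llama_only, "llama-only.bin")   # converted with llama.cpp as usual
```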
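For the GGUF side of the conversion, here is a hedged sketch of how metadata and tensors can be written with the `gguf` Python package that ships with llama.cpp. The architecture string, metadata keys, and tensor name below are placeholders, not this repo's actual schema:

```python
import gguf
import numpy as np

writer = gguf.GGUFWriter("llava-encoder.gguf", "clip")  # arch string is illustrative

# Hypothetical examples of metadata a vision encoder might record.
writer.add_uint32("clip.vision.embedding_length", 1024)
writer.add_bool("clip.has_llava_projector", True)

# Dummy tensor standing in for real converted weights.
writer.add_tensor("v.patch_embd.weight", np.zeros((1024, 3, 14, 14), dtype=np.float32))

writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```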
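As for the `layers[-2]` item, the point is that LLaVA feeds the vision tower's penultimate-layer hidden states into the projector rather than the final layer's output. A quick illustration with Hugging Face transformers (the model name is just an example):

```python
import numpy as np
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

model_id = "openai/clip-vit-large-patch14"  # illustrative CLIP ViT-L tower
model = CLIPVisionModel.from_pretrained(model_id)
processor = CLIPImageProcessor.from_pretrained(model_id)

# Random stand-in for a real image (H, W, C uint8).
image = (np.random.rand(224, 224, 3) * 255).astype("uint8")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    out = model(pixel_values, output_hidden_states=True)

# hidden_states[-1] is the last layer; LLaVA takes the second-to-last
# (and drops the CLS token before projecting).
features = out.hidden_states[-2]  # shape: (1, 257, 1024) for this model
print(features.shape)
```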
I think this is enough for the initial release. I will streamline the implementation afterwards.
This is still WIP.