I would like to help write this up. Can you give section headers?
hi @fire,
Thanks for offering a hand in this. My considerations for this issue are as follows:
Feel free to contribute to any of them.
clip.cpp
Project: CLIP helps computers understand images and text together. It's used in many areas, like when you search for an image online or when a computer needs to describe what's in an image without any help.
Size: This project is very small; it can use multi-modal models as small as 85.6 MB. This means clip.cpp can be used on devices that don't have a lot of storage space.
Startup Time: clip.cpp starts up quickly. This is important because programs can take a long time to start, and on servers and phones a fast cold start is crucial.
A more appealing visualization, including a header with, for example, badges for the license etc. Unfortunately, I'm not a visual guy :D
I would like a video showing the command being typed in a terminal with a PNG, side by side with the photo, and the result being returned.
My understanding is that this could be used with a BLIP caption model, such as ‘blip-base’, for zero-shot image labeling. Is that correct?
I think this project could gain a lot of traction if we can get ViT-bigG-14 and ViT-L-14/openai working. These are the CLIP models used for text encoding during SDXL training. (ref)
It would be amazing to get blip-base and blip2-2.7b working. I haven’t looked into the papers to find out which caption model they used.
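To make "zero-shot image labeling" concrete: CLIP embeds the image and each candidate label's text into the same vector space, and the label whose embedding is closest to the image embedding wins. Below is a minimal, self-contained sketch of just that scoring step; the embeddings are dummy values for illustration, whereas in practice they would come from clip.cpp's image and text encoders.

```cpp
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>

// Cosine similarity between two embedding vectors of equal length.
static float cosine_similarity(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-8f);
}

// Zero-shot labeling: pick the label whose text embedding is most similar
// to the image embedding.
int main() {
    std::vector<std::string> labels = {"a photo of a cat", "a photo of a dog"};

    // Dummy 4-dimensional embeddings for illustration only;
    // real CLIP embeddings are typically 512/768/1024-dimensional.
    std::vector<float> image_emb = {0.1f, 0.9f, 0.2f, 0.1f};
    std::vector<std::vector<float>> text_embs = {
        {0.1f, 0.8f, 0.3f, 0.0f},   // "a photo of a cat"
        {0.9f, 0.1f, 0.0f, 0.2f},   // "a photo of a dog"
    };

    size_t best = 0;
    float best_score = -1.0f;
    for (size_t i = 0; i < text_embs.size(); ++i) {
        float score = cosine_similarity(image_emb, text_embs[i]);
        printf("%-24s %.3f\n", labels[i].c_str(), score);
        if (score > best_score) { best_score = score; best = i; }
    }
    printf("predicted label: %s\n", labels[best].c_str());
    return 0;
}
```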
this could be used with a blip caption model
Yes, BLIP and other large multimodal models are a CLIP feature extractor + some bridging mechanism that projects CLIP hidden states into the language model's embedding space + a large language model like OPT, Vicuna, T5, etc. This will be another project, see #31
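To make that composition concrete, here is a rough structural sketch of such a captioning pipeline; every type and function below is a hypothetical placeholder to show how the pieces connect, not a clip.cpp or BLIP API.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical building blocks for illustration; none of these are
// real clip.cpp or BLIP APIs. They only show how the pieces compose.
struct VisionEncoder {
    // Stand-in for a CLIP-style image encoder returning hidden states.
    std::vector<float> encode(const std::string & /*image_path*/) { return {0.1f, 0.2f, 0.3f}; }
};
struct Projector {
    // Stand-in for the bridging module that maps CLIP hidden states
    // into the language model's embedding space.
    std::vector<float> project(const std::vector<float> & h) { return h; }
};
struct LanguageModel {
    // Stand-in for a decoder LLM (OPT, Vicuna, T5, ...) conditioned on the prefix.
    std::string generate(const std::vector<float> & /*prefix*/) { return "a cat sitting on a sofa"; }
};

// BLIP-style captioning, conceptually:
//   image -> CLIP-like vision encoder -> projection -> LLM decodes a caption.
int main() {
    VisionEncoder vision;
    Projector     bridge;
    LanguageModel llm;

    std::vector<float> hidden = vision.encode("example.png"); // CLIP hidden states
    std::vector<float> prefix = bridge.project(hidden);       // bridge to LLM embedding space
    printf("caption: %s\n", llm.generate(prefix).c_str());
    return 0;
}
```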
we can get ViT-bigG-14 and ViT-L-14/openai working
Large OpenAI and OpenCLIP variants are already working in this project. But Stable Diffusion is a long story of its own; it's another project that I want to use clip.cpp in, but yeah, the level of traction matters when deciding how much time to devote to all of it.
Demonstrate model conversion, detail how to compile, and explain the general API (a rough sketch of the API flow follows below).
Talk about possible usage scenarios, especially the cold start issue.
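For the API part, something along these lines could serve as the running example in the write-up. I'm recalling the function names from clip.h, so treat every name and signature below as an assumption to verify against the current header rather than as the actual API.

```cpp
// Sketch of an encode-and-compare flow with clip.cpp. Function names and
// signatures are recalled from clip.h and are assumptions; check the header
// in the repo before copying this into the write-up.
#include <cstdio>
#include "clip.h"

int main() {
    // Placeholder path to a converted GGML model (produced by the conversion script).
    struct clip_ctx * ctx = clip_model_load("models/clip-vit-base-patch32.f16.bin", 1);
    if (!ctx) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    // Image side: load, preprocess, encode into an embedding vector.
    clip_image_u8  img_u8;
    clip_image_f32 img_f32;
    float img_vec[512];                               // 512 = ViT-B/32 projection dim
    clip_image_load_from_file("example.png", &img_u8);
    clip_image_preprocess(ctx, &img_u8, &img_f32);
    clip_image_encode(ctx, /*n_threads=*/4, &img_f32, img_vec);

    // Text side: tokenize and encode a candidate label.
    clip_tokens tokens;
    float txt_vec[512];
    clip_tokenize(ctx, "a photo of a cat", &tokens);
    clip_text_encode(ctx, /*n_threads=*/4, &tokens, txt_vec);

    // Similarity between the two embeddings drives zero-shot labeling / image search.
    float score = clip_similarity_score(img_vec, txt_vec, 512);
    printf("similarity: %.3f\n", score);

    clip_free(ctx);
    return 0;
}
```

The compile step and the conversion script can then be documented around this example (the project builds with CMake, so a standard configure-and-build flow should be enough to reproduce it).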