google / gemma.cpp

lightweight, standalone C++ inference engine for Google's Gemma models.
Apache License 2.0

GRPC support - in scope? #52

Open justinsb opened 4 months ago

justinsb commented 4 months ago

I'd like to be able to run gemma.cpp on Kubernetes. A first step in my rough plan is to add a client/server mode, and I thought I would add gRPC support. Is the project open to having a contrib directory where we can collaborate on this sort of thing? In the future, I'm imagining we could put things like Kubernetes manifests in there also.

I have a simple server (though my C++ is not good!) and an example client in Go, which I will send as a WIP PR to make the discussion more concrete.
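For concreteness, a server entrypoint along these lines might look roughly like the sketch below. The `GemmaService` name, the `Generate` RPC, and the request/reply messages are all hypothetical placeholders for a proto that doesn't exist yet; only the `grpc::ServerBuilder` plumbing is standard gRPC C++ API.

```cpp
// Hypothetical sketch of a gRPC server entrypoint for gemma.cpp.
// "gemma.grpc.pb.h" and gemma::GemmaService::Service are assumed to be
// generated from a (not yet written) gemma.proto; they are placeholders.
#include <grpcpp/grpcpp.h>

#include <memory>

class GemmaServiceImpl final : public gemma::GemmaService::Service {
  grpc::Status Generate(grpc::ServerContext* ctx,
                        const gemma::GenerateRequest* req,
                        gemma::GenerateReply* reply) override {
    // Here we would call into the gemma.cpp generation loop instead of
    // reading from stdin / writing to stdout as run.cc does.
    reply->set_text("...");
    return grpc::Status::OK;
  }
};

int main() {
  GemmaServiceImpl service;
  grpc::ServerBuilder builder;
  builder.AddListeningPort("0.0.0.0:50051", grpc::InsecureServerCredentials());
  builder.RegisterService(&service);
  std::unique_ptr<grpc::Server> server = builder.BuildAndStart();
  server->Wait();  // Block until shutdown.
}
```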

austinvhuang commented 4 months ago

Hi @justinsb, thanks for taking the initiative. There's been interest in client/server capabilities, and I think there are some obvious use cases + value in that.

There are a few things being worked out that intertwine with such an implementation:

  1. The server piece of this would probably be its own frontend entrypoint (basically in place of run.cc); how should these alternative frontends be implemented? (I was working on some example demos, but they're on pause while we're triaging this first wave of post-launch PRs/issues.)
  2. Should these implementations live in this repo (e.g. contrib/) or in separate repos (like https://github.com/namtranase/gemma-cpp-python)?
  3. There's some minor decoupling / code cleanup to better support gemma.cpp-as-a-library / alternative frontend use cases (e.g. we probably want to decouple the CLI argument handling more cleanly; a sketch of what that decoupling might look like follows this list).
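As an illustration of the kind of decoupling item 3 gestures at, one could imagine separating the inference configuration from argv parsing so that every frontend constructs the same struct. `InferenceConfig`, `ParseFlags`, and `Generate` below are illustrative names only, not gemma.cpp's actual API:

```cpp
// Hypothetical sketch: decouple inference settings from CLI parsing so that
// run.cc, a gRPC server, and library users all construct the same config.
#include <cstdio>
#include <string>

struct InferenceConfig {
  std::string weights_path;
  std::string tokenizer_path;
  float temperature = 1.0f;
};

// The core library consumes a plain config, with no knowledge of argv.
void Generate(const InferenceConfig& config, const std::string& prompt) {
  std::printf("generating from %s at T=%.2f for prompt: %s\n",
              config.weights_path.c_str(), config.temperature, prompt.c_str());
}

// Only the CLI frontend knows about argv; a server frontend would instead
// populate the struct from an RPC request.
InferenceConfig ParseFlags(int argc, char** argv) {
  InferenceConfig config;
  if (argc > 1) config.weights_path = argv[1];
  return config;
}

int main(int argc, char** argv) {
  Generate(ParseFlags(argc, argv), "Hello, Gemma!");
}
```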

My suggestion is to keep your initial implementation light (some interfaces may change as a result of #3). We can use that as a basis for thinking through the design gaps + cleanup needed. A meta question is where to have more involved design discussions with the community (I've also opened up the Discussions tab up top but haven't made use of it yet; may look into a Discord).

justinsb commented 4 months ago

> There's been interest in client/server capabilities, and I think there are some obvious use cases + value in that.

That's great news, and I agree!

> There are a few things being worked out that intertwine with such an implementation:
>
> 1. The server piece of this would probably be its own frontend entrypoint (basically in place of run.cc); how should these alternative frontends be implemented? (I was working on some example demos, but they're on pause while we're triaging this first wave of post-launch PRs/issues.)

I had a go: my alternative entrypoint server.cc basically copies and pastes run.cc and starts swapping out functionality so that it reads & writes over gRPC instead of stdin/stdout. I will give it a cleanup pass, but I think the core of gemma.cpp is already very amenable to reuse, and there's not a ton of boilerplate between run.cc and server.cc, so that's a good sign that this is well architected, IMO.

> 2. Should these implementations live in this repo (e.g. contrib/) or in separate repos (like https://github.com/namtranase/gemma-cpp-python)?

My 2c: putting it into the same repo simplifies any refactoring we want to do as part of (1), whether small changes like adjusting function signatures or larger changes like supporting batching (I don't think that's there today?). Over time, the core will stabilize and contrib will grow, and we'll probably move things out of contrib into their own repos and encourage more work in other repos; we saw the same pattern in Kubernetes. But when a project is getting started, if you want everything to be working, IMO you have to be able to make changes to the whole ecosystem at once, and one repo is the best solution I've found.

I don't think this should discourage people from doing things in other repos; rather, I think having some consumers in the same repo acts both as an example and as an existence proof.

> 3. There's some minor decoupling / code cleanup to better support gemma.cpp-as-a-library / alternative frontend use cases (e.g. we probably want to decouple the CLI argument handling more cleanly).

Great! I hope that having a few "consumers" in the repo will help us easily see the impact of these changes on different frontends, because we'll hopefully make any required changes in the same PR. Consumers in other repos can then mirror those changes.

> My suggestion is to keep your initial implementation light (some interfaces may change as a result of #3). We can use that as a basis for thinking through the design gaps + cleanup needed.

Ack - and that's exactly what I'm hoping to achieve by colocating it in this repo.

> A meta question is where to have more involved design discussions with the community.

My view is that the best discussions normally happen in issue comments and PR comments. PRs/code are also part of the conversation; for example, we might host the gRPC frontend in contrib/ or examples/ while it is still an open area of discussion, and then remove it over time (once the LLM community has converged on some RPC approach). One thing we do need in contrib/ or examples/ is some indication that "the code in this directory is not part of the core and might be removed in the future".

> (I've also opened up the Discussions tab up top but haven't made use of it yet; may look into a Discord.)

You might also consider hosting occasional video meetings ("office hours"), though those usually grow organically out of a few ad-hoc discussions as a core team emerges. I haven't personally seen a lot of uptake of the GitHub Discussions feature. I know Discord is big in the AI community, so that might be a good option.