go-skynet / go-llama.cpp

LLama.cpp golang bindings
MIT License

Is the global mutex/callbacks really necessary? #226

Closed: cassanof closed this issue 11 months ago

cassanof commented 11 months ago

See title. I don't see why it would be; both could just live inside the Llama struct.

cassanof commented 11 months ago

FYI, I'll be forking the repo and making some changes -- I really need Go bindings, but I need way more modularity than is currently provided. This may become a hard fork.

mudler commented 11 months ago

PRs are more than welcome; it's an open-source project. However, nothing forbids you from doing a hard fork.

cassanof commented 11 months ago

Ok, great. I will push changes soon and open a PR if wanted.

mudler commented 11 months ago

> See title. I don't see why it would be; both could just live inside the Llama struct.

I don't quite follow: are you referring to https://github.com/go-skynet/go-llama.cpp/blob/master/llama.go#L399 ? That's just for the token callback, but agreed, it can be optimized.

However, keep in mind that, AFAIK, a model can handle only one request at a time as of now, unless things have changed recently.
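For readers following along, here is a minimal sketch of the kind of pattern being discussed at the line referenced above. The names are illustrative and may not match the repository exactly: exported cgo callbacks can't close over a particular model, so a package-level map keyed by the model's C state pointer, guarded by one global mutex, routes tokens back to the right Go callback.

```go
package llama

import "C"

import (
	"sync"
	"unsafe"
)

// Illustrative sketch of the global-mutex pattern under discussion
// (hypothetical names, not necessarily the exact code in llama.go).
var (
	callbackMu sync.Mutex
	callbacks  = map[uintptr]func(token string) bool{}
)

// setCallback registers (or, with a nil cb, removes) the token callback
// associated with a model's C state pointer.
func setCallback(statePtr unsafe.Pointer, cb func(token string) bool) {
	callbackMu.Lock()
	defer callbackMu.Unlock()
	if cb == nil {
		delete(callbacks, uintptr(statePtr))
		return
	}
	callbacks[uintptr(statePtr)] = cb
}

// tokenCallback is invoked from C for every generated token. Because the
// registry is guarded by a single package-level mutex, every model's token
// stream funnels through this one lock, which is the contention point
// raised in this issue.
//
//export tokenCallback
func tokenCallback(statePtr unsafe.Pointer, token *C.char) bool {
	callbackMu.Lock()
	defer callbackMu.Unlock()
	if cb, ok := callbacks[uintptr(statePtr)]; ok {
		return cb(C.GoString(token))
	}
	return true
}
```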

mudler commented 11 months ago

> Ok, great. I will push changes soon and open a PR if wanted.

More than happy to accept it 😊

cassanof commented 11 months ago

> However, keep in mind that, AFAIK, a model can handle only one request at a time as of now, unless things have changed recently.

Ok, I see. I was more worried about running two models at once, with one model having to wait for the other model's mutex lock. I think a quick and dirty fix is to change the mutex to an RWMutex.
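A hedged sketch of that quick fix, reusing the illustrative names from the sketch above (this is not the change that was actually merged): lookups on the token-streaming hot path take only a read lock, so two models streaming concurrently no longer serialize on the registry; only registering or removing a callback needs the write lock.

```go
package llama

import "C"

import (
	"sync"
	"unsafe"
)

// RWMutex variant (illustrative): token-callback lookups are reads, so
// concurrent models only contend when a callback is registered or removed.
var (
	callbackMu sync.RWMutex
	callbacks  = map[uintptr]func(token string) bool{}
)

func setCallback(statePtr unsafe.Pointer, cb func(token string) bool) {
	callbackMu.Lock() // write lock: mutates the map
	defer callbackMu.Unlock()
	if cb == nil {
		delete(callbacks, uintptr(statePtr))
		return
	}
	callbacks[uintptr(statePtr)] = cb
}

//export tokenCallback
func tokenCallback(statePtr unsafe.Pointer, token *C.char) bool {
	callbackMu.RLock()
	cb, ok := callbacks[uintptr(statePtr)]
	callbackMu.RUnlock() // release before invoking cb so a slow callback doesn't hold the lock
	if ok {
		return cb(C.GoString(token))
	}
	return true
}
```

Note that this only removes cross-model contention on the lookup; as mentioned above, a single model still handles one request at a time.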

mudler commented 11 months ago

> > However, keep in mind that, AFAIK, a model can handle only one request at a time as of now, unless things have changed recently.
>
> Ok, I see. I was more worried about running two models at once, with one model having to wait for the other model's mutex lock. I think a quick and dirty fix is to change the mutex to an RWMutex.

That should work. A global map would work too, but what you suggested initially is probably the best approach; however, that might not be the quickest way to do it.

In any case, I'm fine with whichever option you pick; feel free to take a stab at it (just ping me and I'll review it ASAP!).
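For comparison, a hedged sketch of the per-struct direction suggested initially (hypothetical field and method names, not the merged PR): the callback lives on the model, and the only shared state is a sync.Map used inside the exported cgo callback to get from the C state pointer back to the Go wrapper, so separate models never block on a shared mutex.

```go
package llama

import "C"

import (
	"sync"
	"unsafe"
)

// LLama sketches the wrapper struct with the callback stored per model
// (hypothetical fields; the real struct has more).
type LLama struct {
	state         unsafe.Pointer          // opaque C-side model/context handle
	tokenCallback func(token string) bool // per-model callback, no shared lock to invoke it
}

// models maps the C state pointer back to its Go wrapper. sync.Map keeps
// lookups lock-free on the token-streaming hot path.
var models sync.Map // uintptr -> *LLama

// SetTokenCallback assumes it is called before generation starts; setting the
// callback concurrently with generation would need extra synchronization.
func (l *LLama) SetTokenCallback(cb func(token string) bool) {
	l.tokenCallback = cb
	models.Store(uintptr(l.state), l)
}

//export tokenCallback
func tokenCallback(statePtr unsafe.Pointer, token *C.char) bool {
	if v, ok := models.Load(uintptr(statePtr)); ok {
		if cb := v.(*LLama).tokenCallback; cb != nil {
			return cb(C.GoString(token))
		}
	}
	return true
}
```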

hiqsociety commented 11 months ago

@cassanof Just curious: if this mutex is so performance-critical, don't you think you should just stick to using C/C++/Rust bindings instead?

Only because of running two models at once?

Mind sharing what you're working on? I'm into performance optimization too.

cassanof commented 11 months ago

> Only because of running two models at once?

Yes. When a callback is called, the global mutex means other models have to wait for the lock before their own callbacks can be invoked.

> Don't you think you should just stick to using C/C++/Rust bindings instead?

The language you use in this scenario makes such a minuscule difference that I don't think it really matters. If you look at the code, it's pretty much just calls into the C++ bindings. Context switching may have some impact, but it's negligible considering the application of this library, which is running LLMs.

mudler commented 11 months ago

Closing as https://github.com/go-skynet/go-llama.cpp/pull/227 was merged