Also, is it possible to run the inference API through curl? What would its structure be?
Sorry for too many comments. For alpaca, it is stuck at:
```
2023-04-25T11:02:06.569392Z INFO airtifex_api::gen::llm::inference: Loaded tensor 288/291
```
Is this the desired behavior?
Hello, thanks for checking it out!
Regarding your questions:
1. You can log in with the default `admin` account (password `admin`, as used in the curl example below). After login you can create/edit accounts in the Users tab.
2. Yes, for example:
```sh
# save the token
$ curl -H 'Content-Type: application/json' \
    -d '{"username":"admin","password":"admin"}' \
    http://localhost:6901/api/v1/users/login | jq -r .data.token > auth-token

# run inference using the saved token
$ curl -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(cat auth-token)" \
    -d '{"prompt": "The capital of France is", "model": "ggml-alpaca-7b-q4", "save": false}' \
    http://localhost:6901/api/v1/llm/inference
Paris. Its population was estimated at 2,140,526 in January... Budapest is the capital and largest city of Hungary with an approximate population o...
```
You can check all the parameters for inference here: https://github.com/vv9k/AIrtifex/blob/0f5212becf4addde27da45f98660edc50b0dac68/airtifex-core/src/llm.rs#L98
3. It seems that after a recent upgrade of llama-rs it doesn't output the last event when the model is loaded. It should still work, from what I've tested.
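To expand on point 2: additional generation settings can be sent in the same JSON body. The exact field names are defined in the llm.rs file linked above; the `temperature` and `top_p` fields in this sketch are illustrative assumptions, so check them against that file:
```sh
# Illustrative only -- verify the accepted field names in airtifex-core/src/llm.rs
$ curl -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(cat auth-token)" \
    -d '{"prompt": "The capital of France is", "model": "ggml-alpaca-7b-q4", "save": false, "temperature": 0.7, "top_p": 0.9}' \
    http://localhost:6901/api/v1/llm/inference
```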
Thanks. Apparently in airtifex-web you are pointing to the 10.0.0.10 IP address, which doesn't exist:
```toml
[[proxy]]
rewrite = "/api/"
#backend = "http://127.0.0.1:6901/api/v1/"
backend = "http://10.0.0.10:6901/api/v1/"
```
I was able to fix it, and the web UI is working nicely.
But inference is not working, neither through the API nor the web UI. Can you give me a commit ID of llama-rs for which it works well?
Oh right, I forgot to clean up the Trunk.toml, will change this soon.
Looking at the Cargo.lock, 1b20306da5356dc8eba598e0887aad04240413c6 seems to be the llama-rs commit ID I'm using.
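The pinned revision can be read straight from the lockfile, e.g. (assuming the package is listed there as llama-rs):
```sh
# print the llama-rs entry from Cargo.lock, including the pinned git revision
grep -A 3 'name = "llama-rs"' Cargo.lock
```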
Can you send the API configuration here? It can be just the LLM models part.
Also, what model are you using for inference and what specs does your machine have (CPU, RAM)?
EDIT: I just updated Cargo.lock with the latest llama-rs and inference is still working on my side with ggml-alpaca-7b-q4. Try pulling the latest version of this repo, maybe it will help.
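For example (build_release is the Makefile target in this repo; adjust if your local setup differs):
```sh
# pull the latest changes and rebuild
git pull
make build_release
```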
Yes, it is working now. But inference is excruciatingly slow. It took around 1.34 minutes to generate 2 tokens, and it is not working well for multiple concurrent requests (2 minutes each for 4 requests).
I am using an M1 Max system with 64 GB RAM and a 10-core processor. The API is using 100% of all 10 processor cores.
For reference, llama-cli generates 2 tokens in less than 1 second.
What kind of system are you using? Do I need to reduce the number of workers in Tokio?
Hmm, that is weird. I can run 3 sessions concurrently on a Ryzen 5000 mobile CPU with 12 cores and 32 GB of RAM, and it generates a token each second for each session.
Even for a single worker it is consuming all the cores, but now it takes 51 seconds. Is this supposed to happen?
This is what my changed main function looks like:
```rust
tokio::runtime::Builder::new_multi_thread()
    .worker_threads(1) // adjust the number of worker threads as needed
    .enable_all()
    .build()
    .unwrap(),
);
```
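For context, a complete standalone version of that call might look like this (just an illustration; the Arc wrapper is a guess at the enclosing call in my actual main):
```rust
use std::sync::Arc;

fn main() {
    // Limit Tokio itself to a single worker thread. This only affects Tokio's
    // own workers, not threads spawned outside the runtime.
    let runtime = Arc::new(
        tokio::runtime::Builder::new_multi_thread()
            .worker_threads(1)
            .enable_all()
            .build()
            .expect("failed to build Tokio runtime"),
    );

    runtime.block_on(async {
        println!("runtime up with a single Tokio worker thread");
    });
}
```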
It doesn't seem normal, here's how it works for me with 3 sessions at the same time:
The GIF is a bit choppy midway because my CPU is sweating a bit :laughing:
I whipped up a Linux server on Google Cloud Platform with 16 cores and 16 GB RAM and built the system with Rust nightly from scratch. Still it takes 2 minutes. Are you loading the model for every request?
Try running the API server with RUST_LOG=trace and see at which point it hangs. This could help us diagnose it further.
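For example:
```sh
# start the API server with verbose tracing enabled
# (adjust the command to however you normally start it)
RUST_LOG=trace make serve_release
```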
Also, it's normal that even when setting the worker pool to 1 it still consumes more cores, as inference handled by llama-rs is multithreaded and happens outside of the Tokio event loop.
Here is the log. It's not stuck, but very slow.
Looking at the memory profile, it seems you load and unload the model after every request. Is that so?
That doesn't seem right; the model is loaded in a separate thread before the loop over requests and it stays in memory until the process ends.
You can check it in the code here: https://github.com/vv9k/AIrtifex/blob/be134f4cde792f67985527ef8f22a01dbb618ad0/airtifex-api/src/gen/llm/inference.rs#L135
The InferenceSessionManager loads the model upon creation, and this happens before the loop over requests.
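To illustrate the pattern (everything below is a stand-in for illustration only; the actual code is in the file linked above):
```rust
use std::{sync::mpsc, thread};

// Hypothetical stand-in types, not the project's real ones.
struct Model;
struct InferenceRequest {
    prompt: String,
}

impl Model {
    fn load(path: &str) -> Model {
        println!("loading model from {path} once, at startup");
        Model
    }
    fn infer(&self, prompt: &str) -> String {
        format!("completion for: {prompt}")
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<InferenceRequest>();

    // The model is loaded once on a dedicated thread and then reused for
    // every request received over the channel until the process exits.
    let worker = thread::spawn(move || {
        let model = Model::load("ggml-alpaca-7b-q4.bin");
        for req in rx {
            println!("{}", model.infer(&req.prompt));
        }
    });

    // HTTP handlers (running on the Tokio runtime) only send requests over
    // the channel; no per-request model load happens.
    tx.send(InferenceRequest { prompt: "The capital of France is".into() }).unwrap();
    drop(tx); // close the channel so the worker loop ends
    worker.join().unwrap();
}
```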
It was just a guess. I am using PyTorch CPU. Do you want to check the benchmarks on a fresh system by any chance?
Which benchmarks are you talking about? PyTorch?
I only had a chance to test it on 2 AMD CPUs (Ryzen 5000 and 3000), but in both cases I'm getting performance comparable to llama-cli from llama-rs. Was this the one you tested against, or llama.cpp?
I am getting 4 tokens per second on llama.cpp and similar on llama-rs with -t 1, but I am not able to get even a tenth of that speed with this repo.
I was checking if you could try running it on a fresh cloud-based VM. That would probably help us debug this problem.
I will try to run it on a few more systems as well. So far I have tested on macOS (M1 Max, 64 GB RAM, 10-core processor) and Debian (16 cores, 16 GB RAM).
I've spun up a t3.xlarge instance in AWS with 4 cores and 16 GB of RAM, a clean Debian 11 install, and still cannot reproduce.
![example2](https://user-images.githubusercontent.com/46892771/234322520-3b880b91-8ed7-422e-a70d-0b7c0ececb65.gif)
Can you share the list of commands you used to set up the instance and the Rust project? Maybe I am missing something.
Oh, maybe you're not building in release mode? That would explain such a performance gap. How do you start the server? It should be either make serve_release, or build it with make build_release.
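That is:
```sh
# run the API server with release optimizations
make serve_release

# or just build the release binaries
make build_release
```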
I'm also working on a Docker image with a Docker Compose setup, so it should be easy to set up, and the Dockerfile will include a detailed step-by-step setup.
make serve_release reduced overall CPU usage, but inference is still very slow.
I will wait for the Dockerfile; that should clear up my doubts.
Thanks for your help, Wojciech. You are amazing.
Cleaned the cache and built it again. It is working and pretty fast, around 3 tokens per second.
Assuming single-threaded operation, which of these variables can I tweak for maximum performance? How would batch_size and max_inference_sessions impact performance?
```rust
fn default_num_ctx_tokens() -> usize {
    1024
}
fn default_batch_size() -> usize {
    8
}
fn default_repeat_last_n() -> usize {
    64
}
fn default_repeat_penalty() -> f32 {
    1.30
}
fn default_temperature() -> f32 {
    0.80
}
fn default_top_k() -> usize {
    40
}
fn default_top_p() -> f32 {
    0.95
}
fn default_max_inference_sessions() -> usize {
    4
}
fn default_num_threads() -> usize {
    1
}
```
Closing the issue. Apparently, with this pull request, there should be some bump in performance.
I haven't really played around performance-wise yet, so there might be a lot of room for improvement, as I'm focusing on functionality/ease of use first.
I will have to create an issue to investigate potential performance improvements.
Hello Wojciech, this looks like an impressive project.
I was able to get it running, but the request to http://localhost:8080/api/users/login is failing.
Do I need to populate data.db? How do I create new users?