Also, is it possible to run the inference API through curl? What would its structure be?
Sorry for too many comments. For alpaca, it is stuck at:
```
2023-04-25T11:02:06.569392Z INFO airtifex_api::gen::llm::inference: Loaded tensor 288/291
```
Is this the desired behavior?
Hello, thanks for checking it out!
Regarding your questions:
1. You can log in with the default `admin` account (password `admin`, as used in the curl example below). After login you can create/edit accounts in the Users tab.
2. Yes, for example:
```sh
# save the token
$ curl -H 'Content-Type: application/json' \
    -d '{"username":"admin","password":"admin"}' \
    http://localhost:6901/api/v1/users/login | jq -r .data.token > auth-token

# run inference using the saved token
$ curl -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(cat auth-token)" \
    -d '{"prompt": "The capital of France is", "model": "ggml-alpaca-7b-q4", "save": false}' \
    http://localhost:6901/api/v1/llm/inference
Paris. Its population was estimated at 2,140,526 in January... Budapest is the capital and largest city of Hungary with an approximate population o...
```
You can check all the parameters for inference here: https://github.com/vv9k/AIrtifex/blob/0f5212becf4addde27da45f98660edc50b0dac68/airtifex-core/src/llm.rs#L98
3. It seems that after a recent upgrade of llama-rs it doesn't output the last event when the model is loaded. It should still work, from what I've tested.
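To expand on point 2: additional generation settings can be sent in the same JSON body. The exact field names are defined in the llm.rs file linked above; the `temperature` and `top_p` fields in this sketch are illustrative assumptions, so check them against that file:
```sh
# Illustrative only -- verify the accepted field names in airtifex-core/src/llm.rs
$ curl -X POST \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(cat auth-token)" \
    -d '{"prompt": "The capital of France is", "model": "ggml-alpaca-7b-q4", "save": false, "temperature": 0.7, "top_p": 0.9}' \
    http://localhost:6901/api/v1/llm/inference
```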
Thanks. Apparently in airtifex-web you are pointing to the 10.0.0.10 IP address, which doesn't exist:
```toml
[[proxy]]
rewrite = "/api/"
#backend = "http://127.0.0.1:6901/api/v1/"
backend = "http://10.0.0.10:6901/api/v1/"
```
I was able to fix it, and the web UI is working nicely.
But inference is not working, neither through the API nor the web UI. Can you give me a commit ID of llama-rs for which it works well?
Oh right, I forgot to clean up the Trunk.toml, will change this soon.
Looking at the Cargo.lock, 1b20306da5356dc8eba598e0887aad04240413c6 seems to be the llama-rs commit ID I'm using.
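The pinned revision can be read straight from the lockfile, e.g. (assuming the package is listed there as llama-rs):
```sh
# print the llama-rs entry from Cargo.lock, including the pinned git revision
grep -A 3 'name = "llama-rs"' Cargo.lock
```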
Can you send the API configuration here? It can be just the LLM models part.
Also, what model are you using for inference and what specs does your machine have (CPU, RAM)?
EDIT: I just updated Cargo.lock with the latest llama-rs and inference is still working on my side with ggml-alpaca-7b-q4. Try pulling the latest version of this repo, maybe it will help.
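For example (build_release is the Makefile target in this repo; adjust if your local setup differs):
```sh
# pull the latest changes and rebuild
git pull
make build_release
```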
Yes, it is working now. But inference is excruciatingly slow. It took around 1.34 minutes to generate 2 tokens, and it is not working well for multiple concurrent requests (2 minutes each for 4 requests).
I am using an M1 Max system with 64 GB RAM and a 10-core processor. The API is using 100% of all 10 processor cores.
For reference, llama-cli generates 2 tokens in less than 1 second.
What kind of system are you using? Do I need to reduce the number of workers in Tokio?
Hmm, that is weird. I can run 3 sessions concurrently on a Ryzen 5000 mobile CPU with 12 cores and 32 GB of RAM, and it generates a token each second for each session.
Even for a single worker it is consuming all the cores, but now it takes 51 seconds. Is this supposed to happen?
This is what my changed main function looks like:
```rust
tokio::runtime::Builder::new_multi_thread()
    .worker_threads(1) // adjust the number of worker threads as needed
    .enable_all()
    .build()
    .unwrap(),
);
```
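For context, a complete standalone version of that call might look like this (just an illustration; the Arc wrapper is a guess at the enclosing call in my actual main):
```rust
use std::sync::Arc;

fn main() {
    // Limit Tokio itself to a single worker thread. This only affects Tokio's
    // own workers, not threads spawned outside the runtime.
    let runtime = Arc::new(
        tokio::runtime::Builder::new_multi_thread()
            .worker_threads(1)
            .enable_all()
            .build()
            .expect("failed to build Tokio runtime"),
    );

    runtime.block_on(async {
        println!("runtime up with a single Tokio worker thread");
    });
}
```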
It doesn't seem normal, here's how it works for me with 3 sessions at the same time:
The GIF is a bit choppy midway because my CPU is sweating a bit :laughing:
I whipped up a Linux server on Google Cloud Platform with 16 cores and 16 GB RAM and built the system with Rust nightly from scratch. Still it takes 2 minutes. Are you loading the model for every request?
Try running the API server with RUST_LOG=trace and see at which point it hangs. This could help us diagnose it further.
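For example:
```sh
# start the API server with verbose tracing enabled
# (adjust the command to however you normally start it)
RUST_LOG=trace make serve_release
```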
Also, it's normal that even when setting the worker pool to 1 it still consumes more cores, as inference handled by llama-rs is multithreaded and happens outside of the Tokio event loop.
Here is the log. It's not stuck, but very slow.
Looking at the memory profile, it seems you load and unload the model after every request. Is that so?
That doesn't seem right; the model is loaded in a separate thread before the loop over requests and it stays in memory until the process ends.
You can check it in the code here: https://github.com/vv9k/AIrtifex/blob/be134f4cde792f67985527ef8f22a01dbb618ad0/airtifex-api/src/gen/llm/inference.rs#L135
The InferenceSessionManager loads the model upon creation, and this happens before the loop over requests.
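To illustrate the pattern (everything below is a stand-in for illustration only; the actual code is in the file linked above):
```rust
use std::{sync::mpsc, thread};

// Hypothetical stand-in types, not the project's real ones.
struct Model;
struct InferenceRequest {
    prompt: String,
}

impl Model {
    fn load(path: &str) -> Model {
        println!("loading model from {path} once, at startup");
        Model
    }
    fn infer(&self, prompt: &str) -> String {
        format!("completion for: {prompt}")
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<InferenceRequest>();

    // The model is loaded once on a dedicated thread and then reused for
    // every request received over the channel until the process exits.
    let worker = thread::spawn(move || {
        let model = Model::load("ggml-alpaca-7b-q4.bin");
        for req in rx {
            println!("{}", model.infer(&req.prompt));
        }
    });

    // HTTP handlers (running on the Tokio runtime) only send requests over
    // the channel; no per-request model load happens.
    tx.send(InferenceRequest { prompt: "The capital of France is".into() }).unwrap();
    drop(tx); // close the channel so the worker loop ends
    worker.join().unwrap();
}
```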
It was just a guess. I am using PyTorch CPU. Do you want to check the benchmarks on a fresh system by any chance?
Which benchmarks are you talking about? PyTorch?
I only had a chance to test it on 2 AMD CPUs (Ryzen 5000 and 3000), but in both cases I'm getting performance comparable to llama-cli from llama-rs. Was this the one you tested against, or llama.cpp?
I am getting 4 tokens per second on llama.cpp and similar on llama-rs with -t 1, but I am not able to get even a tenth of that speed with this repo.
I was checking if you could try running it on a fresh cloud-based VM. That would probably help us debug this problem.
I will try to run it on a few more systems as well. So far I have tested on macOS (M1 Max, 64 GB RAM, 10-core processor) and Debian (16 cores, 16 GB RAM).
I've spun up a t3.xlarge instance in AWS with 4 cores and 16 GB of RAM, a clean Debian 11 install, and still cannot reproduce.
![example2](https://user-images.githubusercontent.com/46892771/234322520-3b880b91-8ed7-422e-a70d-0b7c0ececb65.gif)
Can you share the list of commands you used to set up the instance and the Rust project? Maybe I am missing something.
Oh, maybe you're not building in release mode? That would explain such a performance gap. How do you start the server? It should be either make serve_release, or build it with make build_release.
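That is:
```sh
# run the API server with release optimizations
make serve_release

# or just build the release binaries
make build_release
```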
I'm also working on a Docker image with a Docker Compose setup, so it should be easy to set up, and the Dockerfile will include a detailed step-by-step setup.
make serve_release reduced overall CPU usage, but inference is still very slow.
I will wait for the Dockerfile; that should clear up my doubts.
Thanks for your help, Wojciech. You are amazing.
Cleaned the cache and built it again. It is working and pretty fast, around 3 tokens per second.
Assuming single-threaded operation, which of these variables can I tweak for maximum performance? How would batch_size and max_inference_sessions impact performance?
```rust
fn default_num_ctx_tokens() -> usize {
    1024
}
fn default_batch_size() -> usize {
    8
}
fn default_repeat_last_n() -> usize {
    64
}
fn default_repeat_penalty() -> f32 {
    1.30
}
fn default_temperature() -> f32 {
    0.80
}
fn default_top_k() -> usize {
    40
}
fn default_top_p() -> f32 {
    0.95
}
fn default_max_inference_sessions() -> usize {
    4
}
fn default_num_threads() -> usize {
    1
}
```
Closing the issue. Apparently, with this pull request, there should be some bump in performance.
I haven't really played around performance-wise yet, so there might be a lot of room for improvement, as I'm focusing on functionality/ease of use first.
I will have to create an issue to investigate potential performance improvements.
Hello Wojciech, this looks like an impressive project.
I was able to get it running, but the request to http://localhost:8080/api/users/login is failing.
Do I need to populate data.db? How do I create new users?