tinyBigGAMES / LMEngine

Local LLM Inference
BSD 3-Clause "New" or "Revised" License

Error for 32k-64k models #3

Closed avitos closed 6 months ago

avitos commented 6 months ago

When loading models with a 32k-64k token context, an error appears: Error: [Model_Load] A call to an OS function failed.

AMaxContext is set to 32 or 64 thousand. The CPU is used for inference. How can I use a large context with models that support it?

Example: https://huggingface.co/NurtureAI/Meta-Llama-3-8B-Instruct-64k-GGUF

jarroddavis68 commented 6 months ago

Hi, how much RAM do you have on your machine? Which model are you trying to use? Remember, it's not just 32k but 32k * SizeOf(Integer) that is allocated in RAM, and that is only the space for the tokens. It does not include all the allocations needed for inference, and there are a lot of them. I suspect that not enough memory is available.

For example, you can use 1M tokens in Google Gemini because the server has enough RAM + VRAM to support that large context. At present you will not be able to use an exceptionally large context size on consumer hardware.

On my RTX-3060 (12GB VRAM) I can only set max_context to about 20k. First make sure the model works by setting max_context to 4k or 8k, then increase until you get an error, which will be the max supported by the resources on your system.
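
Something like the following back-off loop can find that limit for you. This is only a sketch: it assumes LME_LoadModel takes the model path, the max context, and a GPU-layer count, and returns a Boolean success flag, which may not match the actual LMEngine signature.

    // Sketch only: LME_LoadModel's exact signature/return type is assumed here.
    const
      CModel = 'Meta-Llama-3-8B-Instruct-64k.Q4_K_M.gguf';  // example local model file
    var
      LMaxContext: UInt32;
    begin
      LMaxContext := 32768;                 // start at the model's advertised context
      while LMaxContext >= 4096 do
      begin
        // third argument 0 follows the CPU-only call shown later in this thread
        if LME_LoadModel(CModel, LMaxContext, 0) then
        begin
          WriteLn('Model loaded with max_context = ', LMaxContext);
          Break;
        end;
        LMaxContext := LMaxContext div 2;   // halve and retry until it fits in available RAM
      end;
    end;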

And note that running on CPU will be slow as inference is very mathematically intensive.

Which model are you trying to use?

avitos commented 6 months ago

I'm using the Meta-Llama-3-8B-Instruct-64k.Q4_K_M model.

Okay, I got it. I have 32 GB of RAM, with no more than 8 GB occupied by other programs. Experimentally, I found max_tokens=28000 to be acceptable.

But if I set max_tokens to that acceptable value and try to run the text from the file, the executable first fills RAM up to 5.9 GB. At about 4 GB an error appears: GGML_ASSERT: cpp\llama\llama.cpp:11616: n_tokens_all <= cparams.n_batch

After reaching 5.9 GB, RAM usage drops to 2.4 GB and the console application closes after a few seconds.

This error probably indicates something.

jarroddavis68 commented 6 months ago

GGML_ASSERT: cpp\llama\llama.cpp:11616: n_tokens_all <= cparams.n_batch

Yeah, I ran into this yesterday myself. That assert fires when more prompt tokens are submitted in a single decode call than the configured batch size (n_batch) allows. I'm in the process of trying to sort it out.

jarroddavis68 commented 6 months ago

test.zip

I've been experimenting with a new API. Try this and see if it works better. This is a very small subset, just enough to test the new inference code.

avitos commented 6 months ago

Yes, the model now loads and works. The only thing is that I switched to the CPU and specified the correct number of context tokens in LME_LoadModel(CModel, 28000, 0); As an example, I fed it a file containing the first articles of the U.S. Constitution to get the gist of them. The result is excellent, as you can see from the output.
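
For clarity, here is the same call with the arguments annotated. This reflects my understanding of the parameters from testing; the exact meanings may differ.

    // My reading of the arguments (assumptions, not confirmed against the API docs):
    LME_LoadModel(
      CModel,   // path to Meta-Llama-3-8B-Instruct-64k.Q4_K_M.gguf
      28000,    // max context in tokens -- the largest value that fits in my 32 GB of RAM
      0         // GPU layers to offload; 0 keeps inference entirely on the CPU
    );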

Great! I'm really looking forward to the API update!

(Screenshot attached: Screenshot_390)

jarroddavis68 commented 6 months ago

Wonderful, glad to hear. Thanks! I will continue working on this. 👍🏿

jarroddavis68 commented 5 months ago

test.zip

Ok, here is a full build that you can test if you wish. I'm wrapping things up for a repo update soon. The testbeds show how to use most of the features.

avitos commented 5 months ago

Thank you. I tested it and all the functions now work well!

jarroddavis68 commented 5 months ago

Thank you. I tested it and all the functions now work well!

Happy to hear. 👍🏿 Note, I've now released the update to the repo, so make sure you grab that one.