Closed: avitos closed this issue 6 months ago
Hi, how much RAM do you have on your machine? Which model are you trying to use? Remember, it's not just 32k but 32k * SizeOf(Integer) that gets allocated in RAM just for the token buffer, and that does not include all the other allocations needed for inference, which are substantial. I suspect that not enough memory is available.
For example, you can use 1M tokens in Google Gemini because the server has enough RAM + VRAM to support that large a context. At present you will not be able to use an exceptionally large context size on consumer hardware.
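To put rough numbers on why large contexts cost so much, the dominant per-context allocation is the KV cache, roughly 2 * n_layer * n_ctx * (n_head_kv * head_dim) * bytes-per-element. Here is a back-of-the-envelope sketch using constants typical of an 8B Llama-3-class model; treat them as assumptions and read the real values from the GGUF metadata:

```cpp
// kv_estimate.cpp -- rough KV-cache size for a requested context length.
#include <cstdint>
#include <cstdio>

int main() {
    // Assumed architecture constants for an 8B Llama-3-class model (check the GGUF metadata).
    const uint64_t n_layer   = 32;
    const uint64_t n_head_kv = 8;      // grouped-query attention
    const uint64_t head_dim  = 128;
    const uint64_t bytes_el  = 2;      // fp16 K/V entries
    const uint64_t n_ctx     = 32768;  // the 32k context being requested

    const uint64_t kv_bytes = 2 /* K and V */ * n_layer * n_ctx * (n_head_kv * head_dim) * bytes_el;
    printf("KV cache for n_ctx=%llu: %.2f GiB (weights and compute buffers come on top)\n",
           (unsigned long long) n_ctx, kv_bytes / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
```

With those numbers a 32k context needs about 4 GiB for the cache alone, before the quantized weights (around 5 GB for a Q4_K_M 8B model) and the compute buffers are counted.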
On my RTX-3060 (12GB VRAM) I can only set max_context to about 20k. First make sure the model works by setting max_context to 4k or 8k, then increase until you get an error, which will be the max supported by the resources on your system.
And note that running on CPU will be slow as inference is very mathematically intensive.
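If it helps, here is a minimal way to find that ceiling empirically, sketched against the llama.cpp C API that sits underneath the engine (2024-era function and field names, so verify them against the headers you build with): load the model once, then try progressively larger context sizes until context creation fails, which shows up either as a null context or as an out-of-memory message in the log.

```cpp
// probe_ctx.cpp -- find the largest context this machine can allocate (CPU only).
#include "llama.h"
#include <cstdint>
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { printf("usage: %s model.gguf\n", argv[0]); return 1; }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0;  // CPU only for this test
    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (!model) { printf("model load failed\n"); return 1; }

    for (uint32_t n_ctx = 4096; n_ctx <= 65536; n_ctx *= 2) {
        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx = n_ctx;
        llama_context * ctx = llama_new_context_with_model(model, cparams);
        if (!ctx) { printf("n_ctx = %u: allocation failed, stop here\n", n_ctx); break; }
        printf("n_ctx = %u: ok\n", n_ctx);
        llama_free(ctx);
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```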
Which model are you trying to use?
I'm using the Meta-Llama-3-8B-Instruct-64k.Q4_K_M model.
Okay, I got it. I have 32 GB of RAM, with no more than 8 GB taken by other programs. Experimentally I settled on max_tokens=28000 as acceptable.
But if I set max_tokens to that acceptable value and try to run the text from the file, the executable's RAM usage first climbs to 5.9 GB. At around 4 GB this error appears:
GGML_ASSERT: cpp\llama\llama.cpp:11616: n_tokens_all <= cparams.n_batch
After reaching 5.9 GB, RAM utilization drops to 2.4 GB and the console application closes after a few seconds.
This error probably points to the cause:
GGML_ASSERT: cpp\llama\llama.cpp:11616: n_tokens_all <= cparams.n_batch
Yeah, I ran into this yesterday myself. I'm in the process of sorting it out.
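For anyone landing on the same assert: it fires when more tokens are handed to a single decode call than the configured batch size allows, which is easy to hit when a whole file is pushed in as one prompt. A small sketch of the two llama.cpp parameters involved (2024-era field names, treat them as assumptions to verify against your headers):

```cpp
// Sketch of the relationship behind "n_tokens_all <= cparams.n_batch".
#include "llama.h"
#include <cstdint>

llama_context_params make_ctx_params(uint32_t n_ctx_wanted) {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx   = n_ctx_wanted;  // total context window, e.g. 28000
    cparams.n_batch = 2048;          // max tokens accepted by one llama_decode() call
    // A 6000-token prompt pushed in a single call trips the assert:
    // 6000 (n_tokens_all) > 2048 (n_batch). Either raise n_batch or split the
    // prompt into chunks of at most n_batch tokens and decode them sequentially.
    return cparams;
}
```

So the context size (n_ctx) and the per-call batch limit (n_batch) are separate knobs; a long prompt has to respect the second one as well.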
I've been experimenting with a new API. Try this and see if it works better. This is a very small subset, just enough to test the new inference code.
Yes, the model now loads and works.
The only thing is that I switched to the CPU and specified the correct number of context tokens in LME_LoadModel(CModel, 28000, 0);
As an example I gave it a text file with the first articles of the U.S. Constitution to see whether it grasps the essence. The result is excellent, as the output shows.
Great! I'm really looking forward to the API update!
Wonderful, glad to hear. Thanks! I will continue working on this. 👍🏿
Ok, here is a full build that you can test if you wish. I'm wrapping things up for a repo update soon. The testbeds show how to use most of the features.
Thank you. Tested it and now all functions work well!
Happy to hear. 👍🏿 Note, I've now released an update to the repo; make sure you grab that one.
When loading models with a 32k-64k token context, an error appears: Error: [Model_Load] A call to an OS function failed.
AMaxContext is set to 32 or 64 thousand. The CPU is used. How can I use a large context with models that support it?
Example: https://huggingface.co/NurtureAI/Meta-Llama-3-8B-Instruct-64k-GGUF