tinyBigGAMES / LMEngine

Local LLM Inference
BSD 3-Clause "New" or "Revised" License

Error for 32k-64k models #3

Closed avitos closed 6 months ago

avitos commented 6 months ago

When loading models with a 32k-64k token context, an error appears: Error: [Model_Load] A call to an OS function failed.

AMaxContext is set to 32 or 64 thousand. The CPU is used for inference. How can I use a large context with models that support it?

Example: https://huggingface.co/NurtureAI/Meta-Llama-3-8B-Instruct-64k-GGUF

jarroddavis68 commented 6 months ago

Hi, how much RAM do you have on your machine? Which model are you trying to use? Remember, it's not just 32k but 32k * SizeOf(Integer) that is allocated in RAM, and that is only the space for the tokens. It does not include all the allocations needed for inference, and there are a lot of them. I suspect that not enough memory is available.

For example, you can use 1M tokens in Google Gemini because the server has enough RAM + VRAM to support that large context. At present you will not be able to use an exceptionally large context size on consumer hardware.

On my RTX-3060 (12GB VRAM) I can only set max_context to about 20k. First make sure the model works by setting max_context to 4k or 8k, then increase until you get an error, which will be the max supported by the resources on your system.
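
Something like the following back-off loop can find that limit for you. This is only a sketch: it assumes LME_LoadModel takes the model path, the max context, and a GPU-layer count, and returns a Boolean success flag, which may not match the actual LMEngine signature.

    // Sketch only: LME_LoadModel's exact signature/return type is assumed here.
    const
      CModel = 'Meta-Llama-3-8B-Instruct-64k.Q4_K_M.gguf';  // example local model file
    var
      LMaxContext: UInt32;
    begin
      LMaxContext := 32768;                 // start at the model's advertised context
      while LMaxContext >= 4096 do
      begin
        // third argument 0 follows the CPU-only call shown later in this thread
        if LME_LoadModel(CModel, LMaxContext, 0) then
        begin
          WriteLn('Model loaded with max_context = ', LMaxContext);
          Break;
        end;
        LMaxContext := LMaxContext div 2;   // halve and retry until it fits in available RAM
      end;
    end;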

And note that running on CPU will be slow as inference is very mathematically intensive.

Which model are you trying to use?

avitos commented 6 months ago

I'm using the Meta-Llama-3-8B-Instruct-64k.Q4_K_M model.

Okay, I got it. I have 32 GB of RAM, with no more than 8 GB occupied by other programs. Experimentally, I found max_tokens=28000 to be acceptable.

But if I set max_tokens to that acceptable value and try to run the text from the file, the executable first fills RAM up to 5.9 GB. At about 4 GB an error appears: GGML_ASSERT: cpp\llama\llama.cpp:11616: n_tokens_all <= cparams.n_batch

After reaching 5.9 GB, RAM usage drops to 2.4 GB and the console application closes after a few seconds.

This error probably indicates something.

jarroddavis68 commented 6 months ago

GGML_ASSERT: cpp\llama\llama.cpp:11616: n_tokens_all <= cparams.n_batch

Yeah, I ran into this yesterday myself. That assert fires when more prompt tokens are submitted in a single decode call than the configured batch size (n_batch) allows. I'm in the process of trying to sort it out.

jarroddavis68 commented 6 months ago

test.zip

I've been experimenting with a new API. Try this and see if it works better. This is a very small subset, just enough to test the new inference code.

avitos commented 6 months ago

Yes, the model now loads and works. The only thing is that I switched to the CPU and specified the correct number of context tokens in LME_LoadModel(CModel, 28000, 0); As an example, I fed it a file containing the first articles of the U.S. Constitution to get the gist of them. The result is excellent, as you can see from the output.
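
For clarity, here is the same call with the arguments annotated. This reflects my understanding of the parameters from testing; the exact meanings may differ.

    // My reading of the arguments (assumptions, not confirmed against the API docs):
    LME_LoadModel(
      CModel,   // path to Meta-Llama-3-8B-Instruct-64k.Q4_K_M.gguf
      28000,    // max context in tokens -- the largest value that fits in my 32 GB of RAM
      0         // GPU layers to offload; 0 keeps inference entirely on the CPU
    );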

Great! I'm really looking forward to the API update!

(Screenshot attached: Screenshot_390)

jarroddavis68 commented 6 months ago

Wonderful, glad to hear. Thanks! I will continue working on this. 👍🏿

jarroddavis68 commented 5 months ago

test.zip

Ok, here is a full build that you can test if you wish. I'm wrapping things up for a repo update soon. The testbeds show how to use most of the features.

avitos commented 5 months ago

Thank you. I tested it and all the functions now work well!

jarroddavis68 commented 5 months ago

Thank you. I tested it and all the functions now work well!

Happy to hear. 👍🏿 Note, I've now released the update to the repo, so make sure you grab that one.