mit-han-lab / TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library
https://mit-han-lab.github.io/TinyChatEngine/
MIT License

Allocation of 'float inputs_embeds_buf[]' in Int4llamaDecoder::forward() causes Segmentation Fault for inputs longer than 511 tokens #88

Open paulleo13 opened 5 months ago

paulleo13 commented 5 months ago

I have been playing around with your awesome implementation and found the following bug:

When I call `LLaMaGenerate()` with a prompt longer than 511 tokens (it might also be a character limit; I only counted tokens for simplicity), the subsequent call to `Int4llamaDecoder::forward()` causes a segmentation fault when `inputs_embeds_buf` is allocated on line 71 of `llama/TinyChatEngine/llm/src/nn_modules/noncuda/Int4llamaDecoder.cc`. I believe the issue is the stack allocation, which gets too large for long prompts: most systems cap stack growth, and a stack allocation of a few MiB can already exceed that cap, which is what happens here. (This would also explain the 511-token threshold: with `embed_dim = 4096`, as in LLaMA-7B, the buffer for 512 tokens is 512 × 4096 × 4 B = 8 MiB, exactly the common default stack limit on Linux.)

My fix: define a vector of the required size instead, so the buffer lives on the heap, i.e. `std::vector<float> inputs_embeds_buf_vec(sqlen * this->embed_dim);`, and pass its data pointer to the `Matrix3D<float>` object on the next line: `Matrix3D<float> inputs_embeds(inputs_embeds_buf_vec.data(), 1, sqlen, this->embed_dim);`.
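
For clarity, here is the change as a sketch. The "before" lines are paraphrased from my reading of the source; only the identifiers `inputs_embeds_buf`, `sqlen`, and `this->embed_dim` come from it directly:

```cpp
// In Int4llamaDecoder::forward() -- requires #include <vector> at the
// top of Int4llamaDecoder.cc.

// Before (paraphrased): a variable-length array on the stack. For
// sqlen >= 512 with embed_dim = 4096, this is >= 8 MiB and overflows
// the default stack limit on many systems.
//   float inputs_embeds_buf[sqlen * this->embed_dim];
//   Matrix3D<float> inputs_embeds(inputs_embeds_buf, 1, sqlen, this->embed_dim);

// After: allocate on the heap via std::vector. The vector owns the
// memory and frees it when it goes out of scope, so the lifetime is
// the same as the original stack buffer's.
std::vector<float> inputs_embeds_buf_vec(sqlen * this->embed_dim);
Matrix3D<float> inputs_embeds(inputs_embeds_buf_vec.data(), 1, sqlen, this->embed_dim);
```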

It has worked in all my test cases so far. Should I open a pull request?

Edit: I don't know how reproducible this is, since the stack size limit is platform-dependent, according to this Stack Overflow answer: https://stackoverflow.com/a/1826072
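
If anyone wants to check what their system allows before trying to reproduce this, here is a minimal standalone POSIX sketch (not part of TinyChatEngine) that prints the soft stack limit, which defaults to 8 MiB on many Linux distributions:

```cpp
// Print the current soft stack size limit (POSIX only).
#include <sys/resource.h>
#include <cstdio>

int main() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        std::perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY) {
        std::printf("soft stack limit: unlimited\n");
    } else {
        std::printf("soft stack limit: %llu bytes\n",
                    (unsigned long long)rl.rlim_cur);
    }
    return 0;
}
```

The same value is reported (in KiB) by `ulimit -s` in a shell, which also lets you raise the limit per session as a workaround until the allocation is moved to the heap.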