I have been playing around with your awesome implementation and found the following bug:
When I call LLaMaGenerate() with a prompt longer than 511 tokens (the limit might actually be in characters; I only counted tokens for simplicity), the subsequent call to Int4llamaDecoder::forward() segfaults when inputs_embeds_buf is allocated on line 71 of _llama/TinyChatEngine/llm/src/nn_modules/noncuda/Int4llamaDecoder.cc.
I believe the cause is that this buffer is allocated on the stack and becomes too large for long prompts. Most systems cap the stack size, and a stack allocation of a few MBytes can exceed that cap, which appears to be what happens here and triggers the segmentation fault.
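To make the size argument concrete, here is a rough sketch of the arithmetic. It assumes embed_dim is 4096 (the LLaMA 7B hidden size; I have not confirmed which model was loaded here) and a POSIX system where getrlimit() is available. The helper names are made up for illustration and are not part of TinyChatEngine. Under that assumption, 512 tokens need exactly 8 MiB, which is a common default stack limit on Linux, and that would line up with the crash starting just past 511 tokens.

```cpp
#include <cstddef>
#include <sys/resource.h>  // POSIX: getrlimit, RLIMIT_STACK

// Bytes needed for a float buffer of shape (1, sqlen, embed_dim).
// Hypothetical helper, not part of TinyChatEngine.
std::size_t embeds_buffer_bytes(std::size_t sqlen, std::size_t embed_dim) {
    return sqlen * embed_dim * sizeof(float);
}

// Soft stack limit of the current process in bytes.
// Returns 0 if the limit cannot be read or is unlimited.
std::size_t stack_limit_bytes() {
    struct rlimit lim {};
    if (getrlimit(RLIMIT_STACK, &lim) != 0 || lim.rlim_cur == RLIM_INFINITY) {
        return 0;
    }
    return static_cast<std::size_t>(lim.rlim_cur);
}

// True if the buffer alone would already fill or exceed the stack limit.
// The real threshold is somewhat lower, since other frames use stack too.
bool buffer_exceeds_stack(std::size_t sqlen, std::size_t embed_dim) {
    const std::size_t limit = stack_limit_bytes();
    return limit != 0 && embeds_buffer_bytes(sqlen, embed_dim) >= limit;
}
```

Calling buffer_exceeds_stack(sqlen, 4096) for a few prompt lengths on the affected machine should show whether the threshold matches the observed crash; the limit itself can also be checked with ulimit -s in the shell.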
I have the following fix: instead of the stack buffer, define a std::vector of the required size, which keeps its elements on the heap: std::vector<float> inputs_embeds_buf_vec(sqlen * this->embed_dim); and then pass its data pointer to the Matrix3D<float> object on the next line: Matrix3D<float> inputs_embeds(inputs_embeds_buf_vec.data(), 1, sqlen, this->embed_dim);
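As a standalone sketch of that change: the Matrix3D below is a simplified stand-in I wrote for illustration (a non-owning view over a pointer), not the project's actual class, and touch_last_embedding is a hypothetical function used only to exercise the buffer. Only the two lines creating inputs_embeds_buf_vec and inputs_embeds correspond to the proposed fix.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for the project's Matrix3D: a non-owning 3-D view.
// Assumption for illustration only; the real class may differ.
template <typename T>
struct Matrix3D {
    Matrix3D(T* data, int dim_x, int dim_y, int dim_z)
        : m_data(data), m_dim_x(dim_x), m_dim_y(dim_y), m_dim_z(dim_z) {}

    // Row-major element access.
    T& operator()(int x, int y, int z) {
        return m_data[(static_cast<std::size_t>(x) * m_dim_y + y) * m_dim_z + z];
    }

    T* m_data;
    int m_dim_x, m_dim_y, m_dim_z;
};

// Hypothetical function: builds the embeddings buffer the way the fix
// proposes, writes to the last element through the view, and reads it back.
float touch_last_embedding(int sqlen, int embed_dim) {
    // Element storage lives on the heap; only the small vector object
    // itself sits on the stack. Elements are value-initialized to 0.0f.
    std::vector<float> inputs_embeds_buf_vec(static_cast<std::size_t>(sqlen) * embed_dim);

    // The view borrows the vector's storage and must not outlive it.
    Matrix3D<float> inputs_embeds(inputs_embeds_buf_vec.data(), 1, sqlen, embed_dim);

    inputs_embeds(0, sqlen - 1, embed_dim - 1) = 1.0f;
    return inputs_embeds_buf_vec.back();
}
```

One trade-off to be aware of: the vector zero-fills its elements and allocates on every call, so if forward() is hot, reusing a member buffer that is resized as needed might be preferable.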
It has worked in all of my test cases so far. Should I open a pull request?
Edit: I don't know how reproducible this is, as the stack growth limit is architecture-dependent according to this Stack Overflow answer: https://stackoverflow.com/a/1826072