The 20B and 34B models use a learned position embedding matrix of fixed size (8192), and the error you are seeing is the one thrown by the CUDA kernel when you try to read from that embedding matrix at an index outside the range [0, 8192).
A fix for this is to modify the model on disk by manually expanding the position embedding matrix.
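For intuition, here is a minimal toy sketch (the sizes are made up, not taken from the model) of why a position index past the embedding table fails:

import torch

# A learned position table is just a fixed-size embedding; positions index its rows.
wpe = torch.nn.Embedding(8192, 128)

ok = wpe(torch.tensor([8191]))   # last valid position, works
bad = wpe(torch.tensor([9000]))  # index >= 8192: IndexError on CPU, device-side assert on CUDA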
@mayank31398 thanks for the answer, that makes sense. Do you have a gist or code snippet you can point me to showing how to do that?
(For other architectures, just setting:
model.config.max_position_embeddings = 10000  # instead of 8192
is enough to increase the context.)
No, but I think you can do something like the following:
state_dict = model.state_dict()
weight = state_dict["transformer.wpe.weight"]  # learned position embeddings, shape [8192, hidden_size]
# Expand the position axis here, e.g. tile the matrix to [16384, hidden_size];
# linear interpolation along the position axis is another option.
weight = weight.repeat(2, 1)
state_dict["transformer.wpe.weight"] = weight
# The new state_dict can then be loaded into a model whose config has a higher
# number of position embeddings.
You will need to do some manual work.
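As an end-to-end illustration (a sketch, not an official recipe: the checkpoint id, the 16384 target length, and the choice of interpolation are assumptions), the surgery could look roughly like this:

import torch
from transformers import AutoConfig, AutoModelForCausalLM

REPO = "ibm-granite/granite-34b-code-instruct"  # assumed checkpoint id
NEW_CTX = 16384                                 # assumed target context length

config = AutoConfig.from_pretrained(REPO)
model = AutoModelForCausalLM.from_pretrained(REPO, torch_dtype=torch.bfloat16)

state_dict = model.state_dict()
wpe = state_dict["transformer.wpe.weight"]      # [8192, hidden_size]

# Linearly interpolate the learned table along the position axis to the new length.
expanded = torch.nn.functional.interpolate(
    wpe.float().T.unsqueeze(0), size=NEW_CTX, mode="linear", align_corners=False
).squeeze(0).T.to(wpe.dtype)
state_dict["transformer.wpe.weight"] = expanded

# Re-instantiate with the larger context and load the modified weights.
config.max_position_embeddings = NEW_CTX
new_model = AutoModelForCausalLM.from_config(config)
new_model.load_state_dict(state_dict)
new_model.save_pretrained("granite-34b-code-instruct-16k")

In practice this needs enough CPU RAM to hold the 34B weights, and tiling or interpolating the table only gives a starting point: the model still has to be fine-tuned on long sequences for the new positions to become meaningful.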
For other architectures you don't even need to do that: they use RoPE, which has no learnable position matrix, so the context can theoretically be extended indefinitely.
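For example (a sketch; the model id below is a placeholder for any RoPE-based, Llama-style checkpoint), increasing the context there is just a config change:

from transformers import AutoConfig, AutoModelForCausalLM

repo = "some-org/rope-based-model"  # placeholder id

config = AutoConfig.from_pretrained(repo)
config.max_position_embeddings = 16384  # no weight surgery needed for RoPE
model = AutoModelForCausalLM.from_pretrained(repo, config=config)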
ok, thanks a lot
Hi, and thanks for the code models.
I am having trouble fine-tuning "granite-34b-code-instruct" with a larger context. My script is fairly robust, and I was able to increase the context for other common models (see https://pr-agent-docs.codium.ai/finetuning_benchmark/).
However, when I try to increase the context:
model.config.max_position_embeddings = 10000  # instead of 8192
for granite-34B, I consistently get CUDA indexing errors.
When investigating a bit, I see that the larger models use the 'GPTBigCodeForCausalLM' architecture, unlike the smaller models, which follow the more common 'LlamaForCausalLM'.
Do you have tips on what is needed to fine-tune the granite-34B model with an increased context?