ibm-granite / granite-code-models

Granite Code Models: A Family of Open Foundation Models for Code Intelligence
https://huggingface.co/collections/ibm-granite/granite-code-models-6624c5cec322e4c148c8b330
Apache License 2.0

finetuning with larger context #16

Closed · mrT23 closed 4 months ago

mrT23 commented 4 months ago

Hi, and thanks for the code models.

I am having trouble fine-tuning "granite-34b-code-instruct" with a larger context. My script is quite robust, and I was able to increase the context for other common models (see https://pr-agent-docs.codium.ai/finetuning_benchmark/).

However, when I try to increase the context with:

model.config.max_position_embeddings = 10000  # instead of 8192

for granite-34B, I consistently get errors like:

../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [120,0,0], thread: [96,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

When investigating a bit, I see that the larger models use the 'GPTBigCodeForCausalLM' architecture, unlike the smaller models, which follow the (more common) 'LlamaForCausalLM'.

Do you have tips on what is needed to fine-tune the granite-34B model with a larger context?

mayank31398 commented 4 months ago

The 20B and 34B models use a learned position embedding matrix of fixed size (8192). The error you are seeing is thrown by the CUDA kernel when you try to index the embedding matrix outside the range [0, 8192).

A fix for this is to modify the model on disk by manually expanding the position embedding matrix.
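For intuition, the failure can be reproduced in isolation with a plain nn.Embedding; this is a standalone sketch of the mechanism, not the Granite code path:

```python
import torch

# A fixed-size position embedding is just an nn.Embedding table: any position
# index >= its size is out of range.
wpe = torch.nn.Embedding(8192, 64)
positions = torch.arange(10000)  # longer than the 8192-entry table

try:
    wpe(positions)  # raises IndexError on CPU; on CUDA this surfaces as the
                    # indexSelectLargeIndex device-side assertion quoted above
except IndexError as err:
    print(err)      # "index out of range in self"
```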

mrT23 commented 4 months ago

@mayank31398 thanks for the answer, that makes sense. Do you have a gist or code snippet somewhere you can refer me to that shows how to do that?

(For other architectures, just setting:

model.config.max_position_embeddings = 10000  # instead of 8192

is enough to increase the context.)

mayank31398 commented 4 months ago

No, but I think you can do something like the following:

state_dict = model.state_dict()
weight = state_dict["transformer.wpe.weight"]  # shape [8192, hidden_size]
# Modify the shape here: e.g. repeat the matrix to get [16384, hidden_size],
# or use linear interpolation along the position dimension.
weight = weight.repeat(2, 1)
state_dict["transformer.wpe.weight"] = weight
# The new state_dict can be loaded into a model whose config has a higher
# number of position embeddings.

You will need to do some manual work.
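Putting the pieces together, a minimal end-to-end sketch could look like the following. It assumes the checkpoint layout above (the "transformer.wpe.weight" key), that n_positions is the GPTBigCode config field controlling context length, and that doubling by repetition is acceptable; the output path is just an example:

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "ibm-granite/granite-34b-code-instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Expand the learned position embedding from 8192 to 16384 rows.
state_dict = model.state_dict()
state_dict["transformer.wpe.weight"] = state_dict["transformer.wpe.weight"].repeat(2, 1)

# Build a config that allows the longer context (GPTBigCode stores it as n_positions).
config = AutoConfig.from_pretrained(model_id)
config.n_positions = 16384

# Load the modified weights into a fresh model and save it for fine-tuning.
new_model = AutoModelForCausalLM.from_config(config)
new_model.load_state_dict(state_dict)
new_model.save_pretrained("granite-34b-code-instruct-16k")  # example output path
```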

mayank31398 commented 4 months ago

For other architectures, you don't even need to do that: they use RoPE, which has no learnable position embedding matrix, so the context can theoretically be extended indefinitely.
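For the RoPE-based (Llama-architecture) Granite models, a minimal sketch of extending the context is just a config change, using the 8B instruct checkpoint as an example; the commented-out rope_scaling entry is one common extra technique when going far beyond the pretraining length, not something required here:

```python
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "ibm-granite/granite-8b-code-instruct"  # Llama-architecture Granite model
config = AutoConfig.from_pretrained(model_id)
config.max_position_embeddings = 16384  # RoPE has no learned table to resize

# Optional: RoPE scaling for large extensions (supported by recent transformers).
# config.rope_scaling = {"type": "linear", "factor": 2.0}

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```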

mrT23 commented 4 months ago

ok, thanks a lot