okpatil4u closed this issue 1 year ago
Hi, classification should be just another linear layer at the end when the embeddings are done.
In the model conversion script, replace the AutoModel with BertForSequenceClassification (or whichever class you are using). During this step, check the name of the classification layer and modify the C++ code to load and rename that layer.
Then, in the model eval code, take out the mean pooling and embedding normalization and add a multiplication with the classification matrix. I think for classification you take the embedding of the first token and discard everything else.
Thanks. Apparently BERT uses tanh for the pooler. In your code you have defined the pooler as follows:
```cpp
// pooler
struct ggml_tensor *sum = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, N, 1);
ggml_set_f32(sum, 1.0f / N);
inpL = ggml_mul_mat(ctx0, inpL, sum);

// normalizer
ggml_tensor *length = ggml_sqrt(ctx0,
    ggml_sum(ctx0, ggml_sqr(ctx0, inpL)));
inpL = ggml_scale(ctx0, inpL, ggml_div(ctx0, ggml_new_f32(ctx0, 1.0f), length));

ggml_tensor *output = inpL;
```
This makes sense if you only need to calculate the embeddings. But in my case there is a tanh activation after the pooler: I have to load the pooler, apply tanh, apply dropout, and then apply the classification layer. This is what my code looks like:
```cpp
// pooler
inpL = ggml_mul_mat(ctx0, model.pooler_w, inpL);
inpL = ggml_add(ctx0, ggml_repeat(ctx0, model.pooler_b, inpL), inpL);

// apply tanh activation
inpL = ggml_tanh(ctx0, inpL);

// apply dropout
inpL = ggml_dropout(ctx0, inpL, model.dropout);

// classifier layer
inpL = ggml_mul_mat(ctx0, model.classifier_w, inpL);
inpL = ggml_add(ctx0, ggml_repeat(ctx0, model.classifier_b, inpL), inpL);
```
Is this correct? Apparently there are no `ggml_tanh` and `ggml_dropout` functions in the ggml library. Do I have to implement them from scratch? Also, I didn't find dropout anywhere in the code, although it is mentioned in the model specification. Is it calculated implicitly?
Dropout is only used during training, so you can just ignore it when implementing inference.
Tanh is indeed missing from ggml. You could either implement it as an operator in ggml or use `ggml_map_unary_f32` to wrap tanh into a tensor function.
Thanks. This is very useful.
Just for context, I am trying to convert prajjwal1/bert-tiny. It has a layer `bert.embeddings.position_ids`. After reading this layer and before reading `bert.embeddings.word_embeddings.weight`, the following code gives a segmentation fault:
```cpp
int64_t nelements = 1;
int64_t ne[2] = {1, 1};
for (int i = 0; i < n_dims; ++i)
{
    int32_t ne_cur;
    fin.read(reinterpret_cast<char *>(&ne_cur), sizeof(ne_cur));
    ne[i] = ne_cur;
    nelements *= ne[i];
}
```
Am I supposed to load position_ids in some different way? Could you give this model a try, by any chance?
position_ids is just a vector with indices 0, 1, 2, 3, ... In bert.cpp it's here: https://github.com/skeskinen/bert.cpp/blob/master/bert.cpp#L791
For a CPU implementation it doesn't really make sense to store it as a parameter of the model. It's done that way in PyTorch for GPU-memory-related reasons. But it's just a for loop.
Sure. But if I skip them, I get the following error:

```
bert_load_from_file: unknown tensor 'bert.embeddings.position_ids' in model file
```

If I don't, then I get the segmentation fault. Am I missing something?
While creating the model file, don't store the tensors you don't load. Or modify the code to ignore the error.
Got it. You had skipped them using "embeddings.position_ids"; I had to change it to "bert.embeddings.position_ids" in the ggml conversion script.
Can I skip the pooler weights as well, as you mentioned in your first reply? Or do I have to implement them, including tanh?
The pooler weights are skipped in bert.cpp because sbert.net doesn't use the vanilla BERT pooler. For classification, I think you should use the vanilla pooler, so that includes the learned pooler weights and the tanh step.
Thanks man !
I was able to get it working by temporarily using relu instead of tanh:
```cpp
inpL = ggml_mul_mat(ctx0, model.pooler_w, inpL);
inpL = ggml_add(ctx0, ggml_repeat(ctx0, model.pooler_b, inpL), inpL);

// tanh activation, temporarily replaced with relu
inpL = ggml_relu(ctx0, inpL);

inpL = ggml_mul_mat(ctx0, model.classifier_w, inpL);
inpL = ggml_add(ctx0, ggml_repeat(ctx0, model.classifier_b, inpL), inpL);

ggml_tensor *output = inpL;

// run the computation
ggml_build_forward_expand(&gf, output);
ggml_graph_compute(ctx0, &gf);
```
Does this seem like correct code? It at least works with your server and client implementations. I will implement tanh later, after your comments.
The code looks at least reasonable; it's probably close to right. Compare the outputs with the original, I guess?
The only thing I'd check is that, afaik, `inpL` at the beginning should be just the embedding of the first token (the CLS or BOS token). Maybe you do that already but it's not shown here.
Or maybe I'm wrong; like I said, check the original code.
I am not sure. I replaced these lines with the code above:
```cpp
inpL = ggml_cont(ctx0, ggml_transpose(ctx0, inpL));

// pooler
struct ggml_tensor *sum = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, N, 1);
ggml_set_f32(sum, 1.0f / N);
inpL = ggml_mul_mat(ctx0, inpL, sum);

// normalizer
ggml_tensor *length = ggml_sqrt(ctx0,
    ggml_sum(ctx0, ggml_sqr(ctx0, inpL)));
inpL = ggml_scale(ctx0, inpL, ggml_div(ctx0, ggml_new_f32(ctx0, 1.0f), length));

ggml_tensor *output = inpL;

// run the computation
ggml_build_forward_expand(&gf, output);
ggml_graph_compute(ctx0, &gf);
```
Is this correct?
It is giving correct results. Thank you for your help!
@okpatil4u I want to use classification models as well, are your changes public? I would like to work with them.
@okpatil4u pinging you again: I want to use classification models as well, are your changes public? I would like to work with them.
The reason I didn't make it public is that it doesn't work: there is at least a 25% accuracy drop compared with the same model in huggingface/transformers. We ended up rebuilding it in huggingface/candle from scratch. You are welcome to try that.
Thank you for the update.
@okpatil4u
Would you mind sharing the source modifications you made? I would like to troubleshoot.
Is it possible to retrofit a BERT classification model into this code?
Could you please give me some guidelines so that I can take care of it myself?
Thanks in advance.