1) This issue seems old — hasn't it been fixed yet? See https://github.com/flairNLP/flair/issues/37#issuecomment-621763176. Did you try increasing the batch size?
2) "CLS pooling" refers to using last hidden state of the [CLS] token as embedding for the whole sequence (BERT uses this [CLS] token but GPT not). For GPT you can take for example the mean over all last hidden states (mean pooling, but exclude padded tokens) or the embedding of the EOS token.
Generally, you need to distinguish between document-level tasks (like classification), where you need a single embedding for the whole document, and token-level tasks (sequence tagging like NER), where you need an embedding for each token.
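A minimal sketch of what mean pooling over the last hidden states could look like with the HF transformers API (the model name and variable names here are only illustrative, not from this issue):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# GPT-2 has no pad token by default, so reuse EOS for padding in the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained("gpt2")

texts = ["ein kurzer Satz", "ein etwas längerer Beispielsatz"]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    last_hidden = model(**batch).last_hidden_state   # (batch, seq_len, hidden)

# Mean over the sequence dimension, excluding padded positions via the attention mask.
mask = batch["attention_mask"].unsqueeze(-1)          # (batch, seq_len, 1)
doc_embeddings = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, hidden)
```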
3) See my notebook. You can set the tokenizer's pad token with `embeddings.tokenizer.pad_token = embeddings.tokenizer.eos_token`.
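In flair this could look roughly like the following sketch (the model name is just an example):

```python
from flair.embeddings import TransformerDocumentEmbeddings

# Reuse the EOS token as pad token for a GPT-style model that has no pad token.
embeddings = TransformerDocumentEmbeddings("gpt2", fine_tune=True)
embeddings.tokenizer.pad_token = embeddings.tokenizer.eos_token
```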
PS: If you feel more confident implementing everything directly in HF transformers, you could do that too. But imho flair should be easier.
This issue collects some findings obtained while fine-tuning the German model for the classification and NER tasks.
Flair is not able to fully utilize the capacity of the GPU (even with multiple GPUs). This is a known issue (source: https://github.com/flairNLP/flair/issues/37), hence fine-tuning is very slow.
The pooling strategy is controlled by the subtoken_pooling parameter of TransformerWordEmbeddings. The supported values are first, last, mean, and first_last. If we want to pool over the [CLS] token of a transformer, it is recommended to take the first token of the last layer, since the first token of the last layer is exactly [CLS]. Source: https://www.kaggle.com/code/rhtsingh/utilizing-transformer-representations-efficiently
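For reference, a short sketch of how these parameters can be passed in flair (the model name is only an example):

```python
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    "bert-base-german-cased",
    layers="-1",               # use only the last transformer layer
    subtoken_pooling="first",  # keep the first subtoken of each word
    fine_tune=True,
)
```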
The transformer embeddings might require setting the `pad_token`. The recommended way is to set it explicitly, using something like `embeddings.tokenizer.add_special_tokens({'pad_token': '[PAD]'})`. If it is not set correctly, the fine-tuning may start, but the execution will stop after the first iteration and the weights will not be updated at all.
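A sketch of this second option (adding a dedicated [PAD] token), assuming the underlying HF model is reachable as `embeddings.model` and using an example model name:

```python
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings("gpt2", fine_tune=True)
embeddings.tokenizer.add_special_tokens({"pad_token": "[PAD]"})
# A newly added token usually requires resizing the model's token embedding matrix:
embeddings.model.resize_token_embeddings(len(embeddings.tokenizer))
```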