Open · dangal95 opened this issue 5 years ago
We have been trying to do the same thing. One thing we tried is tagging the sequences after wordpiece tokenization. So in our case

```
Jim    Hen    ##son  was  a  puppet  ##eer
B-PER  I-PER  X      O    O  O       X
```

becomes

```
Jim    Hen    ##son  was  a  puppet  ##eer
B-PER  I-PER  I-PER  O    O  O       O
```
And while decoding, we merge the tags of a token's subtokens like this:
```python
def convert_to_original_length(sentence, tags):
    # `tokenizer` is the BERT wordpiece tokenizer; `tags` has one entry per wordpiece.
    r = []        # merged words
    r_tags = []   # one tag per merged word (taken from its first wordpiece)
    for index, token in enumerate(tokenizer.tokenize(sentence)):
        if token.startswith("##"):
            if r:
                # continuation piece: glue it onto the previous word
                r[-1] = f"{r[-1]}{token[2:]}"
        else:
            r.append(token)
            r_tags.append(tags[index])
    return r_tags
```
We found this labeling scheme to work better than taking the tag of the first subtoken only.
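For example, a hypothetical call (assuming `tokenizer` is the BERT wordpiece tokenizer and it splits the sentence exactly as in the example above):

```python
sentence = "Jim Henson was a puppeteer"
# one predicted tag per wordpiece: ["Jim", "Hen", "##son", "was", "a", "puppet", "##eer"]
tags = ["B-PER", "I-PER", "I-PER", "O", "O", "O", "O"]

print(convert_to_original_length(sentence, tags))
# -> ["B-PER", "I-PER", "O", "O", "O"]   # one tag per original word
```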
@dangal95
I had a very similar problem before. In my case, I needed to integrate BERT embeddings with GloVe, ELMo, and word embeddings from a CNN. There are many possible ways to align them.
Then, how do you compute a pooled embedding from a series of token embeddings?
If you seek fine-tuning, the method mentioned by @ashutoshsingh0223 will be better.
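For the pooling question, one simple option is to mean-pool the wordpiece vectors that belong to each original word. A minimal sketch, assuming `wordpiece_vectors` holds one embedding per wordpiece and relying on the "##" prefix (which, as discussed further down, is not always sufficient):

```python
import numpy as np

def pool_word_embeddings(wordpiece_vectors, wordpiece_tokens):
    """Mean-pool each word's wordpiece vectors into a single word vector.

    wordpiece_vectors: sequence of arrays, one per wordpiece (e.g. BERT outputs)
    wordpiece_tokens:  the corresponding wordpiece strings, e.g. ["Jim", "Hen", "##son", ...]
    """
    groups = []
    for vec, tok in zip(wordpiece_vectors, wordpiece_tokens):
        if tok.startswith("##") and groups:
            groups[-1].append(vec)      # continuation piece: same word as before
        else:
            groups.append([vec])        # first piece of a new word
    return np.stack([np.mean(g, axis=0) for g in groups])
```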
@ashutoshsingh0223 How did you handle the [SEP] and [CLS] tokens?
I tried to adopt this tagging scheme, but I ran into situations that are not trivial to handle. The problem is that the correspondence between the (sub)word indexes[^1] is hard to get. The "##" prefix is the usual hint for a split, but it is not sufficient: some words are split into pieces that carry no "##" prefix at all, e.g. "1996-08-22" => "1996", "-", "08", "-", "22".
[^1]: Here we use a list of index pairs to express the correspondence; each index in a pair marks where the same word starts in the original sequence and in the wordpiece sequence, respectively.
So I used another trick that does not rely on pattern matching: tokenize every word of the original sentence separately and collect the correspondence information:
```python
def get_index_correspondence(sent, tokenizer):
    """
    Due to cases like "1996-08-22" => "1996", "-", "08", "-", "22",
    we need an exact position correspondence, e.g.
        A = ["Brussels", "1996-08-22"]
        B = ["br", "##us", "##se", "##ls", "1996", "-", "08", "-", "22"]
    """
    correspondence = [(0, 0)]
    for word in sent:
        raw_end, expand_end = correspondence[-1]
        # each pair is (position in the original word sequence,
        #               position in the wordpiece sequence) where a word starts
        correspondence.append((raw_end + 1, expand_end + len(tokenizer.tokenize(word))))
    return correspondence
```
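For example, with the sentence from the docstring (assuming an uncased tokenizer that splits the words exactly as shown there):

```python
sent = ["Brussels", "1996-08-22"]
print(get_index_correspondence(sent, tokenizer))
# -> [(0, 0), (1, 4), (2, 9)]
# i.e. word 0 starts at wordpiece 0, word 1 starts at wordpiece 4,
# and the full wordpiece sequence has length 9
```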
I am trying to do multi-class sequence classification using the BERT uncased base model and tensorflow/keras. However, I have an issue when it comes to labeling my data after BERT's wordpiece tokenization: I am unsure how I should modify my labels following the tokenization procedure.
I have read several open and closed issues on GitHub about this problem, and I've also read the BERT paper published by Google. Specifically, section 4.3 of the paper explains how to adjust the labels, but I'm having trouble translating it to my case. I've also read the official BERT repository README, which has a section on tokenization and mentions how to create a dictionary that maps the original tokens to the new tokens, and that this can be used as a way to project my labels.
I have used the code provided in the README and managed to create labels in the way I think they should be. However, I am not sure if this is the correct way to do it. Below is an example of a tokenized sentence and its labels before and after using the BERT tokenizer. Just a side note: I have adjusted some of the code in the tokenizer so that it does not split certain words on punctuation, as I would like them to remain whole.
This is the code to create the mapping:
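Essentially the `orig_to_tok_map` example from the README's tokenization section, shown here as a sketch (`tokenizer` is assumed to be the BERT `FullTokenizer`, and the input tokens are the README's illustrative ones):

```python
orig_tokens = ["John", "Johanson", "'s", "house"]   # illustrative input from the README

bert_tokens = ["[CLS]"]
orig_to_tok_map = []   # maps each original token index to the index of its first wordpiece
for orig_token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))
    bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# bert_tokens     == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]
```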
Using the mapping, I adjust my label array so that it looks like the following:
Following this, I add padding labels (let's say that the maximum sequence length is 12), so finally my label array looks like this:
As you can see, since the last token (labeled 1) was split into two pieces, I now label both word pieces as '1'.
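A sketch of that projection step, using an `orig_to_tok_map`-style mapping as above and assuming that 5 is the padding label (here it is also used for the [CLS]/[SEP] positions, which may differ from your setup):

```python
def project_labels(orig_labels, orig_to_tok_map, num_bert_tokens, max_len, pad_label=5):
    """Repeat each original label over all of its word pieces, then pad to max_len.

    Sketch only: pad_label is also used for the [CLS], [SEP] and padding positions.
    """
    bert_labels = [pad_label] * max_len
    for i, start in enumerate(orig_to_tok_map):
        # word i's pieces span [start, next word's start); the last word ends before [SEP]
        end = orig_to_tok_map[i + 1] if i + 1 < len(orig_to_tok_map) else num_bert_tokens - 1
        for j in range(start, end):
            bert_labels[j] = orig_labels[i]
    return bert_labels
```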
I am not sure if this is correct. In section 4.3 of the paper the continuation word pieces are labelled as 'X', but I'm not sure if this is what I should also do in my case. So in the paper (https://arxiv.org/abs/1810.04805) the following example is given:
My final goal is to input a sentence into the model and get back an array which can look something like [5, 0, 0, 1, 1, 2, 3, 4, 5, 5, 5, 5], i.e. one label per word piece. Then I can merge the word pieces back into words to recover the original length of the sentence, and therefore the shape the predictions should actually have.
Also, another option (following the section 4.3 example from the paper) would be to introduce a new label (say number '6') that is used for the continuation word pieces. So my label array would look like:
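A sketch of that variant, reusing the same mapping: only the first word piece keeps the original label, and the continuation pieces get the extra label (here 6, a placeholder value):

```python
def project_labels_with_x(orig_labels, orig_to_tok_map, num_bert_tokens,
                          max_len, pad_label=5, x_label=6):
    """Variant of the projection above: continuation pieces get a dedicated
    'X'-style label instead of repeating the word's label."""
    bert_labels = [pad_label] * max_len
    for i, start in enumerate(orig_to_tok_map):
        end = orig_to_tok_map[i + 1] if i + 1 < len(orig_to_tok_map) else num_bert_tokens - 1
        bert_labels[start] = orig_labels[i]       # first piece keeps the word's label
        for j in range(start + 1, end):
            bert_labels[j] = x_label              # remaining pieces get the 'X' label
    return bert_labels
```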
After training the model for a couple of epochs I attempt to make predictions and get weird values. For example, a word gets marked with the label '5', which is reserved for padding, and padding positions get marked with the label '1'. This makes me think that there is something wrong with the way I create the labels. Initially I did not adjust the labels at all, i.e. I left them as they were even after tokenizing the original sentence. This did not give me good results.
Any help would be greatly appreciated, as I've been trying hard to find an answer online but haven't been able to figure it out yet. Thank you in advance!
Also, the following is the code I use to create my model: