r-wheeler opened this issue 4 years ago
Are you looking for a "de-tokenizer" to recombine wordpiece? We're actively working on an approach for that. Or do you mean "combine decisions made by my model on wordpieces back to the word level"? I'll add someone who can talk more about that to the bug.
We are not looking to de-tokenize but rather to pool (average|min|max) the subwords back to rank-3 word-level embeddings after sending them through BERT. It's not immediately clear how to keep track of the subword positions.
For example, for the given text:
tokenizer.tokenize('here are some tokens') # this is the bert tokenizer
This produces the tokens:
<tf.RaggedTensor [[[19353], [10301], [11152], [18436, 12457]]]>
There are 5 subword-level tokens, but the ragged tensor keeps track of the word level, indicating that there are 4 words split on whitespace. This information is lost in our pipeline due to
tokens.merge_dims(
inner_axis=2, outer_axis=1)[:, :self.max_seq_len - 2]
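For illustration, the word grouping can be read off before the flattening (a small sketch using the ids above, nothing official):

import tensorflow as tf

# The ragged ids from the tokenizer output above, shape [batch, (words), (wordpieces)]
tokens = tf.ragged.constant([[[19353], [10301], [11152], [18436, 12457]]])

# Number of wordpieces per word -- the grouping that merge_dims discards
pieces_per_word = tokens.row_lengths(axis=2)  # [[1, 1, 1, 2]]

# Flatten to [batch, (wordpieces)] for the model input, as in the snippet above
flat_tokens = tokens.merge_dims(outer_axis=1, inner_axis=2)  # [[19353, 10301, 11152, 18436, 12457]]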
After sending the tokens through BERT, we would like to have word-level vectors. This is useful for algorithms where the labels are at the word level (as opposed to the subword level), such as dependency parsing.
I was able to get the words with number of tokens by having the tokenizer return start/end offsets and then doing something like:
word_tok_lengths = starting_offsets.row_lengths(axis=2).to_tensor()
# get num_words as tensor from word_tok_lengths
word_ids = tf.range(0, num_words + 2) # for [CLS] and [SEP]
word_ids_per_token = tf.repeat(word_ids, word_tok_lengths)
pooled_by_word = tf.math.unsorted_segment_mean(subword_level_embeddings, word_ids_per_token, num_segments=num_words + 2)
This glosses over a lot of the padding handling, getting num_words, etc. It isn't ideal, though, because I wasn't able to do it as a fully vectorized operation: I had to loop over each element in the batch (and XLA doesn't work with tf.repeat because of an internal call to tf.where).
Please let me know if there's a more idiomatic way of doing this. The tf.math.unsorted_segment_foo functions work well for doing the pooling, but assigning the tokens to words isn't easy to do in a vectorized way.
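(For reference, one fully vectorized possibility, sketched here rather than taken from the library; it assumes sequence_output lines up piece-for-piece with the tokenizer output, i.e. [CLS]/[SEP] already stripped and no truncation. The trick is that the ragged tokenizer output already knows which word each piece belongs to via value_rowids(), so tf.repeat isn't needed.)

import tensorflow as tf

# tokens:          RaggedTensor [batch, (words), (pieces)] from BertTokenizer.tokenize
# sequence_output: Tensor [batch, max_pieces, dim] from BERT, padded at the end

word_id_per_piece = tokens.values.value_rowids()   # [total_pieces]; globally unique word id
num_words = tokens.values.nrows()                  # total number of words in the batch

# Drop the padded positions so the embeddings line up with word_id_per_piece
pieces_per_example = tokens.merge_dims(outer_axis=1, inner_axis=2).row_lengths()  # [batch]
ragged_emb = tf.RaggedTensor.from_tensor(sequence_output, lengths=pieces_per_example)

# Pool the pieces of each word (swap in unsorted_segment_max / _sum as needed)
pooled = tf.math.unsorted_segment_mean(ragged_emb.values,        # [total_pieces, dim]
                                       word_id_per_piece,
                                       num_segments=num_words)   # [total_words, dim]

# Regroup per example: [batch, (words), dim]; .to_tensor() gives the zero-padded dense version
word_level = tf.RaggedTensor.from_value_rowids(pooled, tokens.value_rowids())

Handling [CLS]/[SEP] would just mean slicing them out of sequence_output (e.g. sequence_output[:, 1:, :] for [CLS]) before the from_tensor call.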
@r-wheeler Perhaps you just want to use a different tokenizer? IIRC, the BertTokenizer is a regex split and some normalization followed by WordpieceTokenizer. Is it correct to say you want everything but the WordpieceTokenizer?
@tc-wolf Would you prefer the same thing, or is it that you want the wordpieces and to know how many words they represent?
Hey @broken, thanks for taking a look -- sorry for not being clear. We're attempting to use the BERT tokenizer, send the tokens through BERT, then pool BERT's rank-3 sequence_output back to the word level. Here pooling means taking the average|max|sum of the subword vectors within each word (not over all the words), using the sequence_output and the offsets returned by the BertTokenizer.
In pseudocode:
input_tensor = tf.constant(['taste the rustisc indiefrost'])
# use tf_text tokenizer with additional logic to create the masks
# get the offsets as well
word_ids, attention_mask, sequence_mask, start_offsets, end_offsets = BertTokenizerLayer(input_tensor)
_, sequence_output = BertKerasLayer(word_ids, attention_mask, sequence_mask, max_seq_len) # this is from tensorflow hub
# is there a reference implementation of this that already exists?
word_level_sequences = pool_subword_to_word(sequence_output, start_offsets, end_offsets)
In the above example the tokenizer creates the following 8 subword tokens:
[[[b'taste'], [b'the'], [b'rust', b'##is', b'##c'], [b'indie', b'##fr', b'##ost']]]
Which, after being sent through BERT, produces a rank-3 tensor with padding starting at sequence_output[:, 8:, :].
pool_subword_to_word would give back a rank-3 tensor with zeros after word_level_sequences[:, :4, :].
This is useful for training on targets that are at the word level, such as dependency parsing.
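Concretely, pool_subword_to_word could look something like this (just a sketch of what we're after, not a reference implementation; it assumes sequence_output holds exactly the wordpieces, padded at the end, with the special tokens already sliced off):

import tensorflow as tf

def pool_subword_to_word(sequence_output, start_offsets, end_offsets=None):
    # sequence_output: [batch, max_pieces, dim]
    # start_offsets:   RaggedTensor [batch, (words), (pieces)]; end_offsets is unused here,
    #                  only the ragged shape of start_offsets is needed
    pieces_per_word = start_offsets.row_lengths(axis=2)          # [batch, (words)]
    pieces_per_example = tf.reduce_sum(pieces_per_word, axis=1)  # [batch]

    # Strip the padding, then regroup the flat piece vectors word by word
    ragged_pieces = tf.RaggedTensor.from_tensor(sequence_output, lengths=pieces_per_example)
    per_word = tf.RaggedTensor.from_row_lengths(ragged_pieces.values, pieces_per_word.values)

    # Swap reduce_mean for reduce_max / reduce_sum for the other pooling variants
    pooled = tf.reduce_mean(per_word, axis=1)                    # [total_words, dim]
    word_level = tf.RaggedTensor.from_row_lengths(pooled, pieces_per_word.row_lengths())
    return word_level.to_tensor()                                # zeros after the last real word

Passing shape=[None, max_words, None] to to_tensor() would pad to a fixed number of words instead of the batch maximum.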
@gregbillock We want to combine the subword embeddings back to the word level so we can train on targets that are at the word level.
I see. We should define an idiomatic way of doing this, but I don't think we have one at the moment. Let me check around and get some input from others and get back to you.
@tc-wolf Would you prefer the same thing, or is it that you want the wordpieces and to know how many words they represent?
I'm also trying to do the same thing as @r-wheeler in order to pool the subword vectors to a word-level representation. My original comment should have said "number of tokens per word".
A quick update: I've been reaching out to others and looking at different solutions. Separately, we've been spending a lot of time planning how to best handle scopes (e.g. subword, word, sentence, etc.), and this falls right in line with that. We should be able to deliver a solution that fits with this work. The person who has been mainly focused on ops to handle these scopes is out until next week, though, and I want to make sure he is part of the discussions. We can then move this common use case up in priority.
Any update on this feature for now?
I wrote my own solution for this:
The following custom layer merges subwords using a dedicated input that indicates which subwords belong together.
import numpy as np
import tensorflow as tf


class MergeSubwordsLayer(tf.keras.layers.Layer):
    """Merges consecutive subword embeddings to form fullword embeddings."""

    def __init__(self, **kwargs):
        super(MergeSubwordsLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        super(MergeSubwordsLayer, self).build(input_shape)

    def _merge_subwords(self, subword_vectors, full_word_indexes):
        ragged_fw_indexes = tf.RaggedTensor.from_tensor(full_word_indexes, padding=-1, ragged_rank=2)
        fullword_vectors = tf.gather(subword_vectors, ragged_fw_indexes, batch_dims=1)
        # Reduce subwords by sum, mean or any other operation
        fullword_vectors = tf.math.reduce_sum(fullword_vectors, axis=-2).to_tensor()
        return fullword_vectors

    def call(self, subword_vectors, full_word_indexes):
        fullword_embeddings = self._merge_subwords(subword_vectors, full_word_indexes)
        batch_size, _, embedding_dim = subword_vectors.shape
        _, num_fullwords, _ = full_word_indexes.shape
        fullword_embeddings.set_shape((batch_size, num_fullwords, embedding_dim))
        return fullword_embeddings

    def get_config(self):
        config = {}
        config.update(super(MergeSubwordsLayer, self).get_config())
        return config
This input is full_word_indexes, which looks like:
tokens: ['joseph', 'harold', 'greenberg', 'may', '28', '1915', 'may', '7', '2001', 'was', 'an']
subtokens: ['joseph', 'har', '##old', 'green', '##berg', 'may', '2', '##8', '1', '##9', '##1', '##5', 'may', '7', '2', '##0', '##0', '##1', 'was', 'an']
full_word_indexes: [[0] [1 2] [3 4] [5] [6 7] [8 9 10 11] [12] [13] [14 15 16 17] [18] [19]]
However, since I'm doing this inside a tf.py_function, I need to return a NumPy array with a dense shape. So it looks something like:
curr_index = -1
count = 0
full_word_indexes = np.zeros((self.max_words_len, self.max_tokens_len), dtype=np.int32) - 1
for i, subtoken in enumerate(subtokens):
    if subtoken[:2] != '##':
        curr_index += 1
        count = 0
    full_word_indexes[curr_index][count] = i
    count += 1
Where self.max_words_len is the maximum number of full words in an input sentence and self.max_tokens_len is the maximum number of subword tokens in an input sentence.
The output shape for this layer is (batch_size, self.max_words_len, embedding_dim).
And here's a small sample of how to use it:
from tensorflow.keras.layers import Input, Dropout, Conv1D
from tensorflow.keras.models import Model

inp = Input((self.max_tokens_len,), dtype=tf.int32, name='sub_tokens')
inp_full_word_indexes = Input((self.max_words_len, self.max_tokens_len), dtype=tf.int32, name='full_word_indexes')
var = bert_model(inp)
var = Dropout(dropout_rate)(var)
full_var = MergeSubwordsLayer()(var, inp_full_word_indexes)
full_var = Conv1D(self.vocab_size, kernel_size=1, activation='softmax', name='full_out')(full_var)
inputs = {
    'sub_tokens': inp,
    'full_word_indexes': inp_full_word_indexes,
}
outputs = {
    'full_out': full_var,
}
model = Model(inputs, outputs, name='word_vec')
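(As an aside: the full_word_indexes loop above runs in a tf.py_function. A possible in-graph alternative, sketched here under the assumption that the wordpiece strings are available as a dense ''-padded tensor named subtokens, is to detect the '##' continuation marker with TF string ops; the resulting per-piece word ids could then feed a segment-based pooling instead of the dense index matrix.)

import tensorflow as tf

# subtokens: dense [batch, max_tokens_len] tensor of wordpiece strings,
# padded with '' past the end of each sentence (padding convention assumed here)
is_new_word = tf.logical_and(
    tf.not_equal(subtokens, ''),
    tf.logical_not(tf.strings.regex_full_match(subtokens, r'##.*')))

# Running count of word starts gives the full-word index of every subword piece,
# e.g. for the 'greenberg' example above: [0, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5, ...]
# Padding positions inherit the last word's id and can be masked out afterwards.
word_id_per_piece = tf.cumsum(tf.cast(is_new_word, tf.int32), axis=1) - 1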
Notes:
- tf.py_function affects the performance of the data generator. Try to avoid it if you can.
- The merging itself uses tf.gather, which is super fast during training/execution. It won't slow down your training times.

I am using the lines below, indexing by offsets (the token_start_indexes variable in the code), to get word-level embeddings from subword-level embeddings; but I don't think it is an ideal solution, as it makes the model slower.
x = pretrained_embedings_on_subword_level[:, token_start_indexes]
li = [x[i][i] for i, batch_i in enumerate(x)]
pretrained_embedings_on_word_level = tf.stack(li, axis=0)
To get the offsets (here, the first subword of each word), the lines below are used:
# some tokens might be split into multiple subtokens; to keep the features' true sequence, token_lengths is used
# pretrained tokenizer tokens : 'varlığından', 'mutlu', 'olunduğunu', 'bilmesi', ',', 'önemsen', '##diğini', 'hissetmesi',
# ud dataset tokens : 'varlığından', 'mutlu', 'olunduğunu', 'bilmesi', ',', 'önemsendiğini', 'hissetmesi',
# token_lengths : 1, 1, 1, 1, 1, 2, 1,
# token_start_indexes : 0, 1, 2, 3, 4, 5, 7,
token_lengths = [ [1] + [len(tokenizer.tokenize(w)) for w in s] + [1] for s in sentences_as_txtlist]
token_start_indexes=[[0 if i==0 else sum(s[:i]) for i,_ in enumerate(s)] for s in token_lengths]
You can also do it using the gather function, as shown in this notebook. It is more efficient than my previous solution.
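(The notebook isn't reproduced here, but the gather-based idea is presumably along these lines, taking each word's first-subword vector as its representation; the names and padding handling below are assumptions:)

import tensorflow as tf

# subword_embeddings:  [batch, max_seq_len, dim] output of the pretrained model
# token_start_indexes: [batch, max_words] int tensor of first-subword positions per word,
#                      padded to max_words (padding handling assumed)
word_embeddings = tf.gather(subword_embeddings, token_start_indexes, batch_dims=1)
# word_embeddings: [batch, max_words, dim] -- one vector per word, no Python loop or tf.stack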
We currently have some code that runs the BERT op in the graph. Do you have a method of pooling the BERT tokens back to the word level? Just curious if you have an idiomatic way of doing this.
Currently the word level is flattened out in the call to merge_dims(). However, it would be nice to merge the subword vectors (after sending them through BERT) back to the word level via some pooling operation.