tensorflow / text

Making text a first-class citizen in TensorFlow.
https://www.tensorflow.org/beta/tutorials/tensorflow_text/intro
Apache License 2.0

[Question] Pool bert subwords back to word level? #275

Open r-wheeler opened 4 years ago

r-wheeler commented 4 years ago

We currently have some code that runs the BERT op in the graph. Do you have a method of pooling the BERT tokens back to the word level? Just curious whether you have an idiomatic way of doing this.

Currently the word-level structure is flattened out in the call to merge_dims().

However, it would be nice to merge the subword vectors (after sending them through BERT) back to the word level via some pooling operation.

        tokens = self.tokenizer.tokenize(raw_text)

        # the tokenizer produces subword level ragged tensors
        # these need to be merged back to be word level per utterance
        # merge_dims() flattens [[[]]] -> [[]]

        # Trim the ragged tokens to max_seq_len - 2 (to account for CLS/SEP)
        ragged_tokens = tokens.merge_dims(
            inner_axis=2, outer_axis=1)[:, :self.max_seq_len - 2]
        ragged_tokens = tf.cast(ragged_tokens, tf.int32)

        # Concat CLS/SEP before conversion to sparse
        cls_tokens = tf.reshape(
            tf.tile([_CLS_ID], [tokens.nrows()]), [tokens.nrows(), 1])
        sep_tokens = tf.reshape(
            tf.tile([_SEP_ID], [tokens.nrows()]), [tokens.nrows(), 1])
        # add CLS and SEP to start and end
        ragged_tokens = tf.concat([cls_tokens,
                                   ragged_tokens,
                                   sep_tokens], axis=1)

        # to dense, fill in 0 with _PAD
        input_word_ids = ragged_tokens.to_tensor(default_value=_PAD_ID)

        paddings = [[0, 0],
                    [0, self.max_seq_len - tf.shape(input_word_ids)[1]]]

        input_word_ids = tf.pad(input_word_ids, paddings,
                                'CONSTANT', constant_values=_PAD_ID)

        # calculate the input masks and cast
        input_mask = tf.where((input_word_ids == _PAD_ID) |
                              (input_word_ids == _CLS_ID) |
                              (input_word_ids == _SEP_ID),
                              0,
                              tf.ones(self.max_seq_len, tf.int32))

        # calculate the segment ids
        segment_ids = tf.cast(input_word_ids > 0, tf.int32)

gregbillock commented 4 years ago

Are you looking for a "de-tokenizer" to recombine wordpieces? We're actively working on an approach for that. Or do you mean "combine decisions made by my model on wordpieces back to the word level"? I'll add someone who can talk more about that to the bug.

r-wheeler commented 4 years ago

We are not looking to de-tokenize, but rather to pool (average|min|max) the subwords back to rank-3 word-level embeddings after sending them through BERT. It's not immediately clear how to keep track of the subword positions:

for example, for the given text:

tokenizer.tokenize('here are some tokens') # this is the bert tokenizer

This produces the tokens:

<tf.RaggedTensor [[[19353], [10301], [11152], [18436, 12457]]]>

There are 5 subword-level tokens, but the ragged tensor keeps track of the word level, indicating there are 4 words split on whitespace. This information is lost in the example above due to

tokens.merge_dims(
            inner_axis=2, outer_axis=1)[:, :self.max_seq_len - 2]

After sending the tokens through BERT we would like to have word-level vectors. This is useful for algorithms where the labels are at the word level (as opposed to the subword level), such as dependency parsing.
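
As a minimal sketch of the structure involved (using the ids from the example above; this is not an existing tensorflow_text helper), the per-word subword counts can be read off the ragged output before it is flattened:

import tensorflow as tf

# The ragged tokenizer output from the example above: [batch, (words), (wordpieces)]
tokens = tf.ragged.constant([[[19353], [10301], [11152], [18436, 12457]]])

# Number of wordpieces per word: [[1, 1, 1, 2]] -- exactly the structure that
# merge_dims() drops, and enough to map each BERT output vector back to its word.
subwords_per_word = tokens.row_lengths(axis=2)

# Flattened wordpiece ids to feed to BERT: [[19353, 10301, 11152, 18436, 12457]]
wordpiece_ids = tokens.merge_dims(1, 2)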

tc-wolf commented 4 years ago

I was able to get the words with number of tokens by having the tokenizer return start/end offsets and then do something like:

word_tok_lengths = starting_offsets.row_lengths(axis=2).to_tensor()
# get num_words as tensor from word_tok_lengths
word_ids = tf.range(0, num_words + 2) # for [CLS] and [SEP]
word_ids_per_token = tf.repeat(word_ids, word_tok_lengths)

pooled_by_word = tf.math.unsorted_segment_mean(subword_level_embeddings, word_ids_per_token, num_segments=num_words + 2)

This neglects a lot of padding / getting num_words, etc. This isn't ideal, though, because I wasn't able to do it as a fully vectorized operation - I had to loop over each element in the batch (and XLA doesn't work with tf.repeat b/c of an internal call to tf.where).

Please let me know if there's a more idiomatic way of doing this - the tf.math.unsorted_segment_foo functions work well for doing the pooling, but assigning the tokens to words isn't easy to do in a vectorized way.
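
One possible fully batched way to assign a word id to every token position without tf.repeat is to cumulative-sum the per-word token counts and use tf.searchsorted. This is a rough sketch with made-up shapes and values (not an official API, and whether it actually helps with XLA would need checking):

import tensorflow as tf

# Wordpieces per word, zero-padded to a common width: [batch, max_words]
word_tok_lengths = tf.constant([[1, 2, 2, 0],
                                [1, 1, 3, 1]])
max_seq_len = 8

# Cumulative word boundaries, e.g. [1, 2, 2, 0] -> [1, 3, 5, 5]
boundaries = tf.cumsum(word_tok_lengths, axis=-1)

# Every token position in the padded sequence: [batch, max_seq_len]
positions = tf.broadcast_to(tf.range(max_seq_len), [2, max_seq_len])

# Word id of each token = number of boundaries <= its position; all padding
# positions fall into one extra trailing segment that can be dropped after pooling.
word_ids_per_token = tf.searchsorted(boundaries, positions, side='right')
# [[0, 1, 1, 2, 2, 4, 4, 4],
#  [0, 1, 2, 2, 2, 3, 4, 4]]

These ids can then be fed to tf.math.unsorted_segment_mean per example (or offset by batch index for a single call), and everything stays static-shaped.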

broken commented 4 years ago

@r-wheeler Perhaps you just want to use a different tokenizer? IIRC, the BertTokenizer is a regex split and some normalization followed by WordpieceTokenizer. Is it correct to say you want everything but the WordpieceTokenizer?

@tc-wolf Would you prefer the same thing, or is it that you want the wordpieces and to know how many words they represent?

r-wheeler commented 4 years ago

Hey @broken, thanks for taking a look -- sorry for not being clear. We are attempting to use the BERT tokenizer, send the tokens through BERT, then pool BERT's rank-3 sequence_output back to the word level. Here pooling means taking the average|max|sum of each word's subword vectors (not over all the words), using the sequence_output together with the offsets returned from the BertTokenizer.

In pseudocode:

input_tensor = tf.constant(['taste the rustisc indiefrost'])

# use tf_text tokenizer with additional logic to create the masks 
# get the offsets as well
word_ids, attention_mask, sequence_mask, start_offsets, end_offsets = BertTokenizerLayer(input_tensor)
_, sequence_output = BertKerasLayer(word_ids, attention_mask, sequence_mask, max_seq_len) # this is from TensorFlow Hub

# is there a reference implementation of this that already exists? 
word_level_sequences = pool_subword_to_word(sequence_output, start_offsets, end_offsets)

In the above example the tokenizer creates the following 8 subword tokens:

[[[b'taste'], [b'the'], [b'rust', b'##is', b'##c'],
  [b'indie', b'##fr', b'##ost']]]

which, after being sent through BERT, produces a rank-3 tensor with padding starting at sequence_output[:, 8:, :]

pool_subword_to_word would give back a rank-3 tensor that is zero beyond word_level_sequences[:, :4, :], since there are 4 words.

This is useful for training on word-level targets such as dependency parsing, as mentioned above.
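
A rough sketch of what pool_subword_to_word could look like (this is not an existing tensorflow_text API; it assumes start_offsets keeps the ragged [batch, (words), (wordpieces)] structure from BertTokenizer.tokenize_with_offsets, that sequence_output rows line up with the wordpieces with no [CLS]/[SEP] rows, and it mean-pools):

import tensorflow as tf

def pool_subword_to_word(sequence_output, start_offsets):
    # sequence_output: [batch, max_seq_len, hidden] wordpiece outputs (padded).
    # start_offsets: RaggedTensor [batch, (words), (wordpieces)] from the tokenizer.
    subwords_per_word = start_offsets.row_lengths(axis=2)            # [batch, (words)]
    subwords_per_example = tf.reduce_sum(subwords_per_word, axis=1)  # [batch]

    # Drop the padding rows and flatten the batch: flat_values is [total_subwords, hidden].
    ragged_out = tf.RaggedTensor.from_tensor(sequence_output,
                                             lengths=subwords_per_example)
    # One word id per subword, unique across the whole batch.
    counts = subwords_per_word.flat_values                           # [total_words]
    word_ids = tf.repeat(tf.range(tf.shape(counts)[0]), counts)

    pooled = tf.math.unsorted_segment_mean(
        ragged_out.flat_values, word_ids, num_segments=tf.shape(counts)[0])

    # Re-batch to [batch, max_words, hidden], zero past each example's last word.
    return tf.RaggedTensor.from_row_lengths(
        pooled, subwords_per_word.row_lengths()).to_tensor()

end_offsets is not needed for the grouping itself, and the tf.repeat call could be swapped for the searchsorted trick above if XLA compatibility is a concern.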

r-wheeler commented 4 years ago

@gregbillock Combine the subword embeddings back to the word level so we can train on targets that are at the word level.

broken commented 4 years ago

I see. We should define an idiomatic way of doing this, but I don't think we have one at the moment. Let me check around and get some input from others and get back to you.

tc-wolf commented 4 years ago

@tc-wolf Would you prefer the same thing, or is it that you want the wordpieces and to know how many words they represent?

I'm also trying to do the same thing as @r-wheeler in order to pool the subword vectors to a word-level representation. My original comment should have said "number of tokens per word".

broken commented 4 years ago

A quick update: I've been reaching out to others and looking at different solutions. Separately, we've been spending a lot of time planning how to best handle scopes (e.g. subword, word, sentence, etc.), and this falls right in line with that. We should be able to deliver a solution that is in line with this work. The person who has been mainly focused on ops to handle these scopes is out until next week though, and I want to make sure he is part of the discussions. We can then move this common use case up in priority.

Mddct commented 3 years ago

Any update on this feature?

andreselizondo-adestech commented 3 years ago

I wrote my own solution for this:

The following custom layer merges subword embeddings using an extra input that indicates which subwords belong together.

import numpy as np
import tensorflow as tf

class MergeSubwordsLayer(tf.keras.layers.Layer):
    """Merges consecutive subword embeddings to form fullword embeddings."""

    def __init__(self, **kwargs):
        super(MergeSubwordsLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        super(MergeSubwordsLayer, self).build(input_shape)

    def _merge_subwords(self, subword_vectors, full_word_indexes):
        ragged_fw_indexes = tf.RaggedTensor.from_tensor(full_word_indexes, padding=-1, ragged_rank=2)

        fullword_vectors = tf.gather(subword_vectors, ragged_fw_indexes, batch_dims=1)

        # Reduce subwords by sum, mean or any other operation
        fullword_vectors = tf.math.reduce_sum(fullword_vectors, axis=-2).to_tensor()

        return fullword_vectors

    def call(self, subword_vectors, full_word_indexes):
        fullword_embeddings = self._merge_subwords(subword_vectors, full_word_indexes)

        batch_size, _, embedding_dim = subword_vectors.shape
        _, num_fullwords, _ = full_word_indexes.shape
        fullword_embeddings.set_shape((batch_size, num_fullwords, embedding_dim))

        return fullword_embeddings

    def get_config(self):
        config = {
        }
        config.update(super(MergeSubwordsLayer, self).get_config())

        return config

This input is full_word_indexes. For example:

tokens:            ['joseph', 'harold', 'greenberg', 'may', '28', '1915', 'may', '7', '2001', 'was', 'an']
subtokens:         ['joseph', 'har', '##old', 'green', '##berg', 'may', '2', '##8', '1', '##9', '##1', '##5', 'may', '7', '2', '##0', '##0', '##1', 'was', 'an']
full_word_indexes: [[0] [1 2] [3 4] [5] [6 7] [8 9 10 11] [12] [13] [14 15 16 17] [18] [19]]

However, since I'm doing this inside a tf.py_function, I need to return a np array with a dense shape. So it looks something like:

curr_index = -1
count = 0
full_word_indexes = np.zeros((self.max_words_len, self.max_tokens_len), dtype=np.int32) - 1
for i, subtoken in enumerate(subtokens):
    if subtoken[:2] != '##':
        curr_index += 1
        count = 0
    full_word_indexes[curr_index][count] = i
    count += 1

Here self.max_words_len is the maximum number of full words in an input sentence and self.max_tokens_len is the maximum number of subword tokens in an input sentence. The output shape of this layer is (batch_size, self.max_words_len, embedding_dim).
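
If staying in-graph is preferred, the same mapping could be sketched without a tf.py_function by testing for the '##' prefix with string ops (illustrative only; it produces a word index per subtoken rather than the padded index matrix above):

import tensorflow as tf

subtokens = tf.constant(['joseph', 'har', '##old', 'green', '##berg'])

# 1 where a subtoken starts a new word (no '##' prefix), else 0.
starts_word = tf.cast(
    tf.not_equal(tf.strings.substr(subtokens, 0, 2), '##'), tf.int32)

# Word index of every subtoken: [0, 1, 1, 2, 2]
word_index_per_subtoken = tf.cumsum(starts_word) - 1

word_index_per_subtoken can then feed a segment reduction (as in the earlier comments) instead of the gather-based full_word_indexes matrix.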

And here's a small sample of how to use the layer:

from tensorflow.keras.layers import Conv1D, Dropout, Input
from tensorflow.keras.models import Model

inp = Input((self.max_tokens_len, ), dtype=tf.int32, name='sub_tokens')
inp_full_word_indexes = Input((self.max_words_len, self.max_tokens_len), dtype=tf.int32, name='full_word_indexes')

var = bert_model(inp)
var = Dropout(dropout_rate)(var)
full_var = MergeSubwordsLayer()(var, inp_full_word_indexes)
full_var = Conv1D(self.vocab_size, kernel_size=1, activation='softmax', name='full_out')(full_var)

inputs = {
    'sub_tokens': inp,
    'full_word_indexes': inp_full_word_indexes
}

outputs = {
    'full_out': full_var,
}

model = Model(inputs, outputs, name='word_vec')

Notes:

maltintas45 commented 3 years ago

I am using the lines below, via offsets (the token_start_indexes variable in the code), to get word-level embeddings from subword-level embeddings; but I think it is not an ideal solution, since it makes the model slower.

# gather each sentence's first-subword embeddings; this assumes token_start_indexes
# is padded to a rectangular shape, and it adds an extra batch axis
x = tf.gather(pretrained_embedings_on_subword_level, token_start_indexes, axis=1)
# keep only each sentence's own index list (the "diagonal" of the first two axes)
li = [x[i][i] for i, _ in enumerate(token_start_indexes)]
pretrained_embedings_on_word_level = tf.stack(li, axis=0)

To get the offsets (here, the index of each word's first subword), the lines below are used:

# some tokens may be split into multiple subtokens; token_lengths is used to keep the features aligned with the true sequence
# pretrained tokenizer tokens   : 'varlığından',   'mutlu',   'olunduğunu',   'bilmesi',   ',',   'önemsen',   '##diğini',   'hissetmesi',
# ud dataset tokens             : 'varlığından',   'mutlu',   'olunduğunu',   'bilmesi',   ',',   'önemsendiğini',           'hissetmesi',
# token_lengths                 :   1,                 1,            1,           1,        1,        2,                       1,
# token_start_indexes           :   0,                 1,            2,           3,        4,        5,                       7,
token_lengths = [[1] + [len(tokenizer.tokenize(w)) for w in s] + [1] for s in sentences_as_txtlist]
token_start_indexes = [[0 if i == 0 else sum(s[:i]) for i, _ in enumerate(s)] for s in token_lengths]

maltintas45 commented 3 years ago

You can also do it using the gather function, as shown in this notebook. It is more efficient than my previous solution.
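
For readers without access to the notebook, the gather-based idea is roughly this (a sketch with made-up shapes and values; it keeps each word's first-subword vector rather than pooling all of a word's subwords):

import tensorflow as tf

subword_embeddings = tf.random.normal([2, 10, 768])        # [batch, seq_len, hidden]
token_start_indexes = tf.constant([[0, 1, 2, 3, 4, 5, 7],  # first-subword positions,
                                   [0, 1, 2, 3, 4, 5, 6]]) # padded per sentence

# One batched gather instead of a Python loop over the batch.
word_embeddings = tf.gather(subword_embeddings, token_start_indexes, batch_dims=1)
# word_embeddings: [batch, num_words, hidden]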