strongio / keras-bert

A simple technique to integrate BERT from TF Hub into Keras

Multiclass classification #3

Closed plchld closed 5 years ago

plchld commented 5 years ago

Hello! Thanks for the notebook, it is really helpful! I am trying to make it work for multiclass classification but I am having some difficulties. My dataset consists of strings with multiple labels, which I one-hot encode before I train/test split them and feed them into the `InputExample` class. It seems to work after that, but when I try to call the model later on it gives me the following error.

"Input arrays should have the same number of samples as target arrays. Found 10251 input samples and 51255 target samples."

I suspect it has something to do with how it converts y to features, since 10251 x 5 = 51255 and I have 5 classes. Is there something inherent to binary classification in your code that would raise this error?

jacobzweig commented 5 years ago

Hey @v4d0k – in `convert_examples_to_features` I reshape the labels array in the function return with `np.array(labels).reshape(-1, 1)`. You'd want to change that to match the shape of your labels.
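
For example, the changed return line might look like this, assuming `labels` holds one-hot vectors of shape `(n_samples, n_classes)`:

```python
import numpy as np

# Binary case (original): collapse to a column vector, shape (n_samples, 1)
# np.array(labels).reshape(-1, 1)

# One-hot multiclass case: keep the (n_samples, n_classes) matrix as-is
np.array(labels)  # e.g. shape (10251, 5) for the 5-class example above
```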

plchld commented 5 years ago

Thank you!

AxeldeRomblay commented 5 years ago

Hello,

Like @v4d0k, I have tried to apply your code to a multi-label problem (where each text/description can belong to several classes; e.g. "Cristiano Ronaldo amazing goal vs Juventus" belongs to both the "sport" and "football" classes). I have removed the `.reshape(-1, 1)` in `convert_examples_to_features`, and the model compiles successfully, but I get the following error when fitting it:

```
InvalidArgumentError: logits must be 2-dimensional
  [[Node: bert_layer_1/bert_layer_2_module_apply_tokens/bert/encoder/layer_0/attention/self/Softmax = Softmax[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](bert_layer_1/bert_layer_2_module_apply_tokens/bert/encoder/layer_0/attention/self/add)]]
```

Any ideas? Also, could you specify which TensorFlow version you are using?
Thank you very much!

Abhinav43 commented 5 years ago

@AxeldeRomblay have you found out how to use this for multi-class?

Abhinav43 commented 5 years ago

@jacobzweig What changes do I have to make in your model for multi-class?

Abhinav43 commented 5 years ago

@v4d0k did you find out how to make multi-class work? Please share what changes we have to make.

AxeldeRomblay commented 5 years ago

@Abhinav43 it was just a version issue... Make sure you have TensorFlow 1.14 and it should work! :)
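
A quick sanity check, if you want to confirm before rerunning:

```python
import tensorflow as tf

print(tf.__version__)  # should print 1.14.x for this notebook to work
```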

Abhinav43 commented 5 years ago

@AxeldeRomblay I am getting other errors, and I am not getting good accuracy on multi-class. Can you share the code, or tell me what I have to change in the code for multi-label?

AxeldeRomblay commented 5 years ago

@Abhinav43 here is the code:

```python
class BertLayer(tf.keras.layers.Layer):
    def __init__(
        self,
        n_fine_tune_layers=0,
        pooling="first",
        bert_path=BERT_PATH,
        **kwargs,
    ):
        self.n_fine_tune_layers = n_fine_tune_layers
        self.trainable = True
        self.output_size = BERT_OUTPUT
        self.pooling = pooling
        self.bert_path = bert_path
        if self.pooling not in ["first", "mean"]:
            raise NameError(
                "Undefined pooling type (must be either first or mean)"
            )

        super(BertLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        self.bert = hub.Module(
            self.bert_path,
            trainable=self.trainable,
            name="{}_module".format(self.name),
        )

        # Remove unused layers
        trainable_vars = self.bert.variables
        if self.pooling == "first":
            trainable_vars = [
                var for var in trainable_vars if "/cls/" not in var.name
            ]
            trainable_layers = ["pooler/dense"]

        elif self.pooling == "mean":
            trainable_vars = [
                var
                for var in trainable_vars
                if "/cls/" not in var.name and "/pooler/" not in var.name
            ]
            trainable_layers = []
        else:
            raise NameError(
                "Undefined pooling type (must be either first or mean)"
            )

        # Select how many layers to fine tune
        for i in range(self.n_fine_tune_layers):
            trainable_layers.append("encoder/layer_{}".format(str(11 - i)))

        # Update trainable vars to contain only the specified layers
        trainable_vars = [
            var
            for var in trainable_vars
            if any([l in var.name for l in trainable_layers])
        ]

        # Add to trainable weights
        for var in trainable_vars:
            self._trainable_weights.append(var)

        for var in self.bert.variables:
            if var not in self._trainable_weights:
                self._non_trainable_weights.append(var)

        super(BertLayer, self).build(input_shape)

    def call(self, inputs):
        inputs = [K.cast(x, dtype="int32") for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(
            input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids
        )
        if self.pooling == "first":
            pooled = self.bert(
                inputs=bert_inputs, signature="tokens", as_dict=True
            )["pooled_output"]
        elif self.pooling == "mean":
            result = self.bert(
                inputs=bert_inputs, signature="tokens", as_dict=True
            )["sequence_output"]

            mul_mask = lambda x, m: x * tf.expand_dims(m, axis=-1)
            masked_reduce_mean = lambda x, m: tf.reduce_sum(
                mul_mask(x, m), axis=1
            ) / (tf.reduce_sum(m, axis=1, keepdims=True) + 1e-10)
            input_mask = tf.cast(input_mask, tf.float32)
            pooled = masked_reduce_mean(result, input_mask)
        else:
            raise NameError(
                "Undefined pooling type (must be either first or mean)"
            )

        return pooled

    def compute_output_shape(self, input_shape):
        return input_shape[0], self.output_size
```

AxeldeRomblay commented 5 years ago

```python
class InputExample(object):
    """A single training/test example for simple sequence classification."""

    def __init__(self, guid, text_a, text_b=None, label=None):
        """Constructs a InputExample.

        Args:
          guid: Unique id for the example.
          text_a: string. The untokenized text of the first sequence. For single
            sequence tasks, only this sequence must be specified.
          text_b: (Optional) string. The untokenized text of the second sequence.
            Only must be specified for sequence pair tasks.
          label: (Optional) string. The label of the example. This should be
            specified for train and dev examples, but not for test examples.
        """
        self.guid = guid
        self.text_a = text_a
        self.text_b = text_b
        self.label = label


def convert_single_example(tokenizer, example, max_seq_length=512):
    """Converts a single InputExample into a single InputFeatures."""

    if isinstance(example, PaddingInputExample):
        input_ids = [0] * max_seq_length
        input_mask = [0] * max_seq_length
        segment_ids = [0] * max_seq_length
        label = 0
        return input_ids, input_mask, segment_ids, label

    tokens_a = tokenizer.tokenize(example.text_a)
    if len(tokens_a) > max_seq_length - 2:
        tokens_a = tokens_a[0 : (max_seq_length - 2)]

    tokens = []
    segment_ids = []
    tokens.append("[CLS]")
    segment_ids.append(0)
    for token in tokens_a:
        tokens.append(token)
        segment_ids.append(0)
    tokens.append("[SEP]")
    segment_ids.append(0)

    input_ids = tokenizer.convert_tokens_to_ids(tokens)

    # The mask has 1 for real tokens and 0 for padding tokens. Only real
    # tokens are attended to.
    input_mask = [1] * len(input_ids)

    # Zero-pad up to the sequence length.
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)

    assert len(input_ids) == max_seq_length
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length

    return input_ids, input_mask, segment_ids, example.label


def convert_examples_to_features(tokenizer, examples, max_seq_length=512):
    """Convert a set of InputExamples to a list of InputFeatures."""

    input_ids, input_masks, segment_ids, labels = [], [], [], []
    for example in examples:
        input_id, input_mask, segment_id, label = convert_single_example(
            tokenizer, example, max_seq_length
        )
        input_ids.append(input_id)
        input_masks.append(input_mask)
        segment_ids.append(segment_id)
        labels.append(label)
    return (
        np.array(input_ids).astype(np.int32),
        np.array(input_masks).astype(np.int32),
        np.array(segment_ids).astype(np.int32),
        np.array(labels),
    )


def convert_text_to_examples(texts, labels):
    """Create InputExamples"""
    InputExamples = []
    for text, label in zip(texts, labels):
        InputExamples.append(
            InputExample(
                guid=None, text_a=" ".join(text), text_b=None, label=label
            )
        )
    return InputExamples
```
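
And for the model itself, a minimal sketch along the lines of the notebook's `build_model` (simplified; `n_classes` and the layer sizes are placeholders, not my exact code). For multi-label targets the head is one sigmoid per class trained with binary cross-entropy; for single-label multiclass you would swap in softmax with categorical cross-entropy:

```python
import tensorflow as tf

def build_model(max_seq_length, n_classes):
    # The three inputs produced by convert_examples_to_features above
    in_id = tf.keras.layers.Input(shape=(max_seq_length,), name="input_ids")
    in_mask = tf.keras.layers.Input(shape=(max_seq_length,), name="input_masks")
    in_segment = tf.keras.layers.Input(shape=(max_seq_length,), name="segment_ids")
    bert_inputs = [in_id, in_mask, in_segment]

    bert_output = BertLayer(n_fine_tune_layers=3)(bert_inputs)
    dense = tf.keras.layers.Dense(256, activation="relu")(bert_output)

    # Multi-label head: independent sigmoids + binary cross-entropy.
    # For single-label multiclass: softmax + categorical_crossentropy.
    pred = tf.keras.layers.Dense(n_classes, activation="sigmoid")(dense)

    model = tf.keras.models.Model(inputs=bert_inputs, outputs=pred)
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model
```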

Abhinav43 commented 5 years ago

I am looking for model information. Should I use binary cross-entropy or another loss?

elsheikh21 commented 5 years ago

@AxeldeRomblay I am trying to integrate BERT as an embedding layer in my model; however, every time I get this traceback:

```
Traceback (most recent call last):
  File "code/prova_bert.py", line 230, in <module>
    model = baseline_model(output_size, max_seq_len, visualize=True)
  File "code/prova_bert.py", line 165, in baseline_model
    )(bert_embeddings)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\layers\wrappers.py", line 473, in __call__
    return super(Bidirectional, self).__call__(inputs, **kwargs)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 746, in __call__
    self.build(input_shapes)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\layers\wrappers.py", line 612, in build
    self.forward_layer.build(input_shape)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\utils\tf_utils.py", line 149, in wrapper
    output_shape = fn(instance, input_shape)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\layers\recurrent.py", line 552, in build
    self.cell.build(step_input_shape)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\utils\tf_utils.py", line 149, in wrapper
    output_shape = fn(instance, input_shape)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\layers\recurrent.py", line 1934, in build
    constraint=self.kernel_constraint)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 609, in add_weight
    aggregation=aggregation)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\training\checkpointable\base.py", line 639, in _add_variable_with_custom_getter
    **kwargs_for_getter)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1977, in make_variable
    aggregation=aggregation)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\ops\variables.py", line 183, in __call__
    return cls._variable_v1_call(*args, **kwargs)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\ops\variables.py", line 146, in _variable_v1_call
    aggregation=aggregation)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\ops\variables.py", line 125, in <lambda>
    previous_getter = lambda **kwargs: default_variable_creator(None, **kwargs)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\ops\variable_scope.py", line 2437, in default_variable_creator
    import_scope=import_scope)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\ops\variables.py", line 187, in __call__
    return super(VariableMetaclass, cls).__call__(*args, **kwargs)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\ops\resource_variable_ops.py", line 297, in __init__
    constraint=constraint)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\ops\resource_variable_ops.py", line 409, in _init_from_args
    initial_value() if init_from_fn else initial_value,
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\keras\engine\base_layer.py", line 1959, in <lambda>
    shape, dtype=dtype, partition_info=partition_info)
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\ops\init_ops.py", line 473, in __call__
    scale /= max(1., (fan_in + fan_out) / 2.)
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
Exception ignored in: <bound method BaseSession.__del__ of <tensorflow.python.client.session.Session object at 0x0000026396AD0630>>
Traceback (most recent call last):
  File "C:\Users\Sheikh\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\python\client\session.py", line 738, in __del__
TypeError: 'NoneType' object is not callable
```

Here is my model; I am using strongio's implementation for the BERT layer.

Here is an example of my data:

```
# before tokenization, both are list of lists
train_x[0] = ['how long have it be since you review the objective of you benefit and service program ?']
train_y[0] =  [101, 1365, 13, 14, 20, 127, 32, 7939, 2, 2977, 5, 32, 7570, 6, 25584, 3785, 45]

""" 
# after those lines
tokenizer = create_tokenizer_from_hub_module()
train_examples = convert_text_to_examples(train_text, train_labels)
"""
train_examples[0].text_a = 'how long have it be since you review the objective of you benefit and service program ?'
train_examples[0].label= [101, 1365, 13, 14, 20, 127, 32, 7939, 2, 2977, 5, 32, 7570, 6, 25584, 3785, 45]

"""
# After the following lines
# Extract features
(train_input_ids, train_input_masks, train_segment_ids, train_labels) = convert_examples_to_features(tokenizer, train_examples, max_seq_length=max_seq_len) # max_seq_len = 512
"""

train_input_ids[0] = [ 101 2129 2146 2031 2009 2022 2144 2017 3319 1996 7863 1997 2017 5770
 1998 2326 2565 1029  102    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0]

train_input_masks[0] = [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

train_segment_ids[0] = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

train_labels[0] = [101, 1365, 13, 14, 20, 127, 32, 7939, 2, 2977, 5, 32, 7570, 6, 25584, 3785, 45]
```

Thank you, and you may refer to my issue.

elsheikh21 commented 4 years ago

Solved my issue: stackoverflow
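
In case it helps others who hit the same trace: the `TypeError` comes from the kernel initializer computing `fan_in + fan_out` while one dimension is still `None`, i.e. a layer downstream of BERT saw an undefined feature dimension. A sketch of the kind of fix (my paraphrase, assuming a `BertLayer` variant that returns the 3-D `sequence_output`; `restore_static_shape` and the sizes are illustrative, not the exact code from the linked answer):

```python
import tensorflow as tf

def restore_static_shape(x, max_seq_len=512, hidden_size=768):
    # Hypothetical helper: re-attach the static shape the hub module drops,
    # so the LSTM kernel initializer sees concrete fan_in / fan_out values.
    x.set_shape((None, max_seq_len, hidden_size))
    return x

bert_embeddings = BertLayer(n_fine_tune_layers=3)(bert_inputs)  # assumed 3-D sequence output
bert_embeddings = tf.keras.layers.Lambda(restore_static_shape)(bert_embeddings)
hidden = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256))(bert_embeddings)
```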