google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Issue with multiclass text classification #449

Closed jplu closed 5 years ago

jplu commented 5 years ago

Hello,

I'm trying to run the Jupyter notebook for predicting the IMDB movie reviews, but on a different dataset. The codebase is basically the same except for the part that parses the dataset. My dataset has multiple labels, and when I run the notebook I get the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

Assign requires shapes of both tensors to match. lhs shape= [9] rhs shape= [2]
         [[node save_2/Assign_399 (defined at ad_classification.py:287)  = Assign[T=DT_FLOAT, _class=["loc:@output_bias"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:CPU:0"](output_bias, save_2/RestoreV2:603)]]
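
Reading the shapes in the message: lhs [9] is the output_bias the new nine-label graph creates, while rhs [2] is what the checkpoint in the output directory holds, i.e. a head saved from a two-label run. A minimal TF 1.x sketch for checking which shapes a checkpoint actually stores ('output_dir' here is the estimator's model_dir from the script below):

# Sketch: list the classifier-head variables stored in the latest checkpoint.
import tensorflow as tf

ckpt = tf.train.latest_checkpoint('output_dir')
if ckpt:
    for name, shape in tf.train.list_variables(ckpt):
        if 'output_bias' in name or 'output_weights' in name:
            print(name, shape)  # e.g. output_bias [2] -> saved from a 2-label run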

To reproduce this issue, one can run the following python code:

# coding=utf-8

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
import tensorflow_hub as hub
from datetime import datetime
import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization

OUTPUT_DIR = 'output_dir'
label_list = set()

# Load the tab-separated dataset file into a dict and split it into train/test.
def load_directory_data():
    data = {}
    data["sentence"] = []
    data["class"] = []
    with tf.gfile.GFile("./ad_all.train.head", "r") as f:
        for line in f:
            columns = line.strip().split("\t")
            if len(columns) == 2:
                data["sentence"].append(columns[1])
                label = columns[0].split("__label__")[1]
                data["class"].append(label)
                label_list.add(label)
    X_train, X_test, y_train, y_test = train_test_split(data["sentence"], data["class"], test_size=0.33, random_state=42)

    train = {}
    train["sentence"] = X_train
    train["class"] = y_train
    test = {}
    test["sentence"] = X_test
    test["class"] = y_test

    return pd.DataFrame.from_dict(train), pd.DataFrame.from_dict(test)

# Shuffle the train and test splits.
def load_dataset():
    df_train, df_test = load_directory_data()

    return df_train.sample(frac=1).reset_index(drop=True), df_test.sample(frac=1).reset_index(drop=True)

# Load the train and test DataFrames (nothing is downloaded in this version).
def download_and_load_datasets():
    train_df, test_df = load_dataset()

    return train_df, test_df

train, test = download_and_load_datasets()
label_list = list(label_list)
DATA_COLUMN = 'sentence'
LABEL_COLUMN = 'class'

# Use the InputExample class from BERT's run_classifier code to create examples from the data
train_InputExamples = train.apply(lambda x: bert.run_classifier.InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this example
                                                                             text_a=x[DATA_COLUMN],
                                                                             text_b=None,
                                                                             label=x[LABEL_COLUMN]),
                                                                             axis=1)

test_InputExamples = test.apply(lambda x: bert.run_classifier.InputExample(guid=None,
                                                                           text_a=x[DATA_COLUMN],
                                                                           text_b=None,
                                                                           label=x[LABEL_COLUMN]),
                                                                           axis=1)

# This is a path to an uncased (all lowercase) version of BERT
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

def create_tokenizer_from_hub_module():
    """Get the vocab file and casing info from the Hub module."""
    with tf.Graph().as_default():
        bert_module = hub.Module(BERT_MODEL_HUB)
        tokenization_info = bert_module(signature="tokenization_info", as_dict=True)
        with tf.Session() as sess:
            vocab_file, do_lower_case = sess.run([tokenization_info["vocab_file"],
                                                  tokenization_info["do_lower_case"]])

    return bert.tokenization.FullTokenizer(
        vocab_file=vocab_file, do_lower_case=do_lower_case)

tokenizer = create_tokenizer_from_hub_module()

# We'll set sequences to be at most 128 tokens long.
MAX_SEQ_LENGTH = 128
# Convert our train and test features to InputFeatures that BERT understands.
train_features = bert.run_classifier.convert_examples_to_features(train_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)
test_features = bert.run_classifier.convert_examples_to_features(test_InputExamples, label_list, MAX_SEQ_LENGTH, tokenizer)

def create_model(is_predicting, input_ids, input_mask, segment_ids, labels,
                 num_labels):
    """Creates a classification model."""

    bert_module = hub.Module(
        BERT_MODEL_HUB,
        trainable=True)
    bert_inputs = dict(
        input_ids=input_ids,
        input_mask=input_mask,
        segment_ids=segment_ids)
    bert_outputs = bert_module(
        inputs=bert_inputs,
        signature="tokens",
        as_dict=True)

    # Use "pooled_output" for classification tasks on an entire sentence.
    # Use "sequence_outputs" for token-level output.
    output_layer = bert_outputs["pooled_output"]

    hidden_size = output_layer.shape[-1].value

    # Create our own classification layer to fine-tune on our data.
    output_weights = tf.get_variable(
        "output_weights", [num_labels, hidden_size],
        initializer=tf.truncated_normal_initializer(stddev=0.02))

    output_bias = tf.get_variable(
        "output_bias", [num_labels], initializer=tf.zeros_initializer())

    with tf.variable_scope("loss"):
        # Dropout helps prevent overfitting
        output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

        logits = tf.matmul(output_layer, output_weights, transpose_b=True)
        logits = tf.nn.bias_add(logits, output_bias)
        log_probs = tf.nn.log_softmax(logits, axis=-1)

        # Convert labels into one-hot encoding
        one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

        predicted_labels = tf.squeeze(tf.argmax(log_probs, axis=-1, output_type=tf.int32))
        # If we're predicting, we want the predicted labels and the probabilities.
        if is_predicting:
            return (predicted_labels, log_probs)

        # If we're training/evaluating, compute loss between predicted and actual labels.
        per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
        loss = tf.reduce_mean(per_example_loss)
        return (loss, predicted_labels, log_probs)

# model_fn_builder actually creates our model function
# using the passed parameters for num_labels, learning_rate, etc.
def model_fn_builder(num_labels, learning_rate, num_train_steps,
                     num_warmup_steps):
    """Returns `model_fn` closure for TPUEstimator."""
    def model_fn(features, labels, mode, params):  # pylint: disable=unused-argument
        """The `model_fn` for TPUEstimator."""

        input_ids = features["input_ids"]
        input_mask = features["input_mask"]
        segment_ids = features["segment_ids"]
        label_ids = features["label_ids"]

        is_predicting = (mode == tf.estimator.ModeKeys.PREDICT)

        # TRAIN and EVAL
        if not is_predicting:

            (loss, predicted_labels, log_probs) = create_model(
                is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

            train_op = bert.optimization.create_optimizer(
                loss, learning_rate, num_train_steps, num_warmup_steps, use_tpu=False)

            # Calculate evaluation metrics.
            # Note: f1_score, auc, precision, recall, and the true/false
            # positive/negative counters below are binary metrics; with more
            # than two labels they fail with "predictions must be in [0, 1]".
            def metric_fn(label_ids, predicted_labels):
                accuracy = tf.metrics.accuracy(label_ids, predicted_labels)
                f1_score = tf.contrib.metrics.f1_score(
                    label_ids,
                    predicted_labels)
                auc = tf.metrics.auc(
                    label_ids,
                    predicted_labels)
                recall = tf.metrics.recall(
                    label_ids,
                    predicted_labels)
                precision = tf.metrics.precision(
                    label_ids,
                    predicted_labels)
                true_pos = tf.metrics.true_positives(
                    label_ids,
                    predicted_labels)
                true_neg = tf.metrics.true_negatives(
                    label_ids,
                    predicted_labels)
                false_pos = tf.metrics.false_positives(
                    label_ids,
                    predicted_labels)
                false_neg = tf.metrics.false_negatives(
                    label_ids,
                    predicted_labels)
                return {
                    "eval_accuracy": accuracy,
                    "f1_score": f1_score,
                    "auc": auc,
                    "precision": precision,
                    "recall": recall,
                    "true_positives": true_pos,
                    "true_negatives": true_neg,
                    "false_positives": false_pos,
                    "false_negatives": false_neg
                }

            eval_metrics = metric_fn(label_ids, predicted_labels)

            if mode == tf.estimator.ModeKeys.TRAIN:
                return tf.estimator.EstimatorSpec(mode=mode,
                                                  loss=loss,
                                                  train_op=train_op)
            else:
                return tf.estimator.EstimatorSpec(mode=mode,
                                                  loss=loss,
                                                  eval_metric_ops=eval_metrics)
        else:
            (predicted_labels, log_probs) = create_model(
                is_predicting, input_ids, input_mask, segment_ids, label_ids, num_labels)

            predictions = {
                'probabilities': log_probs,
                'labels': predicted_labels
            }
            return tf.estimator.EstimatorSpec(mode, predictions=predictions)

    # Return the actual model function in the closure
    return model_fn

# Compute train and warmup steps from batch size
# These hyperparameters are copied from this colab notebook (https://colab.sandbox.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb)
BATCH_SIZE = 32
LEARNING_RATE = 2e-5
NUM_TRAIN_EPOCHS = 3.0
# Warmup is a period of time where the learning rate
# is small and gradually increases--usually helps training.
WARMUP_PROPORTION = 0.1
# Model configs
SAVE_CHECKPOINTS_STEPS = 500
SAVE_SUMMARY_STEPS = 100

# Compute the number of train and warmup steps from batch size
num_train_steps = int(len(train_features) / BATCH_SIZE * NUM_TRAIN_EPOCHS) + 1

num_warmup_steps = int(num_train_steps * WARMUP_PROPORTION)

# Specify output directory and number of checkpoint steps to save
run_config = tf.estimator.RunConfig(
    model_dir=OUTPUT_DIR,
    save_summary_steps=SAVE_SUMMARY_STEPS,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS)

model_fn = model_fn_builder(
    num_labels=len(label_list),
    learning_rate=LEARNING_RATE,
    num_train_steps=num_train_steps,
    num_warmup_steps=num_warmup_steps)

estimator = tf.estimator.Estimator(
    model_fn=model_fn,
    config=run_config,
    params={"batch_size": BATCH_SIZE})

# Create an input function for training. drop_remainder = True for using TPUs.
train_input_fn = bert.run_classifier.input_fn_builder(
    features=train_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=True,
    drop_remainder=False)

print('Beginning Training!')
current_time = datetime.now()
estimator.train(input_fn=train_input_fn, max_steps=num_train_steps)
print("Training took time ", datetime.now() - current_time)

test_input_fn = run_classifier.input_fn_builder(
    features=test_features,
    seq_length=MAX_SEQ_LENGTH,
    is_training=False,
    drop_remainder=False)

estimator.evaluate(input_fn=test_input_fn, steps=None)

With a small part of the dataset:

__label__6  porsche 924 toute pièces
__label__29 dji-mavic-pro - combo-nombreux-accessoires
__label__43 console ps4 500 go avec jeux
__label__41 boîte de jeu les guignols de l ' info - ( lou72 ) - 8€
__label__21 fendeuse de bûches
__label__21 Équerre stanley 250mm
__label__39 2 étagères
__label__57 tondeuse viking mb 655 vs
__label__19 table d exterieur en teck + 8 chaises
__label__34 film solaire pose rapide

Also, if I replace all my labels with __label__1 and __label__0, it works.

I'm certainly doing something wrong but I do not see what. Thanks in advance for any help :)

jplu commented 5 years ago

OK, sorry, that was my bad. After turning on the logs I saw that I was loading an old checkpoint. Sorry.
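
A minimal sketch of the fix, assuming the stale two-label checkpoints live in OUTPUT_DIR from the script above: remove them (or point model_dir at a fresh directory) before retraining with the new number of labels.

# Sketch: clear old checkpoints so the estimator initializes a fresh
# classification head instead of restoring one with the wrong label count.
import tensorflow as tf

if tf.gfile.Exists(OUTPUT_DIR):
    tf.gfile.DeleteRecursively(OUTPUT_DIR)
tf.gfile.MakeDirs(OUTPUT_DIR)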

siabar commented 5 years ago

Hi, I am doing the same and have the same error. Can you explain how you solved this problem?

Single430 commented 5 years ago

@jplu Can you tell me the results of your multiclass model? Accuracy, recall, F1?

Thanks very much! super.single430

jplu commented 5 years ago

Well, on my dataset the accuracy@1 is around 91% and the accuracy@6 is 100%. I did not compute the precision, recall, and F1 because there was no point for me in having these scores.
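
For reference, accuracy@k can be computed from the log-probabilities the model already returns; a small sketch, assuming log_probs and label_ids as defined in create_model above:

# Sketch: fraction of examples whose true label is in the top-6 predictions.
in_top6 = tf.nn.in_top_k(predictions=log_probs, targets=label_ids, k=6)
accuracy_at_6 = tf.reduce_mean(tf.cast(in_top6, tf.float32))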

Single430 commented 5 years ago

@jplu Thank you very much for your reply. I will try to run your code and hope that I can get good results too.

Thanks very much! super.single430

fayandjeanie commented 5 years ago

I used your code with my own dataset, with three labels: label_list = ['p','n','0']. The dataset only contains the news text and a label with one of the three values listed above. After evaluating, I keep getting this message:

InvalidArgumentError (see above for traceback): assertion failed: [predictions must be in [0, 1]] [Condition x <= y did not hold element-wise:x (f1/remove_squeezable_dimensions/cond/Merge:0) = ] [0 2 0...] [y (f1/Cast_1:0) = ] [1]
     [[node f1/assert_less_equal/Assert/Assert (defined at <ipython-input-34-77345aa6cf2c>:30) ]]

But I thought there is a one-hot encoding step in the create_model function, so why am I still getting this error? Could anyone help me? Thank you.
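
The one-hot encoding only feeds the loss; the assertion is raised inside metric_fn, where f1_score, auc, precision, recall, and the true/false positive/negative counters are binary metrics that require predictions in [0, 1]. With three labels the predicted ids include 2, hence the failure. A sketch of a multiclass-safe metric_fn, assuming num_labels is in scope as in model_fn_builder above:

# Sketch: multiclass-safe metrics -- plain accuracy plus a per-class view.
def metric_fn(label_ids, predicted_labels):
    accuracy = tf.metrics.accuracy(label_ids, predicted_labels)
    per_class = tf.metrics.mean_per_class_accuracy(
        label_ids, predicted_labels, num_classes=num_labels)
    return {
        "eval_accuracy": accuracy,
        "mean_per_class_accuracy": per_class,
    }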

chikubee commented 5 years ago

Hey @jplu, I am getting a similar error even after appending the labels to the class names.

InvalidArgumentError: assertion failed: [predictions must be in [0, 1]] [Condition x <= y did not hold element-wise:x (loss/Squeeze:0) = ] [0 1 0...] [y (auc/Cast_1:0) = ] [1] [[node auc/assert_less_equal/Assert/Assert (defined at :34) ]]

Any leads?

Roechiiii commented 5 years ago

@chikubee I got the same error. Have you solved it already?

chikubee commented 5 years ago

@Roechiiii Check your label_list: it should be of the form ['0', '1', '2'], for example, or you can just update the ColaProcessor to return this as the label_list if it is not passed explicitly. Also, make sure your labels are strings in the training data.
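
A sketch of the consistency this relies on: run_classifier indexes labels by their position in label_list, so every label in the training data must appear there verbatim and with the same type (strings here):

# Sketch: label_list drives the label-to-id mapping
# (mirrors what convert_single_example builds internally).
label_list = ['0', '1', '2']
label_map = {label: i for i, label in enumerate(label_list)}
# Hypothetical check against the train DataFrame from the script above:
assert set(train['class']) <= set(label_list)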

Roechiiii commented 5 years ago

@chikubee Thank you very much for your quick reply. I tried to change it, but the error still occurs in the evaluation process. I also figured out that the model from the official colab, [Predicting Movie Reviews with BERT on TF Hub.ipynb](https://colab.research.google.com/github/google-research/bert/blob/master/predicting_movie_reviews_with_bert_on_tf_hub.ipynb#scrollTo=OsrbTD2EJTVl), is a binary classification problem. When I predict values with that model, I only get two classes back.
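
For what it's worth, the head size is set by num_labels=len(label_list) in model_fn_builder, so the colab graph is binary only because its label_list has two entries. A sketch for reading multiclass predictions back out, using the estimator, label_list, and test_input_fn from the script above:

# Sketch: predict() yields integer ids; map them back to label strings by position.
for p in estimator.predict(input_fn=test_input_fn):
    print(label_list[p['labels']], p['probabilities'])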

i-firstofmyname commented 3 years ago

@chikubee @Roechiiii Have you found a way to solve this issue? @Roechiiii says changing label_list to ['0','1','2'] still gives the same error.