huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TF2 version of Multilingual DistilBERT throws an exception [TensorFlow 2] #2462

Closed · amaiya closed this issue 4 years ago

amaiya commented 4 years ago

🐛 Bug

I'm finding that several of the TensorFlow 2.0 Sequence Classification models don't seem to work. Case in point: distilbert-base-uncased works but distilbert-base-multilingual-cased does not.

My environment is:

Note that I am using v2.3.0 of transformers with patch 1efc208 applied to work around this issue.

However, problems with distilbert-base-multilingual-cased occur in v2.2.0, as well.

Here is code to reproduce the problem.

# define constants
MODEL_NAME = 'distilbert-base-multilingual-cased'  # DOES NOT WORK
# MODEL_NAME = 'distilbert-base-uncased'   # WORKS if uncommented

BATCH_SIZE = 6
MAX_SEQ_LEN = 500

# imports and setup
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
import tensorflow as tf
from transformers import glue_convert_examples_to_features
from transformers import BertConfig, TFBertForSequenceClassification, BertTokenizer
from transformers import XLNetConfig, TFXLNetForSequenceClassification, XLNetTokenizer
from transformers import XLMConfig, TFXLMForSequenceClassification, XLMTokenizer
from transformers import RobertaConfig, TFRobertaForSequenceClassification, RobertaTokenizer
from transformers import DistilBertConfig, TFDistilBertForSequenceClassification, DistilBertTokenizer
from transformers import AlbertConfig, TFAlbertForSequenceClassification, AlbertTokenizer

TRANSFORMER_MODELS = {
    'bert':       (BertConfig, TFBertForSequenceClassification, BertTokenizer),
    'xlnet':      (XLNetConfig, TFXLNetForSequenceClassification, XLNetTokenizer),
    'xlm':        (XLMConfig, TFXLMForSequenceClassification, XLMTokenizer),
    'roberta':    (RobertaConfig, TFRobertaForSequenceClassification, RobertaTokenizer),
    'distilbert': (DistilBertConfig, TFDistilBertForSequenceClassification, DistilBertTokenizer),
    'albert':     (AlbertConfig, TFAlbertForSequenceClassification, AlbertTokenizer),
}

def classes_from_name(model_name):
    name = model_name.split('-')[0]
    return TRANSFORMER_MODELS[name]

# setup model and tokenizer
(config_class, model_class, tokenizer_class) = classes_from_name(MODEL_NAME)
tokenizer = tokenizer_class.from_pretrained(MODEL_NAME)
model = model_class.from_pretrained(MODEL_NAME)

# construct binary classification dataset
categories = ['alt.atheism', 'comp.graphics']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train',
   categories=categories, shuffle=True, random_state=42)
test_b = fetch_20newsgroups(subset='test',
   categories=categories, shuffle=True, random_state=42)

print('size of training set: %s' % (len(train_b['data'])))
print('size of validation set: %s' % (len(test_b['data'])))
print('classes: %s' % (train_b.target_names))

x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target

train_csv = [(i, text, y_train[i]) for i, text in enumerate(x_train)]
valid_csv = [(i, text, y_test[i]) for i, text in enumerate(x_test)]

def convert_to_tfdataset(csv):
    def gen():
        for ex in csv:
            yield {'idx': ex[0],
                   'sentence': ex[1],
                   'label': str(ex[2])}
    return tf.data.Dataset.from_generator(gen,
        {'idx': tf.int64,
         'sentence': tf.string,
         'label': tf.int64})
trn = convert_to_tfdataset(train_csv)
val = convert_to_tfdataset(valid_csv)

# preprocess datasets
train_dataset = glue_convert_examples_to_features(examples=trn, tokenizer=tokenizer,
                                                  max_length=MAX_SEQ_LEN, task='sst-2',
                                                  label_list=['0', '1'])
valid_dataset = glue_convert_examples_to_features(examples=val, tokenizer=tokenizer,
                                                  max_length=MAX_SEQ_LEN, task='sst-2',
                                                  label_list=['0', '1'])
train_dataset = train_dataset.shuffle(len(train_csv)).batch(BATCH_SIZE).repeat(-1)
valid_dataset = valid_dataset.batch(BATCH_SIZE)

# train model
opt = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=opt, loss=loss, metrics=[metric])
history = model.fit(train_dataset, epochs=1, steps_per_epoch=len(train_csv)//BATCH_SIZE,
                    validation_data=valid_dataset, validation_steps=len(valid_csv)//BATCH_SIZE)

The code above produces the following error:

/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/training_utils.py in standardize_input_data(data, names, shapes, check_batch_axis, exception_prefix)
    529                        'Expected to see ' + str(len(names)) + ' array(s), '
    530                        'but instead got the following list of ' +
--> 531                        str(len(data)) + ' arrays: ' + str(data)[:200] + '...')
    532     elif len(names) > 1:
    533       raise ValueError('Error when checking model ' + exception_prefix +

ValueError: Error when checking model target: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 8 array(s), but instead got the following list of 1 arrays: [<tf.Tensor 'ExpandDims:0' shape=(None, 1) dtype=int64>]...

However, if you set MODEL_NAME to distilbert-base-uncased, everything works.

Other models I've found that do not work in TF2 include xlnet-base-cased. To reproduce, set MODEL_NAME to xlnet-base-cased in the code above; it also throws an exception during the call to model.fit.
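
As a side note, the "Expected to see 8 array(s)" part of the error can be reproduced directly by flattening the model's output structure. The snippet below is my own diagnostic rather than part of the original report; it assumes model is the distilbert-base-multilingual-cased instance loaded above and that the v2.x TF models return a tuple of outputs:

# diagnostic sketch: count the output tensors Keras will try to match against label arrays
import tensorflow as tf
dummy_ids = tf.constant([[101, 102, 103]])  # arbitrary token ids, just to trigger a forward pass
outputs = model(dummy_ids)
print(len(tf.nest.flatten(outputs)))  # prints 8 for the failing checkpoint, 1 for distilbert-base-uncased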

igormis commented 4 years ago

The same error happens to me with distilbert-base-multilingual-cased.

jplu commented 4 years ago

Hello!

I got the same error. After investigating a bit, I found that it occurs because the field output_hidden_states in the configuration file of the distilbert-base-multilingual-cased model is set to true instead of false. As a workaround you can do:

config = DistilBertConfig.from_pretrained("distilbert-base-multilingual-cased", output_hidden_states=False)
model = TFDistilBertForSequenceClassification.from_pretrained("distilbert-base-multilingual-cased", config=config)

And it will work.
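
If it helps, a quick way to verify the workaround (my own check, not from the thread, assuming the model accepts a plain batch of token ids) is to count the flattened outputs again:

# verification sketch: with output_hidden_states=False only the logits tensor is returned,
# so Keras expects a single label array and model.fit proceeds normally
import tensorflow as tf
dummy_ids = tf.constant([[101, 102, 103]])  # arbitrary token ids for a throwaway forward pass
print(len(tf.nest.flatten(model(dummy_ids))))  # expect 1 with the workaround config, 8 without it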

@julien-c or @LysandreJik, maybe it would be better to update the config file in the S3 repo so that it is aligned with the other models. What do you think?

LysandreJik commented 4 years ago

Hi, thank you all for raising this issue and looking into it. As @jplu mentioned, this was an issue with the output_hidden_states field in the configuration files. It was the case for two different checkpoints: distilbert-base-multilingual-cased and distilbert-base-german-cased.

I've updated the files on S3 and could successfully run your script, @amaiya.
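
A quick sanity check against the updated files (my own addition, assuming the v2.x config API) is to reload the configurations and confirm the flag:

# confirm the S3 fix: both checkpoints should now report output_hidden_states=False
from transformers import DistilBertConfig
for name in ("distilbert-base-multilingual-cased", "distilbert-base-german-cased"):
    config = DistilBertConfig.from_pretrained(name)
    print(name, config.output_hidden_states)  # expected: False for both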

amaiya commented 4 years ago

Thanks @jplu and @LysandreJik
Works great now:

# construct toy text classification dataset
categories = ['alt.atheism', 'comp.graphics']
from sklearn.datasets import fetch_20newsgroups
train_b = fetch_20newsgroups(subset='train',
   categories=categories, shuffle=True, random_state=42)
test_b = fetch_20newsgroups(subset='test',
   categories=categories, shuffle=True, random_state=42)
x_train = train_b.data
y_train = train_b.target
x_test = test_b.data
y_test = test_b.target

# train with ktrain interface to transformers
import ktrain
from ktrain import text
t = text.Transformer('distilbert-base-multilingual-cased', maxlen=500,  classes=train_b.target_names)
trn = t.preprocess_train(x_train, y_train)
val = t.preprocess_test(x_test, y_test)
model = t.get_classifier()
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(3e-5, 1)
begin training using onecycle policy with max lr of 3e-05...
Train for 178 steps, validate for 118 steps
178/178 [==============================] - 51s 286ms/step - loss: 0.2541 - accuracy: 0.8816 - val_loss: 0.0862 - val_accuracy: 0.9746