huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Training TFBertForQuestionAnswering on custom SquadV1 data #4397

Closed yonatanbitton closed 4 years ago

yonatanbitton commented 4 years ago

Hello.

TL;DR: Is there any minimal code that trains a TFBertForQuestionAnswering on custom SQuAD v1 data (not loaded via nlp.load_dataset)?

I've tried several approaches and run into problems with each.

This is the minimal code I'm trying to run:

    import argparse

    import tensorflow as tf
    from transformers import (BertTokenizer, SquadV1Processor,
                              TFBertForQuestionAnswering,
                              squad_convert_examples_to_features)

    args = argparse.Namespace(**bert_config)  # bert_config is defined elsewhere
    # NOTE: the tokenizer checkpoint should match the model checkpoint used below
    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

    processor = SquadV1Processor()
    # processor = SquadV2Processor()
    examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
    train_dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        doc_stride=args.doc_stride,
        max_query_length=args.max_query_length,
        is_training=True,
        return_dataset="tf"
    )

    model = TFBertForQuestionAnswering.from_pretrained("bert-base-cased")
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
    opt = tf.keras.optimizers.Adam(learning_rate=3e-5)

    model.compile(optimizer=opt,
                  loss={'start_position': loss_fn, 'end_position': loss_fn},
                  loss_weights={'start_position': 1., 'end_position': 1.},
                  metrics=['accuracy'])

    # Now let's train our model
    try:
        history = model.fit(train_dataset, epochs=1, steps_per_epoch=3)
    except Exception as ex:
        print(f"Failed using fit, {ex}")
        history = model.fit_generator(train_dataset, epochs=1, steps_per_epoch=3)

The current errors are as follows. With fit:

    x = standardize_function(x)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 660, in standardize_function
    standardize(dataset, extract_tensors_from_dataset=False)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2360, in _standardize_user_data
    self._compile_from_inputs(all_inputs, y_input, x, y)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2580, in _compile_from_inputs
    target, self.outputs)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_utils.py", line 1341, in cast_if_floating_dtype_and_mismatch
    if target.dtype != out.dtype:
AttributeError: 'str' object has no attribute 'dtype'

With fit_generator:

ValueError: Unknown entries in loss dictionary: ['start_position', 'end_position']. Only expected following keys: ['output_1', 'output_2']

The dataset returned by squad_convert_examples_to_features is of type tensorflow.python.data.ops.dataset_ops.FlatMapDataset, and I'm not sure how to change its columns from start_position to output_1 and end_position to output_2. I've also asked about it on Stack Overflow: https://stackoverflow.com/questions/61830361/how-the-change-column-name-in-tensorflow-flatmapdataset
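For illustration, a minimal sketch of one way such a renaming can be done with tf.data, assuming the dataset yields (features, labels) pairs whose label dict is keyed by 'start_position' and 'end_position':

    # Hedged sketch: remap the label dict keys to the names Keras expects.
    def rename_labels(features, labels):
        return features, {"output_1": labels["start_position"],
                          "output_2": labels["end_position"]}

    train_dataset = train_dataset.map(rename_labels)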

I've seen the Colab tutorial for the nlp package. It has simple code:

    import nlp
    import tensorflow as tf
    from transformers import BertTokenizerFast, TFBertForQuestionAnswering

    train_tf_dataset = nlp.load_dataset('squad', split="train")
    tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

    def convert_to_tf_features(example_batch):
        # Tokenize contexts and questions (as pairs of inputs)
        input_pairs = list(zip(example_batch['context'], example_batch['question']))
        encodings = tokenizer.batch_encode_plus(input_pairs, pad_to_max_length=True)

        # Compute start and end tokens for labels using the fast tokenizers' alignment methods.
        start_positions, end_positions = [], []
        for i, (context, answer) in enumerate(zip(example_batch['context'], example_batch['answers'])):
            start_idx, end_idx = get_correct_alignement(context, answer)
            start_positions.append([encodings.char_to_token(i, start_idx)])
            end_positions.append([encodings.char_to_token(i, end_idx - 1)])

        if start_positions and end_positions:
            encodings.update({'start_positions': start_positions,
                              'end_positions': end_positions})
        return encodings

    train_tf_dataset = train_tf_dataset.map(convert_to_tf_features, batched=True)

    columns = ['input_ids', 'token_type_ids', 'attention_mask', 'start_positions', 'end_positions']
    train_tf_dataset.set_format(type='tensorflow', columns=columns)
    features = {x: train_tf_dataset[x] for x in columns[:3]}
    labels = {"output_1": train_tf_dataset["start_positions"]}
    labels["output_2"] = train_tf_dataset["end_positions"]
    tfdataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)

    # Let's load a pretrained TF2 BERT model and a simple optimizer
    model = TFBertForQuestionAnswering.from_pretrained("bert-base-cased")
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
    opt = tf.keras.optimizers.Adam(learning_rate=3e-5)
    model.compile(optimizer=opt,
                  loss={'output_1': loss_fn, 'output_2': loss_fn},
                  loss_weights={'output_1': 1., 'output_2': 1.},
                  metrics=['accuracy'])

    # Now let's train our model
    model.fit(tfdataset, epochs=1, steps_per_epoch=3)
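The snippet above calls get_correct_alignement, which isn't shown here. For reference, a hedged sketch of such a helper, assuming SQuAD-style answer dicts with 'text' and 'answer_start' lists (the notebook's actual helper may differ in details):

    def get_correct_alignement(context, answer):
        # SQuAD answer_start offsets are occasionally off by one or two
        # characters, so probe nearby offsets until the gold text matches.
        gold_text = answer['text'][0]
        start_idx = answer['answer_start'][0]
        end_idx = start_idx + len(gold_text)
        for shift in (0, -1, -2):
            s, e = start_idx + shift, end_idx + shift
            if s >= 0 and context[s:e] == gold_text:
                return s, e
        raise ValueError("Answer span could not be aligned with the context")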

I can't do the same as this code, because the dataset there is of type nlp.arrow_dataset.Dataset. I've tried to convert my tensorflow.python.data.ops.dataset_ops.FlatMapDataset to an nlp.arrow_dataset.Dataset (and then mimic the last snippet), but I didn't find a suitable way.

Edit: I've managed to rename the outputs in the FlatMapDataset to output_1 and output_2, and now I receive the following error:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  logits and labels must have the same first dimension, got logits shape [384,1] and labels shape [1]
     [[node loss/output_1_loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at /yonatab/ZeroShot/transformers_experiments/src/minimal_example_for_git.py:53) ]]
     [[Reshape_820/_546]]
  (1) Invalid argument:  logits and labels must have the same first dimension, got logits shape [384,1] and labels shape [1]
     [[node loss/output_1_loss/SparseSoftmaxCrossEntropyWithLogits/SparseSoftmaxCrossEntropyWithLogits (defined at /yonatab/ZeroShot/transformers_experiments/src/minimal_example_for_git.py:53) ]]

How can I create a TF dataset with squad_convert_examples_to_features (with return_dataset="tf") and train a TF model on it?
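The shapes in the error above (logits [384, 1] against labels [1]) suggest the model is receiving single unbatched examples, with each [384]-long input_ids vector treated as a batch of 384 length-1 sequences. A hedged sketch of a possible fix, batching the dataset (after the rename_labels map sketched earlier) before calling fit:

    # Hedged sketch: batch the dataset so BERT sees [batch, seq_len] inputs.
    train_dataset = train_dataset.map(rename_labels).batch(8)
    history = model.fit(train_dataset, epochs=1, steps_per_epoch=3)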

Thanks

yonatanbitton commented 4 years ago

I managed to get it working somehow, but I'm sure this is not the way it's supposed to work, and it won't scale well to large datasets. I would be happy to know if there is a better way (one possible direction is sketched at the end of this comment).

What worked:

  1. squad_convert_examples_to_features(return_dataset=False) to get the features
  2. Creating a dictionary of features and labels, where each item is a list of TensorFlow tensors obtained via tf.convert_to_tensor
  3. Constructing the dataset with tfdataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)
  4. Training with the fit_generator method (fit fails)

Full code:

    import traceback

    import tensorflow as tf
    from transformers import (BertTokenizer, SquadV1Processor,
                              TFBertForQuestionAnswering,
                              squad_convert_examples_to_features)

    # NOTE: the tokenizer checkpoint should match the model checkpoint used below
    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

    processor = SquadV1Processor()
    # processor = SquadV2Processor()
    # args is the same Namespace as in the previous snippet
    examples = processor.get_train_examples(args.data_dir, filename=args.train_file)
    # With the default return_dataset=False this returns a list of SquadFeatures
    train_dataset = squad_convert_examples_to_features(
        examples=examples,
        tokenizer=tokenizer,
        max_seq_length=args.max_seq_length,
        doc_stride=args.doc_stride,
        max_query_length=args.max_query_length,
        is_training=True
    )

    def create_features_and_labels_tf_tensors_from_dataset(train_dataset):
        all_input_ids = []
        all_token_type_ids = []
        all_attention_mask = []
        all_start_pos = []
        all_end_pos = []
        for ex in train_dataset:  # each ex is a SquadFeatures instance
            all_input_ids.append(ex.input_ids)
            all_token_type_ids.append(ex.token_type_ids)
            all_attention_mask.append(ex.attention_mask)
            all_start_pos.append(ex.start_position)
            all_end_pos.append(ex.end_position)
        all_input_ids_tensor = tf.convert_to_tensor(all_input_ids)
        all_token_type_ids_tensor = tf.convert_to_tensor(all_token_type_ids)
        all_attention_mask_tensor = tf.convert_to_tensor(all_attention_mask)
        all_start_pos_tensor = tf.convert_to_tensor(all_start_pos)
        all_end_pos_tensor = tf.convert_to_tensor(all_end_pos)
        features = {'input_ids': all_input_ids_tensor, 'token_type_ids': all_token_type_ids_tensor,
                    'attention_mask': all_attention_mask_tensor}
        labels = {"output_1": all_start_pos_tensor, 'output_2': all_end_pos_tensor}
        return features, labels

    features, labels = create_features_and_labels_tf_tensors_from_dataset(train_dataset)
    tfdataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)

    model = TFBertForQuestionAnswering.from_pretrained("bert-base-cased")
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(reduction=tf.keras.losses.Reduction.NONE, from_logits=True)
    opt = tf.keras.optimizers.Adam(learning_rate=3e-5)

    model.compile(optimizer=opt,
                  loss={'output_1': loss_fn, 'output_2': loss_fn},
                  loss_weights={'output_1': 1., 'output_2': 1.},
                  metrics=['accuracy'])

    # Now let's train our model
    try:
        history = model.fit(tfdataset, epochs=1, steps_per_epoch=3)
        print(f'Success with fit')
    except Exception as ex:
        traceback.print_exc()
        print(f"Failed using fit, {ex}")
        history = model.fit_generator(tfdataset, epochs=1, steps_per_epoch=3)
        print(f'Success with fit_generator')
    print("Done")

Error message for fit:

File "/home/ec2-user/yonatab/ZeroShot/transformers_experiments/src/minimal_example_for_git.py", line 73, in main
    history = model.fit(tfdataset, epochs=1, steps_per_epoch=3)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 819, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 235, in fit
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 593, in _process_training_inputs
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 706, in _process_inputs
    use_multiprocessing=use_multiprocessing)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/data_adapter.py", line 702, in __init__
    x = standardize_function(x)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py", line 660, in standardize_function
    standardize(dataset, extract_tensors_from_dataset=False)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2360, in _standardize_user_data
    self._compile_from_inputs(all_inputs, y_input, x, y)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py", line 2580, in _compile_from_inputs
    target, self.outputs)
  File "/home/ec2-user/anaconda3/envs/yonatan_env_tf2/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_utils.py", line 1341, in cast_if_floating_dtype_and_mismatch
    if target.dtype != out.dtype:
AttributeError: 'str' object has no attribute 'dtype'
Failed using fit, 'str' object has no attribute 'dtype'
WARNING:tensorflow:From /home/ec2-user/yonatab/ZeroShot/transformers_experiments/src/minimal_example_for_git.py:78: Model.fit_generator (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
Please use Model.fit, which supports generators.

It also fails when I try to pass validation_data to the fit function.
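Regarding the scaling concern above, a hedged sketch of a lazier construction with tf.data.Dataset.from_generator, streaming the SquadFeatures list instead of materializing it as full tensors (a suggestion, not something verified in this thread):

    import tensorflow as tf

    def gen():
        # Stream one (features, labels) pair per SquadFeatures entry.
        for ex in train_dataset:
            yield ({"input_ids": ex.input_ids,
                    "token_type_ids": ex.token_type_ids,
                    "attention_mask": ex.attention_mask},
                   {"output_1": ex.start_position, "output_2": ex.end_position})

    tfdataset = tf.data.Dataset.from_generator(
        gen,
        output_types=({"input_ids": tf.int32,
                       "token_type_ids": tf.int32,
                       "attention_mask": tf.int32},
                      {"output_1": tf.int64, "output_2": tf.int64}),
    ).batch(8)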

yonatanbitton commented 4 years ago

I think it's a bug; I'm closing this and opening a separate bug issue.