google-research / text-to-text-transfer-transformer

Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"
https://arxiv.org/abs/1910.10683
Apache License 2.0

Predictions elicited from `hf_model.py` do not match those of HuggingFace #463

Closed danyaljj closed 3 years ago

danyaljj commented 4 years ago

Hey there! 👋

TL;DR: I have a t5-small model fine-tuned on Natural Questions. I get its predictions once using hf_model.py and once using plain HuggingFace code. The outputs are different (and the outputs from HuggingFace look more reasonable).

This is a thread on using hf_model.py; I know this code is not well tested. Sharing these observations here in case they help you improve this model.

  1. When I try to make predictions using hf_model:

    import functools
    import t5
    import torch
    import transformers

    if torch.cuda.is_available():
        device = torch.device("cuda")
    else:
        device = torch.device("cpu")

    path = "/home/danielk/small_standard/pytorch_model/"
    model = t5.models.HfPyTorchModel(path, path, device)

    # Generate some predictions
    inputs = [
        "Who is the US president? ",
        "How many states are there in USA? ",
        "who got the first nobel prize in physics?",
        "when is the next deadpool movie being released?",
        "which mode is used for short wave broadcast service?",
    ]
    model.predict(
        inputs,
        sequence_length={"inputs": 32},
        batch_size=2,
        output_file=f"{path}/example_predictions.txt",
    )
    print("done making the predictions . . . ")

and here is the output:

2020-10-21 21:22:46.322911: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-10-21 21:22:46.324427: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-10-21 21:22:46.324456: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      
/home/danielk/text-to-text-transfer-transformer/t5/models/hf_model.py:547: UserWarning: Creating resources inside a function passed to Dataset.map() is not supported. Create each resource outside the function, and capture it inside the function to use it.
  dataset = dataset.map(
INFO:absl:Who is the US president? 
  -> Who Who Donald Donald Donald Donald Donald Donald Donald Donald Donald Donald Donald Donald Donald Donald Donald Donald Donald
INFO:absl:How many states are there in USA? 
  -> 5 states
INFO:absl:who got the first nobel prize in physics?
  -> Wilhelm Conrad Röntgen
INFO:absl:when is the next deadpool movie being released?
  -> 2018 2018 2018 2018 2018 2018 2018
INFO:absl:which mode is used for short wave broadcast service?
  -> on on on on on on on on
done making the predictions . . . 
  2. Now I try to use the same model using HF code:
    
    from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration

    path = "/home/danielk/small_standard/pytorch_model"
    model = T5ForConditionalGeneration.from_pretrained(path)
    tokenizer = T5Tokenizer.from_pretrained(path)
    model.eval()

    def run_model(input_string, **generator_args):
        input_ids = tokenizer.encode(input_string, return_tensors="pt")
        res = model.generate(input_ids, **generator_args)
        tokens = [tokenizer.decode(x) for x in res]
        print(tokens)

    run_model("how many states does the US has? ")
    run_model("who is the US president?")
    run_model("who got the first nobel prize in physics?")
    run_model("when is the next deadpool movie being released?")
    run_model("which mode is used for short wave broadcast service?")
    run_model("the south west wind blows across nigeria between?")


Here is the output: 

2020-10-21 21:14:44.634221: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2020-10-21 21:14:44.634259: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
['50']
['Donald Trump']
['Wilhelm Conrad Röntgen']
['December 18, 2018']
['TCP port 25']
['the Nigerian and Pacific Oceans']

craffel commented 4 years ago

I am guessing the issue is that it is not automatically loading your latest trained checkpoint. It tries to automatically load a checkpoint on initialization (https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/hf_model.py#L200) using load_latest_checkpoint (https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/hf_model.py#L270), but the checkpoint format I implemented for the HF model is probably different from whatever format is living in your "/home/danielk/small_standard/pytorch_model/" directory. Do you want to take a look at the checkpoint saving/loading logic and the naming conventions and confirm whether they match the checkpoint format you're using? If not, I would be open to changing the checkpoint convention so that it matches something more standard. I just made up a format that I thought was reasonable; maybe there is a more standard format/naming convention we could use?
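
For anyone hitting the same mismatch, here is an untested sketch of one possible workaround, assuming the constructor's automatic load_latest_checkpoint call is what misses (or overrides) the fine-tuned weights loaded by from_pretrained. The directory names are placeholders, and the final save_checkpoint(0) only writes a checkpoint in the wrapper's own convention so that later runs can resume from it.

    import torch
    import t5

    # Standard Hugging Face export of the fine-tuned model (contains pytorch_model.bin).
    hf_dir = "/home/danielk/small_standard/pytorch_model/"
    # Fresh, empty directory for the wrapper's own checkpoints (hypothetical path).
    model_dir = "/tmp/t5_wrapper_ckpts/"

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # from_pretrained() inside the constructor loads the fine-tuned HF weights;
    # pointing model_dir at an empty directory means load_latest_checkpoint()
    # finds nothing and cannot replace them with an unrelated checkpoint.
    model = t5.models.HfPyTorchModel(hf_dir, model_dir, device)

    # Optionally write step 0 in the wrapper's checkpoint convention for later runs.
    model.save_checkpoint(0)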

ghost commented 3 years ago

I believe he used this checkpoint: https://huggingface.co/t5-small. This is also my issue; it would be really helpful if you could assist in solving it. Thanks.

ghost commented 3 years ago

I am not sure how to use load_latest_checkpoint; could you please tell me how I can add this? Thanks.

ghost commented 3 years ago

I looked at load_checkpoint and save_checkpoint; this seems to be normal PyTorch code, and to me the error does not look like it is due to a different checkpoint format. Could you have a closer look please? Thanks.

rabeehkarimimahabadi commented 3 years ago

Hi Julia, if you are using the checkpoint released in the HuggingFace repo, the results are expected because that model was not fine-tuned on this task. I changed the evaluation to a task that T5 is trained on, to check how well the results match across the two implementations. Here are the results of the two models, and they match well:

1) HuggingFace T5 model

from transformers import T5Config, T5Tokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
model.eval()

def run_model(input_string, **generator_args):
    input_ids = tokenizer.encode(input_string, return_tensors="pt")
    res = model.generate(input_ids, **generator_args)
    tokens = [tokenizer.decode(x) for x in res]
    print(tokens)

run_model("translate English to German: how many states does the US has? ")
run_model("translate English to German: who is the US president?")
run_model("translate English to German: who got the first nobel prize in physics?")
run_model("translate English to German: when is the next deadpool movie being released?")
run_model("translate English to German: which mode is used for short wave broadcast service?")
run_model("translate English to German: the south west wind blows across nigeria between?")

Results:

['Wie viele Staaten haben die USA?']
['Wer ist der US-Präsident?']
['wer hat den ersten Nobelpreis in der Physik erhalten?']
['wann wird der nächste Deadpool-Film veröffentlicht?']
['Welchen Modus wird für Kurzwellenstrahlung verwendet?']
['der Südwestwind bläst durchnigeria zwischen?']

2) t5 HfPyTorchModel wrapper (hf_model.py)

import functools
import t5
import torch
import transformers
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
model = t5.models.HfPyTorchModel("t5-small", "/tmp/hft5/", device)
# Generate some predictions
inputs = [
    "translate English to German: how many states does the US has? ",
    "translate English to German: who is the US president?",
    "translate English to German: who got the first nobel prize in physics?",
    "translate English to German: when is the next deadpool movie being released?",
    "translate English to German: which mode is used for short wave broadcast service?",
    "translate English to German: the south west wind blows across nigeria between?",
]
model.predict(
    inputs,
    sequence_length={"inputs": 32},
    batch_size=2,
)

Results:

/usr/local/lib/python3.6/dist-packages/t5/models/hf_model.py:549: UserWarning: Creating resources inside a function passed to Dataset.map() is not supported. Create each resource outside the function, and capture it inside the function to use it.
  num_parallel_calls=tf.data.experimental.AUTOTUNE,
INFO:absl:translate English to German: how many states does the US has? 
  -> Wie viele Staaten haben die USA?
INFO:absl:translate English to German: who is the US president?
  -> Wer ist der US-Präsident?
INFO:absl:translate English to German: who got the first nobel prize in physics?
  -> Wer hat den ersten Nobelpreis in der Physik erhalten?
INFO:absl:translate English to German: when is the next deadpool movie being released?
  -> ob der nächste Deadpool-Film veröffentlicht wird?
INFO:absl:translate English to German: which mode is used for short wave broadcast service?
  -> Welchen Modus wird für Kurzwellenstrahlung verwendet?
INFO:absl:translate English to German: the south west wind blows across nigeria between?
  -> Der Südwestwind bläst zwischennigeria?

rabeehkarimimahabadi commented 3 years ago

I tried to fine-tune the released PyTorch code on WMT; the BLEU score I was getting was around 1 after 50000 steps. I am pretty sure there is a bug in the data processing pipeline and that it does not match the HuggingFace model's decoding.
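
One quick way to rule out the scoring itself is to feed a couple of hand-written prediction/reference pairs through the same BLEU implementation the library ships. A minimal sketch, assuming t5.evaluation.metrics.bleu is the metric wired into the WMT task; the sentences below are made up.

    from t5.evaluation import metrics

    # Made-up reference/prediction pair, just to confirm the metric returns a sane score.
    targets = ["Wie viele Bundesstaaten haben die USA?"]
    predictions = ["Wie viele Staaten haben die USA?"]

    print(metrics.bleu(targets, predictions))  # expect a dict like {'bleu': <score>}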

craffel commented 3 years ago

We verified in the past that the results for translation are roughly the same: https://github.com/huggingface/transformers/issues/5543

rabeehkarimimahabadi commented 3 years ago

Hi Colin, is this with the PyTorch version? In that discussion it seems to be the TensorFlow version. Thanks.

craffel commented 3 years ago

It's comparing the Mesh Tensorflow version to the Hugging Face PyTorch version.

rabeehkarimimahabadi commented 3 years ago

Hi, sorry, I think there is a misunderstanding: I evaluated the HF PyTorch version, i.e. the model which wraps the HuggingFace model. Thanks.

craffel commented 3 years ago

Yes, we have verified that that model gives the same outputs as the mesh tensorflow version.

rabeehkarimimahabadi commented 3 years ago

Hi, thanks for the response. I still think the discussion in huggingface/transformers#5543 compares the Mesh TensorFlow version with the HuggingFace model, whereas I evaluated this model: https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/hf_model.py Thanks

craffel commented 3 years ago

The model in https://github.com/google-research/text-to-text-transfer-transformer/blob/master/t5/models/hf_model.py is the hugging face pytorch model. The code in that file just calls out to the hugging face library.

rabeehkarimimahabadi commented 3 years ago

Hi, the encoding/decoding in this model's data-processing pipeline does not match the HuggingFace one, and I am thinking this might be the cause of the difference.

craffel commented 3 years ago

Can you be specific about any differences you have found? The encoding and decoding both use the same sentencepiece model.
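
One concrete way to check this claim is to encode the same string with the vocabulary the t5 pipeline uses and with the HuggingFace tokenizer and compare the ids. A minimal sketch, assuming t5.data.get_default_vocabulary() is the vocabulary the predict pipeline uses; depending on the transformers version, the HF tokenizer may append a trailing </s> id that the vocabulary's encode does not.

    import t5
    from transformers import T5Tokenizer

    text = "translate English to German: who is the US president?"

    # SentencePiece vocabulary used by the t5 data pipeline (assumed default vocabulary).
    vocab = t5.data.get_default_vocabulary()
    t5_ids = vocab.encode(text)

    # HuggingFace tokenizer for the same checkpoint.
    hf_ids = T5Tokenizer.from_pretrained("t5-small").encode(text)

    print(t5_ids)
    print(hf_ids)  # may have a trailing EOS id (1) depending on the transformers version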

rabeehkarimimahabadi commented 3 years ago

Hi, I am not sure which part exactly is different and causing it; this requires a deeper look into the code for debugging. I just tried to run your HF model on the WMT dataset from scratch.

rabeehkarimimahabadi commented 3 years ago

Hi, I investigated the code more. As I guessed, the way you encode the inputs and decode the final outputs does not match the HuggingFace model, resulting in poor performance. Below, please find how I corrected the predict function in your HF model:

dataset_len = len(inputs)
dataset = tf.data.Dataset.from_tensor_slices(inputs)

import numpy as np
from transformers import T5Tokenizer

path = "/home/rabeeh/pl/data/t5-small"
max_length = sequence_length["inputs"]
tokenizer = T5Tokenizer.from_pretrained(path)
dataset = tfds.as_numpy(dataset)

def data_collator(batch):
    # Tokenize with the HuggingFace tokenizer instead of the tf.data pipeline.
    batch = np.stack([x.decode("utf-8") for x in batch])
    input_encodings = tokenizer.batch_encode_plus(
        batch, pad_to_max_length=True, max_length=max_length, return_tensors="pt")
    return input_encodings

class IterableDataset(torch.utils.data.IterableDataset):
    def __init__(self, iterable):
        super().__init__()
        self.iterable = iterable

    def __iter__(self):
        return self.iterable

dataset = IterableDataset(iter(dataset))
num_batches = int(dataset_len / batch_size)
loader = torch.utils.data.DataLoader(dataset, batch_size=batch_size, collate_fn=data_collator)
for _, batch in enumerate(itertools.islice(loader, num_batches)):
    predicted_tokens = self._model.generate(batch["input_ids"].cuda(), **generate_kwargs)
    predictions = [tokenizer.decode(ids) for ids in predicted_tokens]
    print(predictions)
