google-research / t5x


Exporting models #198

Open peregilk opened 2 years ago

peregilk commented 2 years ago

Are there any instructions available on how to export the T5x models to other formats, for instance to PyTorch (or to another format that then can be exported to PyTorch)? I am trying to export a finetuned byT5-model.

peregilk commented 2 years ago

@adarob Just following up on this. I found it relatively easy and convenient to train various versions of T5/mT5/byT5 models using this framework. The models train extremely fast and stably on a v3-8.

I have, however, investigated multiple ways of exporting the models, with absolutely no luck.

Ultimately I am trying to combine these models with other PyTorch models. Could you suggest the recommended way of saving/exporting T5X checkpoints/models? Would it, for instance, be to first save/export to Mesh TensorFlow?

adarob commented 2 years ago

See https://github.com/google-research/t5x/blob/main/t5x/checkpoint_importer.py for how we go from MTF to T5X checkpoints. It should be fairly easy to reverse this!
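
(For reference, a minimal sketch of inspecting a T5X checkpoint's parameter tree, which is the usual first step before writing any converter. It assumes the t5x and flax packages are installed; the path is a placeholder and the exact tree layout can vary between versions:)

from flax.traverse_util import flatten_dict
from t5x import checkpoints

# Load the checkpoint as a nested dict; the model weights live under "target".
# The path is a placeholder.
ckpt = checkpoints.load_t5x_checkpoint("/path/to/t5x_checkpoint_dir")
params = ckpt["target"]

# Print every parameter name and shape, which makes it easier to map the
# T5X names onto another framework's layout.
for name, value in flatten_dict(params, sep="/").items():
    print(name, value.shape)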

peregilk commented 2 years ago

@adarob Thanks for the answer. If I understand you correctly, there is no such converter today?

And a follow-up question: if I were able to rewrite this script and convert the checkpoints to MTF, what is the process from there? I understand that you would then have to try using t5_mesh_transformer to save the checkpoint. However, this script currently seems to have issues, and even the Deploy Notebook does not run. There are several unanswered posts about this in the T5 repo.

If I understand correctly, the output from t5_mesh_transformer would here be what is referred to as the old TensorFlow format? Or do I misunderstand something? I know there are some HuggingFace scripts for converting from the old TensorFlow format, like this one for byT5: https://github.com/huggingface/transformers/blob/master/src/transformers/models/byt5/convert_byt5_original_tf_checkpoint_to_pytorch.py

This seems like a very complicated path. Has no one ever exported a T5X model to PyTorch or HuggingFace Transformers?

adarob commented 2 years ago

Correct, there is no t5x->mtf converter. I just put in a fix for the deploy Colab (https://github.com/google-research/text-to-text-transfer-transformer/pull/974).

We have a more direct way to convert from T5X to SavedModel if that's what you're interested in, but we haven't prioritized open sourcing it thus far.

peregilk commented 2 years ago

There are still some issues with the Deploy Colab. I have added a comment to that issue.

My main issue here is finding a way to import the T5X models into HuggingFace Transformers. It seems like going through SavedModel is an option, but I have not tested the last conversion step yet. I'll do that as soon as I get the Deploy Notebook to run.

A tool for converting T5X to SavedModel would be awesome.

peregilk commented 2 years ago

@adarob Following up my own post here. First let me say that I am extremely impressed with T5X for training T5 on the v3-8. It seems very stable and incredibly fast. IMO it is the best platform out there for running experiments on the very flexible T5 architecture.

I am, however, dependent on merging these models with other models in PyTorch/Transformers, and the conversion process seems very awkward.

If you were able to release the tool for converting T5X to the SavedModel format, I see two possible paths:

a) Convert the SavedModel to a TF checkpoint, so that it can be imported using transformers-cli. I have not yet found a good tool for doing SavedModel -> tf.ckpt. I assume loading it in TensorFlow and saving it would work? (A rough sketch of what I mean is at the end of this comment.)

b) Convert SavedModel -> ONNX and then ONNX -> PyTorch.

Am I missing something very obvious here? Are there simpler paths?
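
(For path a), an untested sketch of what loading the SavedModel in TensorFlow and re-saving it as a regular checkpoint might look like; the paths are placeholders, and it is an open question whether the resulting variable names are anything transformers-cli can consume:)

import tensorflow as tf

# Untested sketch: load an exported SavedModel and write its variables back
# out as a plain TF checkpoint. Paths are placeholders.
loaded = tf.saved_model.load("/path/to/saved_model")
ckpt = tf.train.Checkpoint(model=loaded)
ckpt.save("/path/to/tf_ckpt/ckpt")

# Caveat: the variables in this checkpoint are named after the SavedModel's
# object graph rather than the original MTF variable names, so the existing
# HF conversion scripts would likely need a renaming step on top of this.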

stefan-it commented 2 years ago

Hi @peregilk ,

I have the same problem now, as I've pre-trained a T5 model from scratch, and I will have a look at the Notebook now.

Would be a cool contribution to Transformers as well to have a conversion script from T5X (/cc @patrickvonplaten and @patil-suraj) :)

versae commented 2 years ago

Related: https://github.com/huggingface/transformers/issues/15467 And an unsorted repo @patrickvonplaten put together which I still have to test: https://github.com/patrickvonplaten/t5-mtf-to-hf-converter

patrickvonplaten commented 2 years ago

Agree very much! It should be pretty easy to adapt the following function: https://github.com/huggingface/transformers/blob/b87c044c79a408c0a1e7f7b046b5b4ce999c2d0e/src/transformers/models/t5/modeling_t5.py#L73 to make it work from t5x -> HF PyTorch. @versae , @stefan-it - would you be interested in opening a PR for this? More than happy to help you guys on it

stefan-it commented 2 years ago

I'm currently working on it, will give some updates here :)

peregilk commented 2 years ago

@stefan-it Any luck with the T5X export?

stefan-it commented 2 years ago

Hi @peregilk , conversion is working, but the model performance is not good (I've tried it with some of the T5X checkpoints and with my own models).

Here's a draft: https://gist.github.com/stefan-it/30e4998ef159f33696e377a46f699d9f

Maybe someone could also have a look at it and find the error!

peregilk commented 2 years ago

@stefan-it Great. I will try this on my own models, but I guess there is still an error here if the performance is bad. Any ideas on how to iron out the bugs here, @adarob or @patrickvonplaten?

If there are ways of going from MTF to Transformers, maybe a good test case is going from MTF -> Transformers and then comparing this with the result we get by going from MTF -> T5X (this script) and then T5X -> Transformers using your script, Stefan?

patrickvonplaten commented 2 years ago

First step would be to go through the modeling code line-by-line to see where the differences are

peregilk commented 2 years ago

@stefan-it I thought I should take a look at this and start by looking at the differences between converting through the paths described above. But first: have you tested this on different model sizes, Stefan, and do they all show bad performance? Have you made any changes/fixes to this lately?

peregilk commented 2 years ago

@patrickvonplaten @stefan-it I have started debugging this. I have created a repo at pere/test-t5-small. It contains some of the converted models for convenience. Unfortunately, I am not 100% sure that the MTF t5-small I am pulling from the official Google T5 bucket is the same one that was used for creating the HuggingFace t5-small. However, it seems very clear that there is a problem with the layer decoder.block.0.layer.0.SelfAttention.relative_attention_bias.weight. This could explain the behaviour.

You can also see this by running this code directly:

from transformers import T5ForConditionalGeneration
google_model = T5ForConditionalGeneration.from_pretrained("t5-small", from_flax=True)
converted_model = T5ForConditionalGeneration.from_pretrained("pere/test-t5-small", from_flax=True)

You will see that you are getting the following error for the converted model:

Some weights of the Flax model were not used when initializing the PyTorch model T5ForConditionalGeneration:
 ['decoder.block.0.layer.0.SelfAttention.relative_attention_bias']

This layer is then randomly initialized. I see there is custom code for this layer. Maybe you know what is happening here, @stefan-it?

There is also another trivial t5_1_0 vs t5_1_1 issue that is described on the model card. But we can fix that later.

patrickvonplaten commented 2 years ago

Running this command:

from transformers import T5ForConditionalGeneration
google_model = T5ForConditionalGeneration.from_pretrained("t5-small", from_flax=True)

looks correct to me, and indeed pere/test-t5-small seems to be missing a weight. Where does this model come from? Was it converted from T5X?

stefan-it commented 2 years ago

Hi @peregilk , I haven't tried to convert different model sizes.

But I will look at the conversion script again this week!

peregilk commented 2 years ago

@patrickvonplaten I did convert the model from gs://t5-data/pretrained_models/small/model.ckpt-1000000. If I understand correctly, this is the "official" t5-small model, but I am not really sure that it is exactly the same model that is used as the basis for "t5-small" on HuggingFace.

This is a TF checkpoint (MTF), so I first converted it to T5X using the original MTF->T5X conversion script. I would be very surprised if converting an original T5X checkpoint did not give the same result.

The reason for starting with the TF checkpoint is that there already exists a script for converting TF checkpoints to HuggingFace Flax. The plan was to also do this and then compare the results layer by layer. However, I discovered some more obvious errors along the way. It is probably best to fix those first.

peregilk commented 2 years ago

@patrickvonplaten Following up my own post. I did try to directly convert some of the t5x-models that are in the same bucket, and I am getting the same error on the same layer.

patrickvonplaten commented 2 years ago

Looking into the TF -> PT conversion now :-)

patrickvonplaten commented 2 years ago

@peregilk, I cannot reproduce this error. When I do the conversion as explained here: https://github.com/huggingface/transformers/pull/16328 (see comment in the added script), there are no weights missing. Note that the config.json is just copied from https://huggingface.co/t5-small

Can you take a look?

peregilk commented 2 years ago

@patrickvonplaten My "missing weights" appear when converting from the T5X version (i.e. the Flax implementation of T5) to the HuggingFace Flax version. The data format used in the TensorFlow implementation of T5 is different from the format used by the Flax implementation (called T5X).

The reason I started with a T5 checkpoint (aka MTF), converted it to T5X and then to HuggingFace Flax, was mainly so that the models could be compared.

peregilk commented 2 years ago

I have now converted the same model in two ways:

a) pere/test-t5-small is converted using the internal MTF->T5X conversion script and then the script from the gist that @stefan-it published.

b) pere/test-t5-small-direct is converted using the script that @patrickvonplaten just checked in for converting MTF checkpoints to HuggingFace.

It is now possible to compare the models layer by layer, and I have made a small Colab for doing that. Running it will show that only two of the layers are different. All the rest of the layers are identical! We should therefore be pretty close to getting a working T5X conversion script:

https://colab.research.google.com/gist/peregilk/0d044d652e58352694f16b7107870b3c/compare-models.ipynb
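
(A minimal sketch of that kind of layer-by-layer comparison, assuming both repos load as T5ForConditionalGeneration; drop from_flax=True for a repo that already contains PyTorch weights:)

import torch
from transformers import T5ForConditionalGeneration

# Load the two converted checkpoints mentioned above.
model_a = T5ForConditionalGeneration.from_pretrained("pere/test-t5-small", from_flax=True)
model_b = T5ForConditionalGeneration.from_pretrained("pere/test-t5-small-direct", from_flax=True)

state_a, state_b = model_a.state_dict(), model_b.state_dict()

# Report any parameter that is missing or whose values differ.
for name, tensor in state_a.items():
    if name not in state_b:
        print("missing in b:", name)
    elif not torch.allclose(tensor, state_b[name], atol=1e-4):
        print("differs:", name)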

stefan-it commented 2 years ago

Ah, I see, thanks for that comparison script!

Problem is here:

# Only for layer 0:
t5x_encoder_rel_embedding = t5x_model["target"]["encoder"]["relpos_bias"]["rel_embedding"]
x, y = t5x_encoder_rel_embedding.shape

# Assigning
flax_model.params["encoder"]["block"]["0"]["layer"]["0"]["SelfAttention"]["relative_attention_bias"]["embedding"] = t5x_encoder_rel_embedding.reshape(y, x)

Initially I was very sceptical about this reshaping 😅 so I'm currently looking into it :)

stefan-it commented 2 years ago

@peregilk I forgot to use the embedding key in a previous version of the conversion script. I fixed it here:

https://gist.github.com/stefan-it/30e4998ef159f33696e377a46f699d9f#file-convert_t5x_checkpoint_to_flax-py-L115

Now the following warning is gone:

Some weights of the model checkpoint at ./exported were not used when initializing FlaxT5Model: {('decoder', 'block', '0', 'layer', '0', 'SelfAttention', 'relative_attention_bias')}
peregilk commented 2 years ago

@stefan-it Were you able to get a working model after that fix? This seems to only affect the decoder layer, but there was an error on both the encoder and the decoder layers.

stefan-it commented 2 years ago

Hi @peregilk , this is just another fix, not the root cause of our main problem.

I did some debugging. The original Flax model weight for the "relative_attention_bias" embedding is:

original = np.asarray([[ 6.1875000e+00, -1.0000000e+01,  3.0625000e+00,
              -9.9375000e+00,  5.1250000e+00, -1.5750000e+01],
             [ 7.7187500e+00,  8.6875000e+00, -8.6328125e-01,
               2.9843750e+00,  8.1250000e+00,  6.4687500e+00],
             [ 5.7187500e+00,  6.7187500e+00, -1.0703125e+00,
               3.6718750e+00,  7.6875000e+00,  6.4062500e+00],
             [ 4.7500000e+00,  5.6562500e+00, -1.5703125e+00,
               3.9531250e+00,  7.3125000e+00,  6.3125000e+00],
             [ 4.1562500e+00,  4.9375000e+00, -2.6718750e+00,
               4.0625000e+00,  6.9375000e+00,  6.1250000e+00],
             [ 3.7343750e+00,  4.3750000e+00, -3.1406250e+00,
               4.1875000e+00,  6.6562500e+00,  6.0000000e+00],
             [ 3.4218750e+00,  3.9531250e+00, -3.3750000e+00,
               4.2187500e+00,  6.4062500e+00,  5.8750000e+00],
             [ 3.1718750e+00,  3.7187500e+00, -3.1718750e+00,
               4.2500000e+00,  6.1875000e+00,  5.7500000e+00],
             [ 2.6093750e+00,  2.9218750e+00, -2.7343750e+00,
               4.2812500e+00,  5.7187500e+00,  5.5312500e+00],
             [ 2.0156250e+00,  2.1718750e+00, -2.2343750e+00,
               4.2812500e+00,  5.0625000e+00,  5.2187500e+00],
             [ 1.5781250e+00,  1.4765625e+00, -1.4609375e+00,
               4.2812500e+00,  4.4062500e+00,  4.8750000e+00],
             [ 1.0703125e+00,  7.1875000e-01, -1.5625000e+00,
               4.2500000e+00,  3.6875000e+00,  4.4375000e+00],
             [ 6.6796875e-01,  3.4179688e-01, -1.0859375e+00,
               4.1250000e+00,  2.8750000e+00,  3.9531250e+00],
             [ 8.6914062e-02, -1.5075684e-02, -1.0781250e+00,
               4.0312500e+00,  2.0156250e+00,  3.4218750e+00],
             [-1.3574219e-01, -2.0605469e-01, -8.3984375e-01,
               3.9375000e+00,  1.3359375e+00,  2.9218750e+00],
             [-1.8437500e+00, -5.1171875e-01, -4.4335938e-01,
               3.7343750e+00, -3.4960938e-01,  1.7968750e+00],
             [ 8.0078125e-02, -8.0078125e-02,  3.3398438e-01,
              -3.5937500e-01,  3.1445312e-01, -2.7929688e-01],
             [ 3.4687500e+00, -1.2390137e-02,  8.6875000e+00,
               4.4375000e+00, -5.5468750e-01, -1.9000000e+01],
             [ 2.9062500e+00, -2.5585938e-01,  7.4062500e+00,
               5.0000000e+00, -1.0449219e-01,  8.0859375e-01],
             [ 2.7031250e+00, -2.3046875e-01,  6.7500000e+00,
               5.1562500e+00, -2.0996094e-01,  1.2812500e+00],
             [ 2.4375000e+00, -1.5234375e-01,  6.3125000e+00,
               5.1875000e+00, -1.5625000e-01,  1.4453125e+00],
             [ 2.2031250e+00, -2.8320312e-01,  5.9375000e+00,
               5.1875000e+00, -1.1328125e-01,  1.4375000e+00],
             [ 2.2343750e+00, -1.9140625e-01,  5.6562500e+00,
               5.1562500e+00, -1.2890625e-01,  1.5312500e+00],
             [ 1.9687500e+00, -2.2558594e-01,  5.4062500e+00,
               5.1875000e+00, -1.5332031e-01,  1.5781250e+00],
             [ 1.8203125e+00, -1.9140625e-01,  4.9062500e+00,
               5.1562500e+00, -1.8066406e-01,  1.6093750e+00],
             [ 1.6093750e+00, -1.9042969e-01,  4.2812500e+00,
               5.0000000e+00, -6.6406250e-02,  1.6328125e+00],
             [ 1.2734375e+00, -1.6601562e-01,  3.6718750e+00,
               4.9062500e+00, -1.6601562e-01,  1.6484375e+00],
             [ 1.0781250e+00, -2.4218750e-01,  2.9375000e+00,
               4.7187500e+00, -2.1582031e-01,  1.5937500e+00],
             [ 6.2500000e-01, -1.7871094e-01,  2.2187500e+00,
               4.5312500e+00, -1.4843750e-01,  1.5703125e+00],
             [ 2.4609375e-01, -1.9921875e-01,  1.3984375e+00,
               4.3437500e+00, -2.2656250e-01,  1.5625000e+00],
             [-1.3281250e-01, -1.7285156e-01,  7.3437500e-01,
               4.1562500e+00, -2.1875000e-01,  1.5703125e+00],
             [-1.7343750e+00, -1.8359375e-01, -4.4531250e-01,
               3.7812500e+00, -2.6171875e-01,  1.4921875e+00]])

this has a shape of (32, 6).

The loaded T5X weight has a shape of (6, 32) and looks like:

t5x = np.asarray([[ 6.1875e+00,  7.7188e+00,  5.7188e+00,  4.7500e+00,  4.1562e+00,
          3.7344e+00,  3.4219e+00,  3.1719e+00,  2.6094e+00,  2.0156e+00,
          1.5781e+00,  1.0703e+00,  6.6797e-01,  8.6914e-02, -1.3574e-01,
         -1.8438e+00,  8.0078e-02,  3.4688e+00,  2.9062e+00,  2.7031e+00,
          2.4375e+00,  2.2031e+00,  2.2344e+00,  1.9688e+00,  1.8203e+00,
          1.6094e+00,  1.2734e+00,  1.0781e+00,  6.2500e-01,  2.4609e-01,
         -1.3281e-01, -1.7344e+00],
        [-1.0000e+01,  8.6875e+00,  6.7188e+00,  5.6562e+00,  4.9375e+00,
          4.3750e+00,  3.9531e+00,  3.7188e+00,  2.9219e+00,  2.1719e+00,
          1.4766e+00,  7.1875e-01,  3.4180e-01, -1.5076e-02, -2.0605e-01,
         -5.1172e-01, -8.0078e-02, -1.2390e-02, -2.5586e-01, -2.3047e-01,
         -1.5234e-01, -2.8320e-01, -1.9141e-01, -2.2559e-01, -1.9141e-01,
         -1.9043e-01, -1.6602e-01, -2.4219e-01, -1.7871e-01, -1.9922e-01,
         -1.7285e-01, -1.8359e-01],
        [ 3.0625e+00, -8.6328e-01, -1.0703e+00, -1.5703e+00, -2.6719e+00,
         -3.1406e+00, -3.3750e+00, -3.1719e+00, -2.7344e+00, -2.2344e+00,
         -1.4609e+00, -1.5625e+00, -1.0859e+00, -1.0781e+00, -8.3984e-01,
         -4.4336e-01,  3.3398e-01,  8.6875e+00,  7.4062e+00,  6.7500e+00,
          6.3125e+00,  5.9375e+00,  5.6562e+00,  5.4062e+00,  4.9062e+00,
          4.2812e+00,  3.6719e+00,  2.9375e+00,  2.2188e+00,  1.3984e+00,
          7.3438e-01, -4.4531e-01],
        [-9.9375e+00,  2.9844e+00,  3.6719e+00,  3.9531e+00,  4.0625e+00,
          4.1875e+00,  4.2188e+00,  4.2500e+00,  4.2812e+00,  4.2812e+00,
          4.2812e+00,  4.2500e+00,  4.1250e+00,  4.0312e+00,  3.9375e+00,
          3.7344e+00, -3.5938e-01,  4.4375e+00,  5.0000e+00,  5.1562e+00,
          5.1875e+00,  5.1875e+00,  5.1562e+00,  5.1875e+00,  5.1562e+00,
          5.0000e+00,  4.9062e+00,  4.7188e+00,  4.5312e+00,  4.3438e+00,
          4.1562e+00,  3.7812e+00],
        [ 5.1250e+00,  8.1250e+00,  7.6875e+00,  7.3125e+00,  6.9375e+00,
          6.6562e+00,  6.4062e+00,  6.1875e+00,  5.7188e+00,  5.0625e+00,
          4.4062e+00,  3.6875e+00,  2.8750e+00,  2.0156e+00,  1.3359e+00,
         -3.4961e-01,  3.1445e-01, -5.5469e-01, -1.0449e-01, -2.0996e-01,
         -1.5625e-01, -1.1328e-01, -1.2891e-01, -1.5332e-01, -1.8066e-01,
         -6.6406e-02, -1.6602e-01, -2.1582e-01, -1.4844e-01, -2.2656e-01,
         -2.1875e-01, -2.6172e-01],
        [-1.5750e+01,  6.4688e+00,  6.4062e+00,  6.3125e+00,  6.1250e+00,
          6.0000e+00,  5.8750e+00,  5.7500e+00,  5.5312e+00,  5.2188e+00,
          4.8750e+00,  4.4375e+00,  3.9531e+00,  3.4219e+00,  2.9219e+00,
          1.7969e+00, -2.7930e-01, -1.9000e+01,  8.0859e-01,  1.2812e+00,
          1.4453e+00,  1.4375e+00,  1.5312e+00,  1.5781e+00,  1.6094e+00,
          1.6328e+00,  1.6484e+00,  1.5938e+00,  1.5703e+00,  1.5625e+00,
          1.5703e+00,  1.4922e+00]])

So t5x needs to be re-shaped to (32, 6), but then the order is wrong!

Any help would be highly appreciated; I tried different order options, such as C, A and F (see here).

stefan-it commented 2 years ago

Ah.... re-shape is wrong here, I think it needs to be transposed?!
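
(A quick NumPy illustration of the difference: for a non-square array, reshape keeps the flat row-major element order while transpose actually swaps rows and columns, which is exactly the mismatch visible in the relative_attention_bias values above:)

import numpy as np

# Stand-in for the (6, 32) T5X rel_embedding, small enough to eyeball.
a = np.arange(12).reshape(3, 4)

print(a.reshape(4, 3))  # same element order, just re-chunked into new rows
print(a.T)              # true transpose: same shape (4, 3), different contents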

stefan-it commented 2 years ago

@peregilk Could you please test the new version of the script:

https://gist.github.com/stefan-it/30e4998ef159f33696e377a46f699d9f

The main differences are in lines 52 and 113.

Btw, I've seen that you are using v1.0 and your changes w.r.t. wi_0 and wi_1, so I will also add a kind of compatibility mode for v1.0 :)

peregilk commented 2 years ago

I'll test with 1.1 tomorrow. It would be great if we got it to work for both versions.

stefan-it commented 2 years ago

Cool, thanks! I have already updated the gist, including checks for the wi_0 and wi_1 split. It should then work with both 1.0 and 1.1.
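
(A hedged sketch of what that compatibility check can look like: T5 v1.1 uses a gated feed-forward with two input projections, wi_0 and wi_1, while v1.0 has a single wi, so the converter can branch on which keys are present. The layer path below is illustrative and may not match the exact T5X tree in every version:)

# Sketch: handle both the v1.0 and v1.1 feed-forward layouts when copying
# the MLP weights for one encoder layer. `t5x_model` is the loaded checkpoint.
mlp = t5x_model["target"]["encoder"]["layers_0"]["mlp"]  # illustrative path

if "wi_0" in mlp:
    # v1.1-style gated feed-forward: two input projections.
    wi_0 = mlp["wi_0"]["kernel"]
    wi_1 = mlp["wi_1"]["kernel"]
else:
    # v1.0-style feed-forward: a single input projection.
    wi = mlp["wi"]["kernel"]
wo = mlp["wo"]["kernel"]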

patrickvonplaten commented 2 years ago

Cool, would also be very happy to help by adding a conversion script to HF's transformers somewhere!

peregilk commented 2 years ago

Great work, @stefan-it! I have converted a base t5.1.1-model, using the two conversion scripts. The models are identical!

For me, this is working now, and will allow me to make my T5X models available in HuggingFace. Thanks a lot!

What else is needed here @patrickvonplaten?

stefan-it commented 2 years ago

@peregilk Great to hear!

@patrickvonplaten I can create a PR for it that adds the conversion script (then located under /models/t5). I will also try to run some downstream evaluations on it.

peregilk commented 2 years ago

@stefan-it I am trying to convert some of my own models using the script. These are large mT5 models (bfloat16) that I know give good inference results within T5X. I am trying to upload them to HuggingFace and use the widget for downstream tasks. Unfortunately, I am not getting very good results. I am currently debugging this. Did you have any luck running downstream tasks with the converted models?

stefan-it commented 2 years ago

@peregilk I think I've found a missing parameter: t5x_model["target"]["decoder"]["logits_dense"] is currently not used in the conversion script. Output is:

{'kernel': array([[-0.66015625, -0.24804688, -0.35546875, ..., -0.60546875,
         -0.70703125, -0.69140625],
        [ 0.15722656,  0.3671875 , -0.36132812, ...,  0.16308594,
          0.20410156,  0.14257812],
        [ 0.17773438,  0.8125    , -0.06884766, ...,  0.17285156,
          0.19140625,  0.17382812],
        ...,
        [-0.41992188, -0.03710938, -0.3671875 , ..., -0.3359375 ,
         -0.34179688, -0.40820312],
        [-0.59375   , -1.015625  ,  0.41796875, ..., -0.5703125 ,
         -0.5625    , -0.52734375],
        [-0.9609375 , -0.55078125, -0.14648438, ..., -0.953125  ,
         -0.921875  , -0.9296875 ]], dtype=float32)}

When I load the original T5 model checkpoint using:

from transformers import T5ForConditionalGeneration
model = T5ForConditionalGeneration.from_pretrained("google/t5-v1_1-small")

The lm_head.weight tensor looks very much like a transposed version of it:

 ('lm_head.weight',
  Parameter containing:
  tensor([[-0.6602,  0.1572,  0.1777,  ..., -0.4199, -0.5938, -0.9609],
          [-0.2480,  0.3672,  0.8125,  ..., -0.0371, -1.0156, -0.5508],
          [-0.3555, -0.3613, -0.0688,  ..., -0.3672,  0.4180, -0.1465],
          ...,
          [-0.6055,  0.1631,  0.1729,  ..., -0.3359, -0.5703, -0.9531],
          [-0.7070,  0.2041,  0.1914,  ..., -0.3418, -0.5625, -0.9219],
          [-0.6914,  0.1426,  0.1738,  ..., -0.4082, -0.5273, -0.9297]],
         requires_grad=True))]

Additionally, the comparison script shows a difference for lm_head. So I'm going to implement this now, and let's see if it works!
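
(A small sanity check of that hypothesis, assuming logits_dense_kernel holds the t5x_model["target"]["decoder"]["logits_dense"]["kernel"] array printed above and model is the loaded google/t5-v1_1-small:)

import numpy as np

# lm_head.weight is (vocab_size, d_model); the T5X logits_dense kernel is
# (d_model, vocab_size), so a transpose should make them line up.
lm_head = model.lm_head.weight.detach().numpy()
print(lm_head.shape, logits_dense_kernel.shape)

print(np.allclose(lm_head, np.asarray(logits_dense_kernel, dtype=np.float32).T, atol=1e-3))
# If this prints True, the converter just needs to copy the (transposed)
# logits_dense kernel into the head instead of leaving it out.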

stefan-it commented 2 years ago

On the downstream task (I'm using the summarization example for T5), the original T5 small checkpoint yields:

{'loss': 4.823, 'learning_rate': 4.965170871703423e-05, 'epoch': 0.01}   
{'loss': 3.8096, 'learning_rate': 4.9303417434068464e-05, 'epoch': 0.01}              
{'loss': 3.5531, 'learning_rate': 4.8955126151102695e-05, 'epoch': 0.02}                                                                                                      
{'loss': 3.3806, 'learning_rate': 4.8606834868136925e-05, 'epoch': 0.03}                   
{'loss': 3.2755, 'learning_rate': 4.825854358517115e-05, 'epoch': 0.03}      
{'loss': 3.2111, 'learning_rate': 4.791025230220538e-05, 'epoch': 0.04}                                
{'loss': 3.1604, 'learning_rate': 4.756196101923962e-05, 'epoch': 0.05}                                                                 
{'loss': 3.1098, 'learning_rate': 4.721366973627384e-05, 'epoch': 0.06}                     
{'loss': 3.0518, 'learning_rate': 4.686537845330807e-05, 'epoch': 0.06}                                                                          
{'loss': 3.0341, 'learning_rate': 4.65170871703423e-05, 'epoch': 0.07}

whereas training from the T5X-converted checkpoint shows:

{'loss': 8.3275, 'learning_rate': 4.965170871703423e-05, 'epoch': 0.01}
{'loss': 6.7617, 'learning_rate': 4.9303417434068464e-05, 'epoch': 0.01}
{'loss': 6.4534, 'learning_rate': 4.8955126151102695e-05, 'epoch': 0.02}
{'loss': 6.2175, 'learning_rate': 4.8606834868136925e-05, 'epoch': 0.03}
{'loss': 6.0008, 'learning_rate': 4.825854358517115e-05, 'epoch': 0.03}
{'loss': 5.8203, 'learning_rate': 4.791025230220538e-05, 'epoch': 0.04}
{'loss': 5.6846, 'learning_rate': 4.756196101923962e-05, 'epoch': 0.05}
{'loss': 5.5439, 'learning_rate': 4.721366973627384e-05, 'epoch': 0.06}
{'loss': 5.4265, 'learning_rate': 4.686537845330807e-05, 'epoch': 0.06}
{'loss': 5.3312, 'learning_rate': 4.65170871703423e-05, 'epoch': 0.07}

(higher loss at the beginning). So I can also confirm that there's something wrong here.

I'll fix it now!

stefan-it commented 2 years ago

Hey @peregilk ,

after updating the conversion gist, the lm_head is now the same... and training on the downstream task shows the same loss as for the original T5 checkpoint:

{'loss': 4.823, 'learning_rate': 4.965170871703423e-05, 'epoch': 0.01}
{'loss': 3.8096, 'learning_rate': 4.9303417434068464e-05, 'epoch': 0.01}                                                                                                      
{'loss': 3.5531, 'learning_rate': 4.8955126151102695e-05, 'epoch': 0.02}

This should hopefully solve all problems now :hugs:

peregilk commented 2 years ago

@stefan-it Great work! I have now converted my first large mT5 model in T5X format to HuggingFace! The model seems to be working perfectly after running your script and then using the standard method for converting it to PyTorch and adding a tokenizer.
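
(For completeness, the "standard method" referred to here is presumably along these lines: load the converted Flax weights into the PyTorch class, save them, and bundle a matching tokenizer. Paths and the tokenizer ID are placeholders:)

from transformers import T5ForConditionalGeneration, AutoTokenizer

# Load the Flax weights produced by the conversion script into the PyTorch
# class and save them as a PyTorch checkpoint. For mT5 checkpoints,
# MT5ForConditionalGeneration or AutoModelForSeq2SeqLM also works.
model = T5ForConditionalGeneration.from_pretrained("/path/to/converted_flax_model", from_flax=True)
model.save_pretrained("/path/to/pytorch_model")

# Reuse the tokenizer of the matching public checkpoint (placeholder ID).
tokenizer = AutoTokenizer.from_pretrained("google/mt5-large")
tokenizer.save_pretrained("/path/to/pytorch_model")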