huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl
Apache License 2.0

Error while loading a pretrained EncoderDecoder model #670

Closed ribesstefano closed 10 months ago

ribesstefano commented 1 year ago

Hi all, I've trained an EncoderDecoder model but I cannot load it via AutoModelForSeq2SeqLMWithValueHead.from_pretrained.

My EncoderDecoder model was defined by tying together two RoBERTa models, as follows:

from transformers import EncoderDecoderModel

pretrained_encoder = "seyonec/ChemBERTa-zinc-base-v1"
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    pretrained_encoder,
    pretrained_encoder,
    tie_encoder_decoder=True,
)

Transformers and TRL versions:

trl.__version__: 0.5.0
transformers.__version__: 4.31.0

Code to reproduce the issue:

import torch
from transformers import (
    EncoderDecoderConfig,
    AutoTokenizer,
)
from trl import (
    AutoModelForSeq2SeqLMWithValueHead,
    PPOConfig,
    PPOTrainer,
    create_reference_model,
)

pretrained_model = "ribesstefano/ChemBERTa2ChemBERTa-58M"
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(pretrained_model)
model_ref = create_reference_model(model, num_shared_layers=6)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
tokenizer.pad_token = tokenizer.eos_token

Error message:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
[<ipython-input-45-7d8719dacb7f>](https://localhost:8080/#) in <cell line: 22>()
     20 # print(bert2bert.config.hidden_size)
     21 
---> 22 model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(pretrained_model)
     23 # # model_ref = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(pretrained_model, device_map="auto", load_in_8bit=True)
     24 # model_ref = create_reference_model(model, num_shared_layers=6)

[/usr/local/lib/python3.10/dist-packages/trl/models/modeling_base.py](https://localhost:8080/#) in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    230                     raise ValueError("PromptLearningConfig is not supported for PPO training.")
    231         # Then, create the full model by instantiating the wrapper class
--> 232         model = cls(pretrained_model, **trl_model_args)
    233 
    234         # if resume_training, load the state_dict again - this is ok since the

[/usr/local/lib/python3.10/dist-packages/trl/models/modeling_value_head.py](https://localhost:8080/#) in __init__(self, pretrained_model, **kwargs)
    289             raise ValueError("The model does not have a language model head, please use a model that has one.")
    290 
--> 291         self.v_head = ValueHead(self.pretrained_model.config, **v_head_kwargs)
    292 
    293         self._init_weights(**v_head_kwargs)

[/usr/local/lib/python3.10/dist-packages/trl/models/modeling_value_head.py](https://localhost:8080/#) in __init__(self, config, **kwargs)
     37             hidden_size = config.word_embed_proj_dim
     38         else:
---> 39             hidden_size = config.hidden_size
     40 
     41         self.summary = nn.Linear(hidden_size, 1)

[/usr/local/lib/python3.10/dist-packages/transformers/configuration_utils.py](https://localhost:8080/#) in __getattribute__(self, key)
    259         if key != "attribute_map" and key in super().__getattribute__("attribute_map"):
    260             key = super().__getattribute__("attribute_map")[key]
--> 261         return super().__getattribute__(key)
    262 
    263     def __init__(self, **kwargs):

AttributeError: 'EncoderDecoderConfig' object has no attribute 'hidden_size'

From what I can gather from the source code, ValueHead only inspects the model's top-level config, expecting a hidden_size attribute. For an EncoderDecoder model, however, that information lives in model.decoder.config.hidden_size. Does this sound like a bug or something to update in the code? Or is there a workaround I can try?
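For illustration, here is a small inspection sketch (assuming the checkpoint above is reachable on the Hub) showing where the attribute actually lives:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("ribesstefano/ChemBERTa2ChemBERTa-58M")
print(hasattr(config, "hidden_size"))  # False: the composite EncoderDecoderConfig has no hidden_size
print(config.decoder.hidden_size)      # the value lives on the child (decoder) config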

Thanks in advance

younesbelkada commented 1 year ago

Hmm, I think the AutoModelForSeq2SeqLMWithValueHead class does not support the EncoderDecoder class, as it requires some patches. Indeed, the solution would be to look for the child configs; we could do that by passing a flag is_composition_model=True to from_pretrained. Would you be happy to propose a patch for this? Happy to guide you in more detail!

ribesstefano commented 1 year ago

Thanks for the support!

I wrote a quick fix. The following code seems to work now:

pretrained_model = "ribesstefano/ChemBERTa2ChemBERTa-58M"
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(
    pretrained_model,
    is_composition_model=True)
model_ref = create_reference_model(model)
# model_ref = create_reference_model(model, num_shared_layers=6)

but create_reference_model still breaks when the num_shared_layers argument is specified.

This is what I did: in the from_pretrained method of PreTrainedModelWrapper, I added is_composition_model to trl_model_args:

        if kwargs is not None:
            peft_config = kwargs.pop("peft_config", None)
            reward_adapter = kwargs.pop("reward_adapter", None)
            is_trainable = kwargs.pop("is_trainable", False)
            is_composition_model = kwargs.pop("is_composition_model", False)
            trl_model_args, pretrained_kwargs, peft_quantization_kwargs = cls._split_kwargs(kwargs)
            trl_model_args["is_composition_model"] = is_composition_model
            token = pretrained_kwargs.get("token", None)
        else:
            peft_config = None
            is_trainable = False
            trl_model_args = {}
            pretrained_kwargs = {}
            peft_quantization_kwargs = {}
            token = None
            is_composition_model = False

Then, in the AutoModelForSeq2SeqLMWithValueHead __init__ method, I added is_composition_model to the v_head_kwargs:

        self.is_composition_model = kwargs.pop("is_composition_model", False)
        v_head_kwargs["is_composition_model"] = self.is_composition_model

Finally, in the ValueHead __init__ method I do:

is_composition_model = kwargs.pop("is_composition_model", False)
if is_composition_model:
    hidden_size = config.decoder.hidden_size
elif hasattr(config, "word_embed_proj_dim"):
    hidden_size = config.word_embed_proj_dim
else:
    hidden_size = config.hidden_size

Does this make sense? According to the pull request guidelines, I should write a test beforehand. Could you give me some suggestions on that too, please?

younesbelkada commented 1 year ago

Thanks very much @ribesstefano, the solution sounds great! I have a better workaround: instead of passing is_composition_model=True, we could maybe do is_composition_model = isinstance(model, transformers.EncoderDecoderModel). In any case, let's see in the PR! For the tests, I have just pushed a tiny model to the Hub: https://huggingface.co/trl-internal-testing/tiny-Roberta-EncoderDecoderModel
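To illustrate the idea, here is a rough sketch of the isinstance-based detection (the helper name is just a placeholder, not the final implementation):

import transformers

def _is_composition_model(pretrained_model) -> bool:
    # Composition models such as EncoderDecoderModel keep hidden_size on their
    # child configs, so the value head has to look at config.decoder instead.
    return isinstance(pretrained_model, transformers.EncoderDecoderModel)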

I would say you can just add that model id here: https://github.com/huggingface/trl/blob/main/tests/test_modeling_value_head.py#L54 and see if the tests pass (roughly along the lines of the sketch below).
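For example, the list of Seq2Seq test models in that file could be extended like this (the variable name and surrounding entries here are placeholders; the real ones are in the linked file):

ALL_SEQ2SEQ_MODELS = [
    # ... existing tiny Seq2Seq checkpoints ...
    "trl-internal-testing/tiny-random-FSMTForConditionalGeneration",
    # newly added composition model:
    "trl-internal-testing/tiny-Roberta-EncoderDecoderModel",
]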

Let me know how it goes!

dshvimer commented 1 year ago

@younesbelkada I have a similar error when trying to load Salesforce/codet5p-16b. I was wondering if the hidden_size I need is referenced here? If so, I'm happy to make a PR for the fix if you would offer some guidance. In the meantime I'm looking at the solution posted above.

younesbelkada commented 1 year ago

Hi @dshvimer, thanks! hidden_size should be properly defined in that model, e.g. https://huggingface.co/Salesforce/codet5p-16b/blob/d19a75e34b9fd56b260557468de793b112855f1a/modeling_codet5p.py#L92, so I would expect your script to work. Can you confirm that just running:

model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained("Salesforce/codet5p-16b")

leads to an error for you?

dshvimer commented 1 year ago

@younesbelkada Here is an example notebook with the larger model failing. From my testing, the <=2B checkpoints work with trl but the >2B ones do not.

younesbelkada commented 1 year ago

Thanks for sharing the notebook! From what I can see, this is related to the fact that the second model uses classic T5 (https://huggingface.co/Salesforce/codet5p-220m/blob/main/config.json), which does not use composition in the configuration object, unlike the larger checkpoints, as you can see here: https://huggingface.co/Salesforce/codet5p-6b/blob/main/configuration_codet5p.py#L71. As this is a model that uses the custom-code-on-the-Hub feature, I am not sure we should support all scenarios, as that would be challenging. One thing you could do is push a copy of Salesforce/codet5p-6b under your namespace and manually add self.hidden_size = self.encoder.hidden_size here: https://huggingface.co/Salesforce/codet5p-6b/blob/main/configuration_codet5p.py#L90

dshvimersalesloft commented 1 year ago

@younesbelkada Thank you for the direction. I am thinking this will unblock me for my own use case. To make sure I understand things, could I also set the hidden_size from the trl side, similar to what is posted above? (If I create a subclass or similar)

is_composition_model = kwargs.pop("is_composition_model", False)
if is_composition_model:
    hidden_size = config.decoder.hidden_size
elif hasattr(config, "word_embed_proj_dim"):
    hidden_size = config.word_embed_proj_dim
else:
    hidden_size = config.hidden_size
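
Or, as a local variant of your Hub-copy suggestion, by patching the config before handing it to trl. This is just a rough sketch; I haven't verified that the config kwarg is forwarded untouched to the underlying transformers model:

from transformers import AutoConfig
from trl import AutoModelForSeq2SeqLMWithValueHead

config = AutoConfig.from_pretrained("Salesforce/codet5p-6b", trust_remote_code=True)
config.hidden_size = config.encoder.hidden_size  # expose it at the top level for the ValueHead
model = AutoModelForSeq2SeqLMWithValueHead.from_pretrained(
    "Salesforce/codet5p-6b", config=config, trust_remote_code=True
)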

younesbelkada commented 1 year ago

Yes, the changes you shared make sense! @ribesstefano let us know if you have managed to start the PR so that we can iterate on potential improvements, and let us know if you need help with it! :D

ribesstefano commented 11 months ago

@younesbelkada I've made a much simpler modification, as you suggested, and just changed the ValueHead __init__ in trl/models/modeling_value_head.py:

class ValueHead(nn.Module):
    r"""
    The ValueHead class implements a head for GPT2 that returns a scalar for each output token.
    """

    def __init__(self, config, **kwargs):
        super().__init__()
        if not hasattr(config, "summary_dropout_prob"):
            summary_dropout_prob = kwargs.pop("summary_dropout_prob", 0.1)
        else:
            summary_dropout_prob = config.summary_dropout_prob

        self.dropout = nn.Dropout(summary_dropout_prob) if summary_dropout_prob else nn.Identity()

        # some models such as OPT have a projection layer before the word embeddings - e.g. OPT-350m
        if hasattr(config, "hidden_size"):
            hidden_size = config.hidden_size
        if hasattr(config, "word_embed_proj_dim"):
            hidden_size = config.word_embed_proj_dim
        elif hasattr(config, "is_encoder_decoder"):
            if config.is_encoder_decoder and hasattr(config, "decoder"):
                if hasattr(config.decoder, "hidden_size"):
                    hidden_size = config.decoder.hidden_size

        self.summary = nn.Linear(hidden_size, 1)

        self.flatten = nn.Flatten()

After all, the config that the ValueHead receives is the pretrained model's config, so it is sufficient to check that one alone.
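As a quick sanity check (a sketch assuming the tiny test checkpoint is reachable), the composite config does expose what the new branch looks for:

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("trl-internal-testing/tiny-Roberta-EncoderDecoderModel")
print(cfg.is_encoder_decoder)   # True for the composite config
print(cfg.decoder.hidden_size)  # the value the modified ValueHead now picks up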

However, when running the tests, to which I've added the https://huggingface.co/trl-internal-testing/tiny-Roberta-EncoderDecoderModel model, they break for exactly that newly added model (they pass for all the other models):

===================================================================================================================== FAILURES =====================================================================================================================
____________________________________________________________________________________________ Seq2SeqValueHeadModelTester.test_transformers_bf16_kwargs _____________________________________________________________________________________________
[gw0] linux -- Python 3.10.8 /opt/conda/bin/python

self = <tests.test_modeling_value_head.Seq2SeqValueHeadModelTester testMethod=test_transformers_bf16_kwargs>

    def test_transformers_bf16_kwargs(self):
        r"""
        Test if the transformers kwargs are correctly passed
        Here we check that loading a model in half precision works as expected, i.e. the weights of
        the `pretrained_model` attribute is loaded in half precision and you can run a dummy
        forward pass without any issue.
        """
        for model_name in self.all_model_names:
            trl_model = self.trl_model_class.from_pretrained(model_name, torch_dtype=torch.bfloat16)

            lm_head_namings = self.trl_model_class.lm_head_namings

            if model_name == "trl-internal-testing/tiny-random-FSMTForConditionalGeneration":
                # skip the test for FSMT as it does not support mixed-prec
                continue

>           self.assertTrue(
                any(hasattr(trl_model.pretrained_model, lm_head_naming) for lm_head_naming in lm_head_namings)
            )
E           AssertionError: False is not true

tests/test_modeling_value_head.py:430: AssertionError

Do you think the error is related to the modifications I've done in ValueHead?

younesbelkada commented 11 months ago

Thanks for reporting and for your hard work! Regarding the failing test: since this is an edge case, you can skip it and add a condition similar to the FSMT one, together with a comment. Let me know once you open the PR!
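Something like this inside the loop of test_transformers_bf16_kwargs should do (just a sketch, the exact placement is up to you):

if model_name in [
    "trl-internal-testing/tiny-random-FSMTForConditionalGeneration",  # no mixed-precision support
    "trl-internal-testing/tiny-Roberta-EncoderDecoderModel",  # composed model without a standard lm_head naming
]:
    continue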

ribesstefano commented 11 months ago

@younesbelkada Thanks for all the support, I just made a pull request!