huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Generate: deprecate the use of model `config` as a source of defaults #18655

Closed gante closed 1 year ago

gante commented 2 years ago

EDIT: Updated with the discussion up to 2022/08/20

Why?

A confusing part of generate is how the defaults are set. When a certain argument is not specified, we attempt to fetch it from the model config file. This makes generate unpredictable and hard to fully document (the default values change from model to model), and it is a major source of issues 🔪

How?

We have the following requirements:

  1. The existing behavior can't be removed, i.e., we must be able to use the model config.json as a source of generation parameters by default;
  2. We do need per-model defaults -- some models are designed to do a certain thing (e.g. summarization), which requires a specific generation configuration;
  3. Users must have full control over generate, with minimal hidden behavior.

Ideally, we also want a fourth property: separation of concerns, using a new generate_config.json to parameterize generation.

The TL;DR of the plan is to change the paradigm from "non-specified generate arguments are overridden by the [model] configuration file" to "generate arguments override the [generate] configuration file, which is always used". With proper documentation changes and logging/warnings, the user will be aware of what's being set for generate.

Step 1: Define a new generate config file and class

Similar to the model config, we want a .json file to store the generation defaults. The class itself can be a very simplified version of PretrainedConfig, also with functionality to load/store from the hub.
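
A minimal sketch of what such a class could look like (the names and the plain-file load/store below are illustrative, not the final API; the real class would also wrap the usual Hub download/upload utilities):

import json
from dataclasses import asdict, dataclass, fields

@dataclass
class GenerateConfig:
    """Bare-bones stand-in for the proposed generation config class."""
    max_length: int = 20
    num_beams: int = 1
    do_sample: bool = False

    @classmethod
    def from_json_file(cls, path):
        with open(path) as f:
            raw = json.load(f)
        # Keep only known fields so unknown keys in the file don't break loading
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})

    def to_json_file(self, path):
        with open(path, "w") as f:
            json.dump(asdict(self), f, indent=2)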

Step 2: Integrate loading generate config file in .from_pretrained()

The generation configuration file should be loaded when initializing the model with a from_pretrained() method. A couple of things to keep in mind:

  1. There will be a new kwarg in from_pretrained, generate_config (or generation_config? Leaning toward the former as it has the same name as the function);
  2. It will default to generate_config.json (unlike the model config, which defaults to None). This will allow users to set this argument to None, to load a model with an empty generate config. Some users have requested a feature like this;
  3. Because the argument can take a path, it means that users can store/load multiple generate configs if they wish to do so (e.g. to use the same model for summarization, creative generation, factual question-answering, etc) 🚀
  4. Only models that can run generate will attempt to load it;
  5. If there is no generate_config.json in the repo, it will attempt to initialize the generate configuration from the model config.json. This means that this solution will not change any generate behavior and will NOT need a major release 👼
  6. To keep the user in the loop, log ALL parameters set when loading the generation config file. Something like the snippet below.
  7. Because this happens at from_pretrained() time, logging will only happen at most once and will not be verbose.
`facebook/opt-1.3b` generate configuration loaded from `generate_config.json`. The following generation defaults were set:
- max_length: 20
- foo: bar
- baz: qux
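
Hypothetical usage of the proposed kwarg (a sketch of the proposal above, not a shipped API; the argument name is still under discussion):

from transformers import AutoModelForCausalLM

# Default: load generate_config.json from the repo, if it exists
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")

# Explicit opt-out: start from an empty generate config
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", generate_config=None)

# Point at an alternative file, e.g. a task-specific configuration (point 3 above)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b", generate_config="summarization_generate_config.json")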

Step 3: Generate uses the generate config class internally

Instead of using the configuration to override arguments when they are not set, overwrite a copy of the generation config at generate time. I.e. instead of:

arg = arg if arg is not None else self.config.arg
...

do

generate_config = self.generate_config.copy()
if arg is not None:
    generate_config.arg = arg
...

This change has three main benefits:

  1. We can improve the readability of the code, as we gain the ability to pass configs around. E.g. this function won't need to take a large list of arguments or to handle their initialization.
  2. Argument validation for each type of generation can be built in simple functions that don't need ~30 arguments as input (see the sketch after this list) 🙃
  3. The three frameworks (PT/TF/FLAX) can share functionality like argument validation, decreasing maintenance burden.
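
As an illustration of benefit 2, a validation helper only needs the config object; a sketch with hypothetical attribute names:

def validate_beam_search_config(generate_config):
    """Checks beam-search-specific settings on a generation config object."""
    if generate_config.num_beams is None or generate_config.num_beams <= 1:
        raise ValueError(f"Beam search requires num_beams > 1, got {generate_config.num_beams}")
    if generate_config.do_sample:
        raise ValueError("This beam search path is deterministic; set do_sample=False")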

Step 4: Document and open PRs with the generation config file

Rewrite part of the documentation to explain that a generation config is ALWAYS used (regardless of whether defaults were loaded from the Hub or not). Open Hub PRs to move generate-specific parameters from config.json into generate_config.json.

Pros/Cons

Pros:

Cons:

gante commented 2 years ago

cc @patrickvonplaten

patrickvonplaten commented 2 years ago

I like the idea of a use_config_defaults flag a lot - think that's a great additional safety mechanism to ensure it's possible to keep backward compatibility.

Also, we were thinking about the idea of a generation_config.json file that can optionally be passed to generate by the user and that includes all the default values currently set in the config. This would also make it easier to possibly have multiple different generation configs. Some models like bart-large (https://huggingface.co/facebook/bart-large/blob/main/config.json#L45) always have certain generation parameters enabled by default. IMO it would be a smoother transition to help the user extract a generation_config.json from config.json and then, if present in the repo, always pass this config to generate(...), instead of forcing the user to always pass all those arguments to generate.

With the config, we could do something like the following automatically:
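
(The original snippet was not preserved in this thread; below is a plausible sketch of the automatic extraction being described, with all names hypothetical.)

import json
import os

GENERATION_KEYS = {"max_length", "num_beams", "do_sample", "temperature", "top_k", "top_p"}

def maybe_load_generation_config(repo_dir):
    """Prefer generation_config.json; otherwise extract generation-related keys
    from config.json, so repos like bart-large keep their defaults."""
    gen_path = os.path.join(repo_dir, "generation_config.json")
    if os.path.exists(gen_path):
        with open(gen_path) as f:
            return json.load(f)
    with open(os.path.join(repo_dir, "config.json")) as f:
        model_config = json.load(f)
    return {k: v for k, v in model_config.items() if k in GENERATION_KEYS}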

Also happy to jump on a call to brainstorm about this a bit!

gante commented 2 years ago

Fair point! 👍

From the comment above, let's consider the updated requirements:

  1. Until v5, the default behavior can’t change, i.e., we will use the model config.json as a source of defaults;
  2. From v5 onwards, the default behavior is to use generate_config.json as a source of defaults;
  3. The transition should be as smooth as possible -- users should be able to anticipate it, so that nothing changes unexpectedly when we release the new major version;
  4. We want to use defaults (many models are designed to do a certain thing) while also enabling power users to have full control over generate.

A solution that fits all requirements is the ability to specify where the defaults should be loaded from, with default paths controlled by us. With the aid of a script to create the new generation config file from the existing model config file, the transition should be smooth and users can anticipate any change.

E.g. if we have a generation_config_file flag that defaults to None and accepts a path in the model repo, then we could:

We seem to need two warnings ⚠️ :

  1. [Needed because in v5 we will be defaulting to a new config file, which may not exist in a user's model repo, and the model may have generation parameters in its config] If the configuration file does not exist, fall back to config.json and warn about it. We can quickly scan config.json to avoid raising a warning if it doesn't contain any generation argument;
  2. [Needed because the default behavior will still be to use values from a config, and many users are not aware of it] If generation_config_file is not specifically set by the user, a warning should be raised if the config replaces any argument. Many configs don't replace any value.

Both warnings can be avoided by specifying the generation_config_file argument. They may be a bit verbose, but I think verbosity (which can be silenced easily) is preferable to magical, confusing behavior.
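
A sketch of how warning 2 could work (hypothetical helper, not actual transformers code):

import logging

logger = logging.getLogger(__name__)

def warn_if_config_replaces_args(generation_config_file, replaced_args):
    """Warn only when the user did not pick a config file explicitly and the
    loaded config actually replaced some generate() arguments."""
    if generation_config_file is None and replaced_args:
        logger.warning(
            "The following generate() arguments were taken from a configuration "
            f"file: {sorted(replaced_args)}. Set generation_config_file explicitly "
            "to silence this warning."
        )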

The max_length=20 default (and other similar defaults) can be easily added -- max_length = max_length if max_length is not None else 20 after attempting to load the configs. We can add them to the argument's documentation (see below).


🤔 The only issue I have with this approach is that it is hell to document (similar to the current approach). Having "this argument defaults to X or to config.argument" on every argument's documentation line is verbose and confusing, and users need to be aware that the configuration files play an important role.

My suggestion here would be to make generation_config_file the second argument of generate (after input_ids), so that it becomes immediately clear that generate argument defaults can be set through a file. Then, I would remove further references to the config in the docs, relying on the warnings to remind the user of what's going on. I think it is clear by now that long docs don't avoid simple issues :(

WDYT?

(P.S.: will edit the issue after we settle on an approach :) )

patrickvonplaten commented 2 years ago

Cool, I think this is going in a very nice direction! A couple more questions to think about:

generate_config = ...  # load generation config from path
model.generate(input_ids, config=generate_config)

-> What do you think?

gante commented 2 years ago

@patrickvonplaten Agreed, the argument name is a bit too long 😅 However, if we decide to go the GenerationMixin.__init__ route, we can't pick config -- PreTrainedModel, which inherits from GenerationMixin, uses a config argument for the model config. Perhaps generation_config? We could then do .from_pretrained(foo, generation_config=bar).

I love the ideas you gave around the config:

  1. if it is part of the __init__ and if we always attempt to load the new file format before falling back to the original config, it actually means we don't need to do a major release to build the final version of this updated configuration handling! No need to change defaults with a new release at all ❤️ ;
  2. The idea of "arguments write into a config that is always used" as opposed to "config is used when no arguments are passed" is much clearer to explain. We gain the ability to pass config files around (as opposed to tens of arguments), and it also opens the door to exporting generation configurations;
  3. Despite the above, we need to be careful with the overwrites: if a user calls model.generate(top_k=top_k) and then model.generate(temperature=temperature), top_k should be the original config's top_k. Copies of objects are needed;
  4. Agreed, having all downloads/file paths in the same place is helpful.

Regarding dict vs class -- I'd go with class (or perhaps a simpler dataclass). Much easier to document and enforce correctness, e.g. check if the right arguments are being used with a certain generation type.
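
To illustrate point 3 and the class-over-dict argument, a minimal sketch (illustrative names only):

import copy
from dataclasses import dataclass

@dataclass
class GenerationConfig:
    top_k: int = 50
    temperature: float = 1.0

def resolve_generation_config(base_config, **overrides):
    """Apply per-call overrides to a copy, so calls never leak into each other."""
    config = copy.deepcopy(base_config)
    for name, value in overrides.items():
        if not hasattr(config, name):
            # Typo protection is trivial with a class, awkward with a plain dict
            raise ValueError(f"Unknown generation argument: {name}")
        setattr(config, name, value)
    return config

base = GenerationConfig()
resolve_generation_config(base, top_k=10)
second = resolve_generation_config(base, temperature=0.7)
assert second.top_k == 50  # the first call's top_k=10 did not stick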


It seems like we are in agreement. Are there more issues we can anticipate?

patrickvonplaten commented 2 years ago

Very nice summary @gante, thanks for writing this all down - I agree with all the above points!

@LysandreJik @sgugger and maybe @thomwolf could you take a quick look here? I think @gante and I have now an actionable plan for generate() and would be ready to open a PR.

Before starting the PR, it would be nice if you could check whether you generally agree with our comments here, so that we're not on a totally different page before opening such a big PR. The PR will then take some time and require discussion, but I think we have a clear vision of what we want now.

gante commented 2 years ago

@patrickvonplaten @LysandreJik @sgugger @thomwolf -- I took the liberty of updating the issue at the top with the plan that originated from the discussion here (and also to structure the whole thing in my head better) :)

LysandreJik commented 2 years ago

Thanks for the write-up! I think this is a much welcome change that will tremendously improve the way we use generate.

Writing down some thoughts below.

The biggest work here will likely be education & documentation. I think this will already make things much clearer, but I suppose the much-awaited generate method doc rework will be an absolute requirement after this refactor!

gante commented 2 years ago

Agreed, the biggest issue is and will be education and documentation. Hopefully, this will make the process easier 🙏

# With the `task` parameter, it is trivial to share the parameters for some desired behavior
from transformers import AutoModelForSeq2SeqLM

# When loading the model, the existence of task-specific options would be logged to the user.
model = AutoModelForSeq2SeqLM.from_pretrained("...")
input_prompt = ...
task_tokens = model.generate(**input_prompt, task="my_task")
# There would be an exception if `my_task` is not specified in the generation config file.

sgugger commented 2 years ago

The plan looks good to me, but the devil will be in the details ;-) Looking forward to the PRs actioning this!

gante commented 1 year ago

Closing -- generation_config is now the source of defaults for .generate()