huggingface / transformers

πŸ€— Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Switching from argparse to Typer #2959

Closed kabirkhan closed 4 years ago

kabirkhan commented 4 years ago

πŸš€ Feature request

I'm pretty new to transformers and I'm finding the examples a bit hard to read, mainly because argparse is so verbose. Is there any interest in migrating to something like Plac, or even better, Typer? I think it would make all of the examples a lot easier to grasp right away and friendlier to newer users.

Motivation

It will make the examples easier to read and a lot shorter. Currently, in an example like https://github.com/huggingface/transformers/blob/master/examples/run_language_modeling.py, just parsing CLI arguments takes 140 of the 799 lines of code (17.5% of the file).

The args object is then passed around to all of the other functions, which is hard to follow when first reading the example, since it isn't clear exactly which values each function needs.
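
To make the pattern concrete, here's a minimal sketch of the shape those example scripts take (not the actual run_language_modeling.py code; the argument names are just illustrative):

import argparse

def train(args):
    # the signature gives no hint of which of the many parsed values this function uses
    print(args.model_name_or_path, args.learning_rate)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", type=str, required=True)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    # ... in the real example, roughly 140 more lines of add_argument calls ...
    args = parser.parse_args()
    train(args)

if __name__ == "__main__":
    main()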

Your contribution

I'm happy to contribute the change across transformers-cli and all of the examples over time. I'd like to better understand the project's requirements around adding a new dependency before submitting any PRs.

kabirkhan commented 4 years ago

Looks like sacremoses already uses Click as a dependency, so using Typer would be ideal: its only dependency is Click, and it provides a much nicer interface than Click by using Python 3.6+ type hints.

kabirkhan commented 4 years ago

Also, since Typer uses Python 3.6+ type hints, this would require dropping Python 3.5. While that's not ideal, it is the general trend for a lot of projects these days, as the benefits of type hints and guaranteed dict ordering are really high when it comes to testing and general usability.

There's more discussion here about Python 3.5 and whether it's worth supporting, since only a little over 1% of installs come from Python 3.5: https://github.com/huggingface/transformers/issues/2608

srush commented 4 years ago

So I don't think the issue is argument parsing; that code is pretty much declarative and will look the same in any implementation.

We have been experimenting with using lightning to simplify the interface for these models. Do you find the code here to be more readable?

https://github.com/huggingface/transformers/blob/master/examples/ner/run_pl_ner.py

kabirkhan commented 4 years ago

Code Readability

I actually think we're talking about 2 different problems:

  1. Readability of the examples overall is not excellent. (This is addressed nicely with PyTorch Lightning.)
  2. The management of command line args with argparse creates confusion, because you're passing untyped arguments around to arbitrary functions, and declaring arguments is very verbose. Typer turns typed Python functions into CLIs mostly automatically and cleans up the code a lot (see the sketch below).
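
As a minimal sketch of what I mean (not code from either PR; the parameter names are illustrative):

import typer

def train(
    model_name_or_path: str = typer.Option(..., help="Model checkpoint to load"),
    learning_rate: float = typer.Option(5e-5, help="Initial learning rate"),
):
    """Train a model (placeholder body)."""
    print(model_name_or_path, learning_rate)

if __name__ == "__main__":
    typer.run(train)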

I've implemented a couple of draft PRs for using Typer:

  1. A migration of one example script to Typer: https://github.com/huggingface/transformers/pull/2962
  2. Full rewrite of the transformers-cli using Typer: https://github.com/huggingface/transformers/pull/2974

Typer reduces the amount of code significantly in both cases while also making the functions easier to read and understand.

CLI Usability

There's a larger discussion to be had about CLI arguments vs. options as well. Most of the examples are overly verbose to run with the existing CLI options.

For instance, in the run_generation.py example I migrated (PR 1 above), there are only 2 required options (model_type and model_name_or_path). I made these Typer Options so as not to break convention for now, but they should both be arguments.

That way, instead of writing:

python examples/run_generation.py --model_type gpt2 --model_name_or_path distilgpt2

the user can write something like:

python examples/run_generation.py gpt2 distilgpt2

And the docstring for the function can document that these arguments are required. Typer automatically uses the docstring in the help.
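
A rough sketch of that shape, assuming Typer (not the exact code in the PR; the option list is trimmed):

import typer

def run_generation(
    model_type: str = typer.Argument(..., help="One of gpt2, ctrl, openai-gpt, xlnet, transfo-xl, xlm"),
    model_name_or_path: str = typer.Argument(..., help="A supported model name or path for that model_type"),
    prompt: str = typer.Option("", help="Prompt to generate from"),
    length: int = typer.Option(20, help="Number of tokens to generate"),
):
    """Generate text based on a prompt."""
    ...

if __name__ == "__main__":
    typer.run(run_generation)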

So here's the automatic help docs for run_generation

python examples/run_generation.py --help
Usage: run_generation.py [OPTIONS] MODEL_TYPE MODEL_NAME_OR_PATH

  Generate text based on a prompt using one of [gpt2, ctrl, openai-gpt,
  xlnet, transfo-xl, xlm] as the model_type and a supported model name or
  path for that model_type

  e.g.

      $ python examples/run_generation.py gpt2 distilgpt2

Options:
  --prompt TEXT
  --length INTEGER
  --stop-token TEXT               Token at which text generation is stopped
  --temperature FLOAT             temperature of 1.0 has no effect, lower tend
                                  toward greedy sampling
  --repetition-penalty FLOAT      primarily useful for CTRL model; in that
                                  case, use 1.2
  --k INTEGER
  --p FLOAT
  --padding-text TEXT             Padding text for Transfo-XL and XLNet.
  --xlm-language TEXT             Optional language when used with the XLM
                                  model
  --seed INTEGER                  random seed for initialization
  --no-cuda                       Don't use CUDA and run on CPU.
  --num-return-sequences INTEGER  The number of samples to generate.
  --help                          Show this message and exit.

srush commented 4 years ago

The work on transformers-cli (2) seems interesting, as there are complex types there. I am pretty unconvinced on (1): the code reduction is mostly aesthetic, and I don't see any real complexity wins. Given that, I'm apt to stick with argparse, as it is standard. (The argument/options thing could also be done in argparse.)
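
For instance, a minimal sketch of the same positional-argument idea in plain argparse (illustrative, not code from the examples):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("model_type", choices=["gpt2", "ctrl", "openai-gpt", "xlnet", "transfo-xl", "xlm"])
parser.add_argument("model_name_or_path")
parser.add_argument("--length", type=int, default=20)
args = parser.parse_args()
# invoked as: python examples/run_generation.py gpt2 distilgpt2 --length 40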

kabirkhan commented 4 years ago

Thanks for the feedback

Actually, I think it's more standard these days to use a CLI parsing dependency than argparse. Not a huge deal, and it's not my library, but I've heard the same feedback about argparse in the examples from a few colleagues around Microsoft, which is why I decided to propose the change.

If you do have some time to give (2) a quick review, that would be awesome. I think the changes there offer a lot of clarity, particularly with the use of Enum types (see the sketch below).
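
For example, a hypothetical sketch of the Enum pattern (the members here are illustrative, not the ones in PR 2974):

from enum import Enum

import typer

class ModelType(str, Enum):
    gpt2 = "gpt2"
    ctrl = "ctrl"
    xlnet = "xlnet"

def main(model_type: ModelType = typer.Argument(...)):
    """Typer lists the Enum members as allowed choices in --help and validates the input."""
    typer.echo(model_type.value)

if __name__ == "__main__":
    typer.run(main)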

srush commented 4 years ago

@julien-c any thoughts on this? I don't think we want another dependency, but @kabirkhan did put a lot of work into restructuring CLI https://github.com/huggingface/transformers/pull/2974

BramVanroy commented 4 years ago

My two cents, or maybe just one cent: I have always been torn on this, the same as with plac. It feels more verbose than argparse, but at the same time it doesn't.

Here, in this case, we already have the register_subcommands, so I think Typer actually makes sense. Looking at the code, it does greatly reduce redundancy (except for the app.command() calls). However, it is a lot less known (I think), and I still feel it is less intuitive than good ol' argparse. So if this were a vote, I'd vote aye, keeping in mind that there might be a learning curve. On top of that, all future CLI scripts should also use this library to streamline the experience and development. As a bonus, dropping 3.5 support is an excellent side effect in my opinion (typing, f-strings, ordered dicts), but one might want to check how many users are on 3.5.

kabirkhan commented 4 years ago

@BramVanroy thanks for the input. Yeah, the app.command() calls are annoying; I just fixed that with a for loop (see the sketch below). I hear your point about the learning curve, but Typer is truly very easy to learn and really well documented. It's built by tiangolo (the same person who made FastAPI). Also, the current CLI already requires Python 3.6.
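
Roughly like this (a hypothetical sketch; the command names are placeholders, not the actual transformers-cli commands):

import typer

app = typer.Typer()

def download(model: str):
    """Download a model (placeholder)."""
    ...

def convert(checkpoint: str):
    """Convert a checkpoint (placeholder)."""
    ...

# register every subcommand in one place instead of repeating the decorator
for command in (download, convert):
    app.command()(command)

if __name__ == "__main__":
    app()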

julien-c commented 4 years ago

@BramVanroy Less than 1% of pip installs are on Python 3.5 according to https://pypistats.org/packages/transformers – we will probably drop it in the next couple of weeks or months.

@kabirkhan Thanks for the work you've put into those PRs! This is interesting, and analogous to ideas we've been discussing internally and here in other issues. More generally we are in the process of thinking about rebuilding our training example scripts in a more scalable/re-usable way. We'll link our first thoughts from here.

julien-c commented 4 years ago

I'll close this as we now have a very simple (100 lines of code) built-in argument parser named HfArgumentParser.
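
For reference, a minimal sketch of the dataclass-based pattern it enables (the fields here are illustrative, not taken from a specific example script):

from dataclasses import dataclass, field
from typing import Optional

from transformers import HfArgumentParser

@dataclass
class ModelArguments:
    model_name_or_path: str = field(metadata={"help": "Path to pretrained model or model identifier"})
    cache_dir: Optional[str] = field(default=None, metadata={"help": "Where to store downloaded models"})

parser = HfArgumentParser(ModelArguments)
(model_args,) = parser.parse_args_into_dataclasses()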

Example scripts are now way cleaner after #3800.

Will also close associated PR #2974.

Would love to get your feedback @kabirkhan, thanks for prompting our reflection.