huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Swapping `tqdm` to `rich` #29064

Open alexge233 opened 6 months ago

alexge233 commented 6 months ago

Feature request

Hi, for `AutoTokenizer.train_new_from_iterator` there's a hardcoded tqdm progress bar that I want to swap to rich, and I'm happy to PR the change back.

On GitHub I can see it's in `transformers/src/transformers/tokenization_utils_fast.py`, and at lines #790 and #791 there's a further method `train_from_iterator`, but I can't find where that method is actually defined. Can anyone point me in the right direction?

Also, is there any reason not to add rich as a dependency? Where are the tqdm-specific bits of code, so I can go through them? Thanks!
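
For context, a minimal usage sketch of the method in question (the checkpoint name and corpus are just placeholders):

```python
# Minimal sketch, assuming a small in-memory corpus; "gpt2" is only an
# example checkpoint. train_new_from_iterator retrains the fast tokenizer's
# model on new text, and this is where the progress bar appears.
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

corpus = ["some text", "more text"] * 1000  # placeholder data

def batch_iterator(batch_size=1000):
    for i in range(0, len(corpus), batch_size):
        yield corpus[i : i + batch_size]

new_tokenizer = old_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=32000
)
```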

Motivation

I'm not fond of tqdm; it seems to create issues when used on AWS, SageMaker, etc. Its output takes up a lot of space yet doesn't convey nearly as much information as rich can. I want to start with `AutoTokenizer` because it's where I first spotted the problem.

Your contribution

Slowly work through the bits of code that rely on tqdm and add the option to swap in rich instead.

geronimi73 commented 6 months ago

> On GitHub I can see it's in `transformers/src/transformers/tokenization_utils_fast.py`, and at lines #790 and #791 there's a further method `train_from_iterator`, but I can't find where that method is actually defined. Can anyone point me in the right direction?

train_from_iterator is defined here: https://github.com/huggingface/transformers/blob/ee3af60be0d21044692211d97dfd858aa3e4b418/examples/flax/language-modeling/t5_tokenizer_model.py#L89

alexge233 commented 6 months ago

@geronimi73 This seems to imply it's specific to `t5_tokenizer_model`, i.e. only for a T5? Isn't there a common base class?

geronimi73 commented 6 months ago

yes, you're totally right, I'm sorry! Can you show me the entire code you're using?

amyeroberts commented 6 months ago

Hi @alexge233, thanks for opening this feature request!

There have been previous internal discussions about adding rich, and the conclusion was to stick with tqdm for the moment, as:

- Enabling rich would mean adding conditional logic that would still have to fall back to tqdm.
- We want to keep our (optional) dependencies as small as possible, and adding rich would go against that at the moment.
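
For illustration, a minimal sketch of what such conditional fallback logic could look like (the `progress` helper below is hypothetical, not part of transformers):

```python
# Hypothetical helper illustrating the conditional fallback described above:
# prefer rich if it happens to be installed, otherwise use tqdm.
def progress(iterable, description="Working..."):
    try:
        from rich.progress import track  # optional dependency
        return track(iterable, description=description)
    except ImportError:
        from tqdm.auto import tqdm  # guaranteed fallback
        return tqdm(iterable, desc=description)

for _ in progress(range(10), description="demo"):
    pass
```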

alexge233 commented 6 months ago

Hi,

Thanks for your reply and feedback. I'm a bit surprised, because in my experience it's tqdm that seems to have issues, especially in cloud environments, where I see re-rendering of previous progress bars, incorrect widths in the terminal, etc.

That was my primary reason for wanting to implement this; I would have assumed rich was more mature than tqdm.

Feel free to close this ticket then; I'll have to deal with it internally. Could you please point me to where the tqdm implementation lives in the codebase? I'd really appreciate it.

Best Regards

amyeroberts commented 6 months ago

Hi @alexge233,

OK, I'll close this issue.

> Could you please point me to where the tqdm implementation lives in the codebase? I'd really appreciate it.

Throughout most of transformers we just use either `from tqdm import tqdm` or `from tqdm.auto import tqdm`. There's also a custom tqdm class defined in the logging utilities (`src/transformers/utils/logging.py`).
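
A sketch of that pattern (the progress-bar toggles are helpers exposed by `transformers.utils.logging`; the loop is just an illustration):

```python
# Sketch: transformers wraps tqdm in its logging utilities so progress bars
# can be toggled globally; library code then imports tqdm as usual.
from transformers.utils import logging
from tqdm.auto import tqdm

logging.disable_progress_bar()  # silence bars globally, e.g. in cloud logs
logging.enable_progress_bar()   # re-enable them

for _ in tqdm(range(10), desc="illustration"):
    pass
```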