huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

New `--log_level` feature introduces failures using 'passive' mode #12310

Closed: allenwang28 closed this issue 3 years ago

allenwang28 commented 3 years ago

Environment info

Who can help

@stas00 @sgugger

Information

Model I am using (Bert, XLNet ...): XLNet

The problem arises when using: the official example script (`run_glue.py`, launched via `xla_spawn.py` on a Cloud TPU).

The tasks I am working on are: the GLUE MNLI task.

This was captured by Cloud TPU tests (XLNet/MNLI/GLUE), but I think this behavior is model/dataset agnostic. Essentially, it seems that:

  1. The `TrainingArguments` `__post_init__` method should convert `log_level` to `-1` when it is set to `'passive'` (which it is by default).
  2. However, in the end-to-end `run_glue.py` example, `parse_args_into_dataclasses()` does not seem to call `__post_init__`, so the string `'passive'` reaches `logging.set_verbosity` and our tests fail with the traceback below (a sketch of the expected conversion follows the traceback):
    Traceback (most recent call last):
      File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
        _start_fn(index, pf_cfg, fn, args)
      File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
        fn(gindex, *args)
      File "/transformers/examples/pytorch/text-classification/run_glue.py", line 554, in _mp_fn
        main()
      File "/transformers/examples/pytorch/text-classification/run_glue.py", line 468, in main
        data_collator=data_collator,
      File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 295, in __init__
        logging.set_verbosity(log_level)
      File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/utils/logging.py", line 161, in set_verbosity
        _get_library_root_logger().setLevel(verbosity)
      File "/root/anaconda3/envs/pytorch/lib/python3.6/logging/__init__.py", line 1284, in setLevel
        self.level = _checkLevel(level)
      File "/root/anaconda3/envs/pytorch/lib/python3.6/logging/__init__.py", line 195, in _checkLevel
        raise ValueError("Unknown level: %r" % level)
    ValueError: Unknown level: 'passive'
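
For illustration, here is a minimal sketch of the conversion that `__post_init__` is expected to perform. The `LOG_LEVELS` mapping, the `resolve_log_level` helper, and the `-1` sentinel are assumptions made for this example, not the actual transformers implementation:

    import logging

    # Illustrative mapping from --log_level strings to logging constants.
    # 'passive' gets a -1 sentinel meaning "leave the current verbosity unchanged".
    LOG_LEVELS = {
        "debug": logging.DEBUG,
        "info": logging.INFO,
        "warning": logging.WARNING,
        "error": logging.ERROR,
        "critical": logging.CRITICAL,
        "passive": -1,
    }

    def resolve_log_level(log_level: str) -> int:
        """Translate a string --log_level value into an integer for downstream use."""
        return LOG_LEVELS[log_level]

    assert resolve_log_level("passive") == -1
    assert resolve_log_level("info") == logging.INFO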

To reproduce

Steps to reproduce the behavior:

  1. The command we're using is:
    git clone https://github.com/huggingface/transformers.git
    cd transformers && pip install .
    git log -1
    pip install datasets
    python examples/pytorch/xla_spawn.py \
      --num_cores 8 \
      examples/pytorch/text-classification/run_glue.py \
      --logging_dir=./tensorboard-metrics \
      --task_name MNLI \
      --cache_dir ./cache_dir \
      --do_train \
      --do_eval \
      --num_train_epochs 3 \
      --max_seq_length 128 \
      --learning_rate 3e-5 \
      --output_dir MNLI \
      --overwrite_output_dir \
      --logging_steps 30 \
      --save_steps 3000 \
      --overwrite_cache \
      --tpu_metrics_debug \
      --model_name_or_path xlnet-large-cased \
      --per_device_train_batch_size 32 \
      --per_device_eval_batch_size 16

Expected behavior

Training completes without error when `--log_level` is left at its default value of `passive`: the string should be translated to an integer sentinel (or otherwise handled) before it reaches `logging.set_verbosity`, instead of raising `ValueError: Unknown level: 'passive'`.
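
As a rough sketch of the consuming side (assuming the `-1` sentinel from the example above; `apply_log_level` is a hypothetical helper, not the actual `Trainer` code), the verbosity would only be changed when an explicit level was requested:

    from transformers.utils import logging

    def apply_log_level(log_level: int) -> None:
        # -1 stands for 'passive': keep whatever verbosity is already configured.
        if log_level != -1:
            logging.set_verbosity(log_level)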

stas00 commented 3 years ago

Thank you for the report; it will be fixed shortly via https://github.com/huggingface/transformers/pull/12309.

I'm just working on a test; I need another 10 minutes or so.

allenwang28 commented 3 years ago

Thank you for fixing this so quickly!