Ray tune hyperparameters search error

howardlau1999 commented 3 years ago

Environment info

transformers version: 4.1.0.dev0
Platform: Linux-4.4.0-139-generic-x86_64-with-glibc2.10
Python version: 3.8.5
PyTorch version (GPU?): 1.7.1 (True)
Tensorflow version (GPU?): 2.3.1 (True)
Using GPU in script?: Yes
Using distributed or parallel set-up in script?: Yes

Who can help

@sgugger

Information

Model I am using (Bert, XLNet ...): Roberta-large

The problem arises when using:

[x] my own modified scripts: (give details below)

The task I am working on is:

[x] an official GLUE/SQUaD task: GLUE SST-2

To reproduce

Steps to reproduce the behavior:

I wanted to run a hyperparameter search, so I followed https://huggingface.co/blog/ray-tune and modified examples/text-classification/run_glue.py, replacing the training part with:

def model_init():
    model = AutoModelForSequenceClassification.from_pretrained(
        model_args.model_name_or_path,
        from_tf=bool(".ckpt" in model_args.model_name_or_path),
        config=config,
        cache_dir=model_args.cache_dir,
    )
    return model
trainer = Trainer(
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset if training_args.do_eval else None,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    # Data collator will default to DataCollatorWithPadding, so we change it if we already did the padding.
    data_collator=default_data_collator if data_args.pad_to_max_length else None,
    model_init=model_init,
)

# Training
if training_args.do_train:
    import ray
    from ray import tune

    ray.init()
    best_trial = trainer.hyperparameter_search(
        # Grid search over three random seeds.
        hp_space=lambda _: {"seed": tune.grid_search([31, 42, 53])},
        direction="maximize",
        backend="ray",
    )
    logger.info("Best run: %s", best_trial)

Run python run_glue.py --model_name_or_path roberta-large --do_train --do_eval --per_gpu_train_batch_size 8 --output_dir hypersearch-0 --task_name sst2 --evaluation_strategy steps --eval_steps 20 --logging_steps 10

Then the script exited with exception:

Traceback (most recent call last):
  File "run_glue.py", line 428, in <module>
    main()
  File "run_glue.py", line 359, in main
    best_trial = trainer.hyperparameter_search(
  File "/data1/howard/transformers/src/transformers/trainer.py", line 1039, in hyperparameter_search
    best_run = run_hp_search(self, n_trials, direction, **kwargs)
  File "/data1/howard/transformers/src/transformers/integrations.py", line 241, in run_hp_search_ray
    analysis = ray.tune.run(_objective, config=trainer.hp_space(None), num_samples=n_trials, **kwargs)
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/ray/tune/tune.py", line 299, in run
    experiments[i] = Experiment(
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/ray/tune/experiment.py", line 138, in __init__
    self._run_identifier = Experiment.register_if_needed(run)
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/ray/tune/experiment.py", line 276, in register_if_needed
    register_trainable(name, run_object)
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/ray/tune/registry.py", line 71, in register_trainable
    _global_registry.register(TRAINABLE_CLASS, name, trainable)
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/ray/tune/registry.py", line 124, in register
    self.flush_values()
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/ray/tune/registry.py", line 146, in flush_values
    _internal_kv_put(_make_key(category, key), value, overwrite=True)
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/ray/experimental/internal_kv.py", line 27, in _internal_kv_put
    updated = worker.redis_client.hset(key, "value", value)
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/redis/client.py", line 3004, in hset
    return self.execute_command('HSET', name, key, value)
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/redis/client.py", line 877, in execute_command
    conn.send_command(*args)
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/redis/connection.py", line 720, in send_command
    self.send_packed_command(self.pack_command(*args),
  File "/home/howard/anaconda3/envs/transformers/lib/python3.8/site-packages/redis/connection.py", line 712, in send_packed_command
    raise ConnectionError("Error %s while writing to socket. %s." %
redis.exceptions.ConnectionError: Error 104 while writing to socket. Connection reset by peer.

Expected behavior

The script should run without errors.

https://ray.readthedocs.io/en/latest/tune-usage.html#handling-large-datasets

howardlau1999 commented 3 years ago

I googled the error, and it may be related to sending a large object to Redis. Could it be that the datasets are too large?

LysandreJik commented 3 years ago

Hi! Did you try to open an issue at ray directly? It seems to be linked to their library rather than transformers

howardlau1999 commented 3 years ago

> Hi! Did you try to open an issue at ray directly? It seems to be linked to their library rather than transformers

I googled and found a related issue: https://github.com/ray-project/ray/issues/2931. According to the replies, the solution is described at https://ray.readthedocs.io/en/latest/tune-usage.html#handling-large-datasets

But I don't know how to apply tune.with_parameters here, since Trainer calls tune.run internally. Maybe the Trainer should take care of this?

sgugger commented 3 years ago

It looks like something way too complex to implement, so I'd suggest trying optuna to see if you have the same problem, or re-implementing your own loop to use ray.tune for this. I don't think it can be supported easily by Trainer, and the documentation on the ray side is a bit too sparse on this subject to help us do it ourselves.

solatis commented 3 years ago

I have the same issue, and Optuna seems to be working fine. I think the biggest difference is that Optuna uses SQLite / in-memory, where Ray wants to send a (very large) object to Redis.
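For readers who want to try the Optuna route suggested above, here is a minimal sketch. It assumes a trainer built with model_init as in the original post; seed_space and search_with_optuna are hypothetical helper names. Note that suggest_categorical samples values rather than enumerating a grid, so three trials are not guaranteed to cover all three seeds.

```python
def seed_space(trial):
    """Optuna-style search space over the three seeds from the original post."""
    return {"seed": trial.suggest_categorical("seed", [31, 42, 53])}


def search_with_optuna(trainer, n_trials=3):
    """Run Trainer.hyperparameter_search with the Optuna backend.

    `trainer` is assumed to be a transformers.Trainer created with
    `model_init=...`, as in the modified run_glue.py above.
    """
    return trainer.hyperparameter_search(
        hp_space=seed_space,
        direction="maximize",
        backend="optuna",
        n_trials=n_trials,
    )
```

Calling search_with_optuna(trainer) in place of the Ray-based call should return the best run in the same way, provided optuna is installed.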

naiarapm commented 3 years ago

I don't have a solution for this problem, but for others who might run into it: I tried the proposed solution (passing the arguments to tune.run via ray.tune.with_parameters in run_hp_search_ray), but the results were exactly the same. From what I have been able to gather, the problem arises with models bigger than 512M, not from the datasets.
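One rough way to sanity-check the size hypothesis above: float32 weights take about four bytes per parameter. The sketch below applies that rule of thumb. The parameter counts are approximate public figures, not measurements from this thread, and the 512 MB threshold is the commenter's estimate rather than a confirmed limit.

```python
BYTES_PER_FP32_PARAM = 4
THRESHOLD_MB = 512  # threshold suggested in the comment above (an estimate)


def fp32_size_mb(num_params):
    """Approximate size of float32 weights in MB (1 MB = 1024**2 bytes)."""
    return num_params * BYTES_PER_FP32_PARAM / 1024**2


# Approximate parameter counts (assumptions, rounded public figures).
approx_params = {
    "distilbert-base-uncased": 66_000_000,
    "roberta-large": 355_000_000,
}

for name, count in approx_params.items():
    size = fp32_size_mb(count)
    side = "above" if size > THRESHOLD_MB else "below"
    print(f"{name}: ~{size:.0f} MB ({side} {THRESHOLD_MB} MB)")
```

Under these assumptions, distilbert-base-uncased falls well below the threshold and roberta-large well above it, which matches the failures reported for roberta-large in this thread.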

richardliaw commented 3 years ago

hey folks, this should be working on the latest version of ray -- could you try installing the newest version via pip install -U ray and trying again?

naiarapm commented 3 years ago

> hey folks, this should be working on the latest version of ray -- could you try installing the newest version via pip install -U ray and trying again?

Hi @richardliaw! After updating ray to the latest version (1.1.0), it still isn't working for me, although the exception stack trace has changed a little (prior to this, I got the same exception as @howardlau1999 in their first comment):

  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/connection.py", line 706, in send_packed_command
    sendall(self._sock, item)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/_compat.py", line 9, in sendall
    return sock.sendall(*args, **kwargs)
BrokenPipeError: [Errno 32] Broken pipe

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/DATA/nperez/PROJECTS/DNG/src/system/train_span_in_context.py", line 266, in <module>
    main()
  File "/DATA/nperez/PROJECTS/DNG/src/system/train_span_in_context.py", line 142, in main
    local_dir='/DATA/nperez/PROJECTS/DNG/hsearch/ray-search/'
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/transformers/trainer.py", line 979, in hyperparameter_search
    best_run = run_hp_search(self, n_trials, direction, **kwargs)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/transformers/integrations.py", line 187, in run_hp_search_ray
    analysis = ray.tune.run(_objective, config=trainer.hp_space(None), num_samples=n_trials, **kwargs)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/tune.py", line 325, in run
    restore=restore)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/experiment.py", line 149, in __init__
    self._run_identifier = Experiment.register_if_needed(run)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/experiment.py", line 287, in register_if_needed
    register_trainable(name, run_object)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/registry.py", line 71, in register_trainable
    _global_registry.register(TRAINABLE_CLASS, name, trainable)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/registry.py", line 124, in register
    self.flush_values()
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/registry.py", line 146, in flush_values
    _internal_kv_put(_make_key(category, key), value, overwrite=True)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/experimental/internal_kv.py", line 27, in _internal_kv_put
    updated = worker.redis_client.hset(key, "value", value)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/client.py", line 3050, in hset
    return self.execute_command('HSET', name, *items)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/client.py", line 900, in execute_command
    conn.send_command(*args)
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/connection.py", line 726, in send_command
    check_health=kwargs.get('check_health', True))
  File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/connection.py", line 718, in send_packed_command
    (errno, errmsg))
redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.

To be specific, in case it helps, I've been able to make hyperparameter search work for the following pre-trained models (both before and after updating ray):

dccuchile/bert-base-spanish-wwm-cased
allenai/scibert_scivocab_cased
skimai/spanberta-base-cased
distilbert-base-uncased

But not these:

bert-base-multilingual-cased
xlm-roberta-base

howardlau1999 commented 3 years ago

I couldn't get ray tune working either for roberta-large after upgrading ray to version 1.1.0 @richardliaw

richardliaw commented 3 years ago

Got it! I'll take a closer look this week. Thanks!

krfricke commented 3 years ago

Thanks for raising this issue. I could reproduce it (with roberta-large) on an AWS p2.xlarge instance. I created a PR that should fix this issue via tune.with_parameters: https://github.com/huggingface/transformers/pull/9749

@naiarapm it would be interesting to see what you did differently in your attempt to use tune.with_parameters. Do you still have that piece of code available? We designed this utility exactly for handling large datasets, and it worked for me in my experiments.

If you have the chance @howardlau1999 it would be great if you could check if this fixes your issue.

howardlau1999 commented 3 years ago

@krfricke Big thanks for your fix! I checked out your branch, and hyperparameter search with ray now works for me with roberta-large!

naiarapm commented 3 years ago

Hi @krfricke!

Sorry for the delay. In response to your question, I simply changed the following line in transformers/integrations.py (function run_hp_search_ray):

analysis = ray.tune.run(_objective, config=trainer.hp_space(None), num_samples=n_trials, **kwargs)

to this:

analysis = ray.tune.run(
    ray.tune.with_parameters(_objective),
    config=trainer.hp_space(None), num_samples=n_trials, **kwargs
)

I see now in your PR that that alone was not enough though :-) But I did not know what else to change, I just followed the suggested instructions to the best of my ability.

I can confirm as well that the error has been fixed for me. Thanks a lot!!
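A toy illustration of why wrapping the function alone may change nothing. The idea behind passing parameters separately is that large objects are stored once and the function carries only small references to them. The sketch below imitates that with a plain dict. OBJECT_STORE, put, and with_parameters_sketch are hypothetical names, and this is not Ray's implementation.

```python
import pickle
import uuid

# Toy stand-in for a shared object store (an assumption for illustration).
OBJECT_STORE = {}


def put(obj):
    """Store obj once and return a small string key that refers to it."""
    key = uuid.uuid4().hex
    OBJECT_STORE[key] = obj
    return key


def with_parameters_sketch(fn, **large_kwargs):
    """Return (wrapper, refs); the wrapper resolves refs only when called."""
    refs = {name: put(value) for name, value in large_kwargs.items()}

    def wrapper(config):
        resolved = {name: OBJECT_STORE[key] for name, key in refs.items()}
        return fn(config, **resolved)

    return wrapper, refs


def objective(config, weights=None):
    # Hypothetical objective: combine a config value with the payload size.
    return config["seed"] + len(weights)


fake_weights = bytes(10 * 1024 * 1024)  # 10 MiB placeholder for model weights
wrapped, refs = with_parameters_sketch(objective, weights=fake_weights)

embedded_size = len(pickle.dumps({"weights": fake_weights}))
reference_size = len(pickle.dumps(refs))
print(f"embedded: {embedded_size} bytes, by reference: {reference_size} bytes")
```

With no keyword arguments, refs is empty, so whatever the function itself captures still travels with it. That would be consistent with the unchanged error reported above when only the bare function was wrapped.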
