I googled the error and it may be related to sending a large object to Redis. Could it be because the datasets are too large?
Hi! Did you try to open an issue at ray directly? It seems to be linked to their library rather than transformers
I googled and found some related issues: https://github.com/ray-project/ray/issues/2931 and according to the replies the solution is https://ray.readthedocs.io/en/latest/tune-usage.html#handling-large-datasets
But I don't know how to pass that tune.with_parameters. Maybe the Trainer should take care of this?
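For reference, the pattern the linked docs suggest looks roughly like this (a sketch only; objective, large_dataset, and the reported metric are placeholders, not code from this thread):

from ray import tune

def objective(config, dataset=None):
    # `dataset` is fetched from the Ray object store instead of being
    # pickled into the registered trainable (that pickling is what
    # overflows Redis for large objects).
    score = config["lr"] * len(dataset)  # placeholder metric
    tune.report(objective=score)

large_dataset = list(range(100_000))  # stand-in for a big dataset

analysis = tune.run(
    tune.with_parameters(objective, dataset=large_dataset),
    config={"lr": tune.loguniform(1e-5, 1e-3)},
    num_samples=4,
)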
It looks like something way too complex to implement, so I'd suggest using optuna to see if you have the same problem, or re-implementing your own loop to use ray.tune on this. I don't think it can be supported easily by Trainer, and the documentation on the ray side is a bit too sparse on this subject to help us do it ourselves.
I have the same issue, and Optuna seems to be working fine. I think the biggest difference is that Optuna uses SQLite / in-memory storage, whereas Ray wants to send a (very large) object to Redis.
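For anyone who wants to try that route, switching the search backend is a small change (a sketch, assuming a Trainer that was built with a model_init callable, as hyperparameter_search requires):

# `trainer` is assumed to have been constructed with model_init=...
best_run = trainer.hyperparameter_search(
    backend="optuna",
    direction="maximize",
    n_trials=10,
)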
I don't have a solution for this problem, but just for others that might encounter it: I tried the proposed solution (passing the arguments to tune.run via ray.tune.with_parameters in run_hp_search_ray), but the results were exactly the same. From what I have been able to gather, I would say that the problem arises from models bigger than 512M, not from the datasets (which would line up with Redis's hard 512 MB cap on a single string value).
hey folks, this should be working on the latest version of ray -- could you try installing the newest version via pip install -U ray and trying again?
Hi @richardliaw! After updating ray to the latest version (1.1.0), it still isn't working for me, although the exception stack trace has changed a little (prior to this, I got the same exception as @howardlau1999 in their first comment):
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/connection.py", line 706, in send_packed_command
sendall(self._sock, item)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/_compat.py", line 9, in sendall
return sock.sendall(*args, **kwargs)
BrokenPipeError: [Errno 32] Broken pipe
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/local/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/DATA/nperez/PROJECTS/DNG/src/system/train_span_in_context.py", line 266, in <module>
main()
File "/DATA/nperez/PROJECTS/DNG/src/system/train_span_in_context.py", line 142, in main
local_dir='/DATA/nperez/PROJECTS/DNG/hsearch/ray-search/'
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/transformers/trainer.py", line 979, in hyperparameter_search
best_run = run_hp_search(self, n_trials, direction, **kwargs)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/transformers/integrations.py", line 187, in run_hp_search_ray
analysis = ray.tune.run(_objective, config=trainer.hp_space(None), num_samples=n_trials, **kwargs)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/tune.py", line 325, in run
restore=restore)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/experiment.py", line 149, in __init__
self._run_identifier = Experiment.register_if_needed(run)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/experiment.py", line 287, in register_if_needed
register_trainable(name, run_object)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/registry.py", line 71, in register_trainable
_global_registry.register(TRAINABLE_CLASS, name, trainable)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/registry.py", line 124, in register
self.flush_values()
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/tune/registry.py", line 146, in flush_values
_internal_kv_put(_make_key(category, key), value, overwrite=True)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/ray/experimental/internal_kv.py", line 27, in _internal_kv_put
updated = worker.redis_client.hset(key, "value", value)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/client.py", line 3050, in hset
return self.execute_command('HSET', name, *items)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/client.py", line 900, in execute_command
conn.send_command(*args)
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/connection.py", line 726, in send_command
check_health=kwargs.get('check_health', True))
File "/DATA/nperez/VENV/DNG/lib/python3.7/site-packages/redis/connection.py", line 718, in send_packed_command
(errno, errmsg))
redis.exceptions.ConnectionError: Error 32 while writing to socket. Broken pipe.
To be specific, in case it helps, I've been able to make hyperparameter search work for the following pre-trained models (both before and after updating ray):
But not these:
I couldn't get ray tune working either for roberta-large after upgrading ray to version 1.1.0 @richardliaw
Got it! I'll take a closer look this week. Thanks!
Thanks for raising this issue. I could reproduce it (with roberta-large) on an AWS p2.xlarge instance. I created a PR that should fix this issue via tune.with_parameters: https://github.com/huggingface/transformers/pull/9749
@naiarapm it would be interesting to see what you did differently in your attempt to use tune.with_parameters - do you still have that piece of code available? We designed this utility exactly for handling large datasets, and it worked for me in my experiments.
If you have the chance @howardlau1999 it would be great if you could check if this fixes your issue.
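For context, the approach in the PR appears to be to pass the trainer itself through tune.with_parameters rather than merely wrapping the objective, so the large object travels through Ray's object store instead of the function's closure. A sketch of the idea (not the exact diff; trainer, n_trials and kwargs are the run_hp_search_ray locals):

def _objective(trial, local_trainer):
    # `local_trainer` arrives via the object store, so registering the
    # trainable no longer pickles the whole model into Redis.
    local_trainer.train(trial=trial)

analysis = ray.tune.run(
    ray.tune.with_parameters(_objective, local_trainer=trainer),
    config=trainer.hp_space(None),
    num_samples=n_trials,
    **kwargs,
)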
@krfricke Big thanks for your fix! I checked out your branch and the hyperparameter search with ray now works for me with roberta-large!
Hi @krfricke!
Sorry for the delay. In response to your question, I simply changed the following line in transformers/integrations.py (function run_hp_search_ray):
analysis = ray.tune.run(_objective, config=trainer.hp_space(None), num_samples=n_trials, **kwargs)
to this:
analysis = ray.tune.run(
ray.tune.with_parameters(_objective),
config=trainer.hp_space(None), num_samples=n_trials, **kwargs
)
I see now in your PR that that alone was not enough, though :-) But I did not know what else to change; I just followed the suggested instructions to the best of my ability.
I can confirm as well that the error has been fixed for me. Thanks a lot!!
Environment info
transformers version: 4.1.0.dev0
Who can help
@sgugger
Information
Model I am using (Bert, XLNet ...): Roberta-large
The problem arises when using:
The tasks I am working on are:
To reproduce
Steps to reproduce the behavior:
I wanted to do a hyperparameter search, so I referred to https://huggingface.co/blog/ray-tune and modified examples/text-classification/run_glue.py, replacing the training part.
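The exact replacement snippet did not survive here; roughly, following the blog post, the training call becomes a hyperparameter_search call. A sketch (assuming run_glue.py's surrounding variables such as training_args, train_dataset, eval_dataset and compute_metrics; AutoModelForSequenceClassification and Trainer are already imported there):

def model_init():
    # hyperparameter_search needs a fresh model per trial,
    # so the Trainer gets a model_init instead of a model.
    return AutoModelForSequenceClassification.from_pretrained(
        "roberta-large", num_labels=2
    )

trainer = Trainer(
    model_init=model_init,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(
    backend="ray",
    direction="maximize",
    n_trials=10,
)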
Run:
python run_glue.py --model_name_or_path roberta-large --do_train --do_eval --per_gpu_train_batch_size 8 --output_dir hypersearch-0 --task_name sst2 --evaluation_strategy steps --eval_steps 20 --logging_steps 10
Then the script exited with an exception:
Expected behavior
The script should run without errors.
Related Issues
https://github.com/ray-project/ray/issues/2931
https://ray.readthedocs.io/en/latest/tune-usage.html#handling-large-datasets