ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Hyperparameter tuning with Conversational AI Models #16878

Closed ericharper closed 3 years ago

ericharper commented 3 years ago

Hi,

We have multiple NeMo users that are interested in using Ray Tune with PyTorch Lightning. NeMo also uses PTL so it's a natural idea to leverage Ray Tune with NeMo: https://github.com/NVIDIA/NeMo/issues/2442, https://github.com/NVIDIA/NeMo/issues/2376

However, there is an issue with the model not being able to be pickled. Can Ray Tune + PTL be used with the large models that are used in Conversational AI?

richardliaw commented 3 years ago

Hey @ericharper, thanks for raising this up!

I just took a look at the issues, and it seems like there are two separate threads:

  1. Distributed fine-tuning using Ray Lightning Plugins.
  2. Distributed hyperparameter tuning using Ray Tune.

RE: distributed fine-tuning with Ray Lightning, it seems like there's some sort of serialization error for NeMo. Lightning requires the model to be instantiated before parallelization, and Ray will transfer the model through RPC, serializing it with cloudpickle. What's the fundamental reason the model can't be pickled? Is it some IO mechanism or an underlying library?
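
(As a side note, a quick way to check whether a model survives that path is to round-trip it through cloudpickle yourself. This is an illustrative sketch rather than part of the original exchange; `nemo_model` is a placeholder for whatever object fails to transfer.)

# Sketch: round-trip an object through cloudpickle, the same serializer Ray
# uses when shipping objects over RPC.
import cloudpickle

def is_cloudpicklable(obj) -> bool:
    try:
        cloudpickle.loads(cloudpickle.dumps(obj))
        return True
    except Exception as exc:  # e.g. TypeError: can't pickle _thread.lock objects
        print(f"Not serializable: {exc!r}")
        return False

# is_cloudpicklable(nemo_model)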

cc @amogkam

RE: hyperparameter tuning, users should be able to use Ray Tune without running into the pickle problem. Specifically, we require users to instantiate the model in the function, so you would never need to move it around:

def train(hyperparameters):
    model = create_nemo_model(hyperparameters)
    return model.train()

ray.tune.run(train, resources_per_trial={"gpu": 1})
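
(Filling in the pattern a bit, as an illustrative sketch rather than a drop-in recipe: the sampled hyperparameters arrive through the config dict that Tune passes to the function, and metrics go back via tune.report. Here `create_nemo_model` is a placeholder for your model factory, and `model.train()` is assumed to return the metric you want to optimize.)

from ray import tune

def train(config):
    # The model is built inside the trial, so it never has to be pickled.
    model = create_nemo_model(config)
    wer = model.train()      # assumption: returns the metric to optimize
    tune.report(wer=wer)     # hand the result back to Tune

tune.run(
    train,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    resources_per_trial={"gpu": 1},
)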

I hope that gives a bit of clarity to answer your question! Happy to help out any way we can.

Amels404 commented 3 years ago

Thank you @ericharper for opening this! @richard4912 I'm actually more interested in the hyperparameter tuning part on my side.
I've tried to follow the suggestions in https://docs.ray.io/en/master/tune/tutorials/tune-pytorch-lightning.html. The issue is that I get an error telling me:

TypeError: can't pickle _thread.lock objects
Other options:
- Try reproducing the issue by calling `pickle.dumps(trainable)`.
- If the error is typing-related, try removing the type annotations and try again.

I tried to add the pickle.dumps(trainable) part but that did not work. Thank you!

richardliaw commented 3 years ago

@Amels404 do you mind posting your stack trace and also your training function?

malloc-naski commented 3 years ago

Thank you @richardliaw and @ericharper. My issue is related to BERT pretraining with ray_lightning and the BERTLMModel not being serializable apparently. I discussed this with @amogkam who suggested asking NeMo's team about it (#2376).

Amels404 commented 3 years ago

Yes sure, @richard4912

The cfg is a yaml file containing the parameters of the model.

@hydra_runner(config_path="configs", config_name=CONFIG_NAME)
def create_nemo_model(cfg):
    logging.info(f"Hydra config: {OmegaConf.to_yaml(cfg)}")
    callbacks = [
        TuneReportCallback(
            {"wer": "val_wer"}, on="validation_end"
        )
    ]
    trainer = pl.Trainer(**cfg.trainer, callbacks=callbacks)
    asr_model = EncDecCTCModel(cfg=cfg.model, trainer=trainer)
    return trainer.fit(asr_model)

def train(config):
    model = create_nemo_model(config)
    return model.train()

config = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

ray.tune.run(train, config=config)

Amels404 commented 3 years ago

I'm sharing the trace using a better format. @richardliaw @amogkam

Thank you!

[NeMo W 2021-07-09 10:20:00 nemo_logging:349] /home/xxx/anaconda3/envs/xxx/lib/python3.6/site-packages/torchaudio/backend/utils.py:54: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
    '"sox" backend is being deprecated. '
[NeMo W 2021-07-09 10:20:00 nemo_logging:349] /home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/autoscaler/_private/cli_logger.py:61: FutureWarning: Not all Ray CLI dependencies were found. In Ray 1.4+, the Ray CLI, autoscaler, and dashboard will only be usable via `pip install 'ray[default]'`. Please update your install command.
    "update your install command.", FutureWarning)
2021-07-09 10:20:01,468 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-07-09 10:20:03,328 WARNING function_runner.py:545 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
================================================================================
Checking Serializability of <class 'ray.tune.function_runner.wrap_function.<locals>.ImplicitFunc'>
================================================================================
!!! FAIL serialization: can't pickle _thread.lock objects
Serializing '__init__' <function Trainable.__init__ at 0x7f00225c6158>...
Serializing '_close_logfiles' <function Trainable._close_logfiles at 0x7f00225c6a60>...
Serializing '_create_logger' <function Trainable._create_logger at 0x7f00225c6950>...
Serializing '_export_model' <function Trainable._export_model at 0x7f00225c7268>...
Serializing '_implements_method' <function Trainable._implements_method at 0x7f00225c72f0>...
Serializing '_open_logfiles' <function Trainable._open_logfiles at 0x7f00225c69d8>...
Serializing '_report_thread_runner_error' <function FunctionRunner._report_thread_runner_error at 0x7f002251bae8>...
Serializing '_start' <function FunctionRunner._start at 0x7f002251b620>...
Serializing '_trainable_func' <function wrap_function.<locals>.ImplicitFunc._trainable_func at 0x7f00211c71e0>...
!!! FAIL serialization: can't pickle _thread.lock objects
Detected 3 global variables. Checking serializability...
    Serializing 'partial' <class 'functools.partial'>...
    Serializing 'inspect' <module 'inspect' from '/home/xxx/anaconda3/envs/xxx/lib/python3.6/inspect.py'>...
    Serializing 'RESULT_DUPLICATE' __duplicate__...
Detected 3 nonlocal variables. Checking serializability...
    Serializing 'train_func' <function train at 0x7f0143dc41e0>...
    !!! FAIL serialization: can't pickle _thread.lock objects
    Detected 2 global variables. Checking serializability...
        Serializing 'create_nemo_model' <function create_nemo_model at 0x7f0022457950>...
        !!! FAIL serialization: can't pickle _thread.lock objects
Serializing '_close_logfiles' <function Trainable._close_logfiles at 0x7f00225c6a60>...
================================================================================
Variable:

    FailTuple(create_nemo_model [obj=<function create_nemo_model at 0x7f0022457950>, parent=<function train at 0x7f0143dc41e0>])

was found to be non-serializable. There may be multiple other undetected variables that were non-serializable.
Consider either removing the instantiation/imports of these variables or moving the instantiation into the scope of the function/class.
If you have any suggestions on how to improve this error message, please reach out to the Ray developers on github.com/ray-project/ray/issues/
================================================================================
Traceback (most recent call last):
  File "xxx/asr/ray_tune_nemo.py", line 51, in <module>
    ray.tune.run(train, config=params)
  File "/home/xxxx/anaconda3/envs/xxx/lib/python3.6/site-packages/ray/tune/tune.py", line 410, in run
    restore=restore)
  File "/home/xxx/anaconda3/envs/xxx/lib/python3.6/site-packages/ray/tune/experiment.py", line 151, in __init__
    self._run_identifier = Experiment.register_if_needed(run)
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/tune/experiment.py", line 303, in register_if_needed
    raise type(e)(str(e) + " " + extra_msg) from None
TypeError: can't pickle _thread.lock objects
Other options:
- Try reproducing the issue by calling `pickle.dumps(trainable)`.
- If the error is typing-related, try removing the type annotations and try again.

richardliaw commented 3 years ago

Hmm, just to diagnose the issue, can you try commenting out hydra_runner?


# @hydra_runner(config_path="configs", config_name=CONFIG_NAME)
def create_nemo_model(cfg):
    logging.info(f"Hydra config: {OmegaConf.to_yaml(cfg)}")
    callbacks = [
        TuneReportCallback(
            {"wer": "val_wer"}, on="validation_end"
        )
    ]
    trainer = pl.Trainer(**cfg.trainer,
        callbacks=callbacks)
    asr_model = EncDecCTCModel(cfg=cfg.model, trainer=trainer)
    return trainer.fit(asr_model)

def train(config):
    model = create_nemo_model(config)
    return model.train()

config = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

ray.tune.run(train, config=config)

richardliaw commented 3 years ago

The main issue is that something referenced by create_nemo_model is not serializable. I'm not sure what it is, but to fix it you will either need to disable Tune's parallelism or make the function serializable: see cloudpickle for more details.
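
(If it helps to narrow down which captured variable is the culprit, Ray ships a small inspector for exactly this; it is the same utility that produced the "Checking Serializability" output above. A sketch, assuming a Ray version that exposes ray.util.inspect_serializability:)

# Sketch: walk the globals/closure of the trainable and report which captured
# object fails to pickle (the same check that printed "FAIL serialization" above).
from ray.util import inspect_serializability

inspect_serializability(train, name="train")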

Amels404 commented 3 years ago

Thanks @richardliaw. I tried as you suggested; I think the issue was related to the logging part. Then I got another error: "ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception". I tried disabling distributed training and moving the config file reading into an outer function, but that did not seem to resolve the issue.

Thank you!

2021-07-13 12:03:03,873 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-07-13 12:03:05,755 WARNING function_runner.py:546 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2021-07-13 12:03:05,792 WARNING tune.py:494 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
== Status ==
Memory usage on this node: 11.5/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/3.18 GiB heap, 0.0/1.59 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/xxxx/ray_results/train_2021-07-13_12-03-05
Number of trials: 1/1 (1 PENDING)
(pid=21923) [NeMo W 2021-07-13 12:03:09 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
(pid=21923) ################################################################################
(pid=21923) ### WARNING, path does not exist: KALDI_ROOT=/mnt/matylda5/iveselyk/Tools/kaldi-trunk
(pid=21923) ### (please add 'export KALDI_ROOT=' in your $HOME/.profile)
(pid=21923) ### (or run as: KALDI_ROOT= python .py)
(pid=21923) ################################################################################
(pid=21923)
(pid=21923) [NeMo W 2021-07-13 12:03:09 nemo_logging:349] /home/xxx/anaconda3/envs/xxx/lib/python3.6/site-packages/torchaudio/backend/utils.py:54: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
(pid=21923) '"sox" backend is being deprecated. '
(pid=21923)
(pid=21923) usage: default_worker.py [--help] [--hydra-help] [--version]
(pid=21923)                          [--cfg {job,hydra,all}] [--package PACKAGE] [--run]
(pid=21923)                          [--multirun] [--shell-completion]
(pid=21923)                          [--config-path CONFIG_PATH]
(pid=21923)                          [--config-name CONFIG_NAME] [--config-dir CONFIG_DIR]
(pid=21923)                          [--info]
(pid=21923)                          [overrides [overrides ...]]
(pid=21923) default_worker.py: error: unrecognized arguments: --node-ip-address=192.168.88.54 --node-manager-port=34665 --object-store-name=/tmp/ray/session_2021-07-13_12-03-02_647923_21784/sockets/plasma_store --raylet-name=/tmp/ray/session_2021-07-13_12-03-02_647923_21784/sockets/raylet --redis-address=192.168.88.54:6379 --temp-dir=/tmp/ray --metrics-agent-port=59000 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000 --runtime-env-hash=0
2021-07-13 12:03:10,535 ERROR trial_runner.py:748 -- Trial train_e65b5_00000: Error processing event.
Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/xxx/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 718, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/xxxx/anaconda3/envs/xxx/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 688, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/xxxx/anaconda3/envs/xxx/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/worker.py", line 1494, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=21923, ip=192.168.88.54)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "/home/xxx/anaconda3/envs/xxx/lib/python3.6/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/tune/trainable.py", line 173, in train_buffered
    result = self.train()
  File "/home/xxxx/anaconda3/envs/xxx/lib/python3.6/site-packages/ray/tune/trainable.py", line 232, in train
    result = self.step()
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/tune/function_runner.py", line 374, in step
    ("Wrapped function ran until completion without reporting "
ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception.
The trial train_e65b5_00000 errored with parameters={'lr': 0.002076368644084128, 'batch_size': 64}. Error file: /home/xxx/ray_results/train_2021-07-13_12-03-05/train_e65b5_00000_0_batch_size=64,lr=0.0020764_2021-07-13_12-03-05/error.txt
== Status ==
Memory usage on this node: 12.0/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/3.18 GiB heap, 0.0/1.59 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/xxxx/ray_results/train_2021-07-13_12-03-05
Number of trials: 1/1 (1 ERROR)
+-------------------+----------+-------+--------------+------------+
| Trial name        | status   | loc   |   batch_size |         lr |
|-------------------+----------+-------+--------------+------------|
| train_e65b5_00000 | ERROR    |       |           64 | 0.00207637 |
+-------------------+----------+-------+--------------+------------+
Number of errored trials: 1
+-------------------+--------------+--------------------------------------------------------------------------------------------------------------------------------+
| Trial name        |   # failures | error file                                                                                                                     |
|-------------------+--------------+--------------------------------------------------------------------------------------------------------------------------------|
| train_e65b5_00000 |            1 | /home/xxxx/ray_results/train_2021-07-13_12-03-05/train_e65b5_00000_0_batch_size=64,lr=0.0020764_2021-07-13_12-03-05/error.txt |
+-------------------+--------------+--------------------------------------------------------------------------------------------------------------------------------+
Traceback (most recent call last):
  File "xxx/asr/ray_tune_nemo.py", line 63, in <module>
    ray.tune.run(train, config=params, num_samples=1, resources_per_trial={"cpu": 1}, verbose=2)
  File "/home/xxx/anaconda3/envs/xxx/lib/python3.6/site-packages/ray/tune/tune.py", line 543, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_e65b5_00000])

richardliaw commented 3 years ago

OK, so this error comes up because hydra_runner seems to fail inside your training job.

Is the primary benefit of using Hydra to convert YAML into a structured config object? Hydra generally assumes it is a top-level construct (i.e., put the Tune call inside the Hydra entry point).

Amels404 commented 3 years ago

Thank you for your reply,

I'm not sure if this is what you meant by putting Tune inside the Hydra call; I think I'm missing something, because I still have the same issue.

def create_nemo_model():   
    cfg, config = import_conf()
    callbacks = [
        TuneReportCallback(
            {"wer": "val_wer"}, on="validation_end"
        )
    ]
    trainer = pl.Trainer(config,
        logger=TensorBoardLogger(
            save_dir=tune.get_trial_dir(), name="", version="."),
        progress_bar_refresh_rate=0,
        callbacks=callbacks)
    asr_model = EncDecCTCModel(cfg=cfg.model, trainer=trainer)
    return trainer.fit(asr_model)

@hydra_runner(config_path="configs", config_name=CONFIG_NAME)
def import_conf(cfg):
    print(f"Hydra config: {OmegaConf.to_yaml(cfg)}") 
    a_yaml_file = open("/path/to/model.yaml")
    config = yaml.load(a_yaml_file, Loader=yaml.FullLoader)
    ray.tune.run(train, config=params)
    return cfg, config

params = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([1]),
}

def train(params):
   model = create_nemo_model()
   return model.train()

train(params)
richardliaw commented 3 years ago

Hmm this one looks more promising to me. Can you post the stack trace? BTW, use three backticks to format your code: ```

Amels404 commented 3 years ago

Yes, sure, thanks for the note! here is the stack trace:

2021-07-15 16:01:30,268 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-07-15 16:01:32,193 WARNING function_runner.py:546 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2021-07-15 16:01:32,301 WARNING tune.py:494 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
== Status ==
Memory usage on this node: 5.2/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/6.93 GiB heap, 0.0/3.46 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/xxx/ray_results/train_2021-07-15_16-01-32
Number of trials: 1/1 (1 PENDING)
+-------------------+----------+-------+--------------+-------------+
| Trial name        | status   | loc   |   batch_size |          lr |
|-------------------+----------+-------+--------------+-------------|
| train_8a807_00000 | PENDING  |       |            1 | 0.000930923 |
+-------------------+----------+-------+--------------+-------------+
(pid=26250) [NeMo W 2021-07-15 16:01:34 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
(pid=26250) ################################################################################
(pid=26250) ### WARNING, path does not exist: KALDI_ROOT=/mnt/matylda5/iveselyk/Tools/kaldi-trunk
(pid=26250) ### (please add 'export KALDI_ROOT=' in your $HOME/.profile)
(pid=26250) ### (or run as: KALDI_ROOT= python .py)
(pid=26250) ################################################################################
(pid=26250)
(pid=26250) [NeMo W 2021-07-15 16:01:34 nemo_logging:349] /home/xxxx/anaconda3/envs/aaj/lib/python3.6/site-packages/torchaudio/backend/utils.py:54: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
(pid=26250) '"sox" backend is being deprecated. '
(pid=26250)
(pid=26250) usage: default_worker.py [--help] [--hydra-help] [--version]
(pid=26250)                          [--cfg {job,hydra,all}] [--package PACKAGE] [--run]
(pid=26250)                          [--multirun] [--shell-completion]
(pid=26250)                          [--config-path CONFIG_PATH]
(pid=26250)                          [--config-name CONFIG_NAME] [--config-dir CONFIG_DIR]
(pid=26250)                          [--info]
(pid=26250)                          [overrides [overrides ...]]
(pid=26250) default_worker.py: error: unrecognized arguments: --node-ip-address=192.168.88.54 --node-manager-port=45821 --object-store-name=/tmp/ray/session_2021-07-15_16-01-29_079861_26116/sockets/plasma_store --raylet-name=/tmp/ray/session_2021-07-15_16-01-29_079861_26116/sockets/raylet --redis-address=192.168.88.54:6379 --temp-dir=/tmp/ray --metrics-agent-port=53112 --logging-rotate-bytes=536870912 --logging-rotate-backup-count=5 --redis-password=5241590000000000 --runtime-env-hash=0
2021-07-15 16:01:36,100 ERROR trial_runner.py:748 -- Trial train_8a807_00000: Error processing event.
Traceback (most recent call last):
  File "/home/xxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 718, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 688, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/worker.py", line 1494, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=26250, ip=192.168.88.54)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/xxxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/tune/trainable.py", line 173, in train_buffered
    result = self.train()
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/tune/trainable.py", line 232, in train
    result = self.step()
  File "/home/xxxx/anaconda3/envs/xxxx/lib/python3.6/site-packages/ray/tune/function_runner.py", line 374, in step
    ("Wrapped function ran until completion without reporting "
ray.tune.error.TuneError: Wrapped function ran until completion without reporting results or raising an exception.
Result for train_8a807_00000: {}
== Status ==
Memory usage on this node: 5.7/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/6.93 GiB heap, 0.0/3.46 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/xxxx/ray_results/train_2021-07-15_16-01-32
Number of trials: 1/1 (1 ERROR)
+-------------------+----------+-------+--------------+-------------+
| Trial name        | status   | loc   |   batch_size |          lr |
|-------------------+----------+-------+--------------+-------------|
| train_8a807_00000 | ERROR    |       |            1 | 0.000930923 |
+-------------------+----------+-------+--------------+-------------+
Number of errored trials: 1
+-------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------+
| Trial name        |   # failures | error file                                                                                                                  |
|-------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------|
| train_8a807_00000 |            1 | /home/x/ray_results/train_2021-07-15_16-01-32/train_8a807_00000_0_batch_size=1,lr=0.00093092_2021-07-15_16-01-32/error.txt |
+-------------------+--------------+-----------------------------------------------------------------------------------------------------------------------------+
Traceback (most recent call last):
  File "aaj/asr/ray_tune_nemo.py", line 59, in <module>
    train(params)
  File "aaj/asr/ray_tune_nemo.py", line 56, in train
    model = create_nemo_model()
  File "aaj/asr/ray_tune_nemo.py", line 27, in create_nemo_model
    cfg, config = import_conf()
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/nemo/core/config/hydra_runner.py", line 103, in wrapper
    strict=None,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 347, in _run_hydra
    lambda: hydra.run(
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 350, in <lambda>
    overrides=args.overrides,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/hydra.py", line 112, in run
    configure_logging=with_log_configuration,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/core/utils.py", line 125, in run_job
    ret.return_value = task_function(task_cfg)
  File "aaj/asr/ray_tune_nemo.py", line 46, in import_conf
    ray.tune.run(train, config=params)
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/ray/tune/tune.py", line 543, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_8a807_00000])

richardliaw commented 3 years ago

@Amels404 what does hydra_runner do? Do you have the source code for that that you can share?

Amels404 commented 3 years ago

Yes, sure @richardliaw, we use hydra_runner from the nemo package that you can find here: https://github.com/NVIDIA/NeMo/blob/v1.0.0rc1/nemo/core/config/hydra_runner.py

and the imports in our file are like this:

from nemo.core.config import hydra_runner
from omegaconf import OmegaConf
richardliaw commented 3 years ago

OK thanks!

@hydra_runner(config_path="configs", config_name=CONFIG_NAME)
def import_conf(cfg):
    print(f"Hydra config: {OmegaConf.to_yaml(cfg)}") 
    a_yaml_file = open("/path/to/model.yaml")
    config = yaml.load(a_yaml_file, Loader=yaml.FullLoader)
    ray.tune.run(train, config=params)
    return cfg, config

params = {
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([1]),
}

Can you tell me more about what this does? I see you have a params dict along with a config dict; how do you want to merge these two?

richardliaw commented 3 years ago

Here's an attempt at getting you a bit farther:

# (no hydra_runner decorator here; it moves to tune_function below)
def create_nemo_model(cfg):
    logging.info(f"Hydra config: {OmegaConf.to_yaml(cfg)}")
    callbacks = [
        TuneReportCallback(
            {"wer": "val_wer"}, on="validation_end"
        )
    ]
    trainer = pl.Trainer(**cfg.trainer,
        callbacks=callbacks)
    asr_model = EncDecCTCModel(cfg=cfg.model, trainer=trainer)
    return trainer.fit(asr_model)

def train(config):
    model = create_nemo_model(config["hydraconfig"])
    return model.train()

@hydra_runner(config_path="configs", config_name=CONFIG_NAME)
def tune_function(cfg):
    # cfg is an OmegaConf object?
    config = {
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    }
    config["hydraconfig"] = cfg
    ray.tune.run(train, config=config)
Amels404 commented 3 years ago

@richardliaw I'm sorry for getting back to you late.

I think the issue I have is how to incorporate the Ray config into the OmegaConf config (precisely an omegaconf.dictconfig.DictConfig). This is an example of the NeMo config file:

https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf

I tried as you suggested but it doesn't seem to work; here is the trace:

Traceback (most recent call last):
  File "x/asr/attempt.py", line 53, in <module>
    tune()
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/nemo/core/config/hydra_runner.py", line 103, in wrapper
    strict=None,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 347, in _run_hydra
    lambda: hydra.run(
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 350, in <lambda>
    overrides=args.overrides,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/hydra.py", line 112, in run
    configure_logging=with_log_configuration,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/core/utils.py", line 125, in run_job
    ret.return_value = task_function(task_cfg)
  File "x/asr/file.py", line 47, in tune
    "lr": tune.loguniform(1e-4, 1e-1),
AttributeError: 'function' object has no attribute 'loguniform'

thanks! 
richardliaw commented 3 years ago

Awesome! I think we're close. It looks like the tune module has been shadowed by a function of the same name.

Could you try:

def create_nemo_model(cfg):
    logging.info(f"Hydra config: {OmegaConf.to_yaml(cfg)}")
    callbacks = [
        TuneReportCallback(
            {"wer": "val_wer"}, on="validation_end"
        )
    ]
    trainer = pl.Trainer(**cfg.trainer,
        callbacks=callbacks)
    asr_model = EncDecCTCModel(cfg=cfg.model, trainer=trainer)
    return trainer.fit(asr_model)

def train(config):
    model = create_nemo_model(config["hydraconfig"])
    return model.train()

@hydra_runner(config_path="configs", config_name=CONFIG_NAME)
def tune_function(cfg):
    # cfg is an OmegaConf object?
    config = {
        "lr": ray.tune.loguniform(1e-4, 1e-1),
        "batch_size": ray.tune.choice([32, 64, 128]),
    }
    config["hydraconfig"] = cfg
    ray.tune.run(train, config=config)
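
(A note on the earlier question about merging the Tune search space into the OmegaConf config: inside train you can write the sampled values back into cfg before building the model. This is only a sketch; the exact keys below, model.optim.lr and model.train_ds.batch_size, are assumptions about a typical NeMo ASR config and may differ for your model.)

from omegaconf import open_dict

def train(config):
    cfg = config["hydraconfig"]
    with open_dict(cfg):  # allow adding/overriding keys on a struct config
        cfg.model.optim.lr = config["lr"]
        cfg.model.train_ds.batch_size = config["batch_size"]
    model = create_nemo_model(cfg)
    return model
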
Amels404 commented 3 years ago

Yes, sure. You can also find the printed results of vars() and dir(); I'm not sure whether they are useful. Thanks!

<function tune at 0x7fd826b67d90>
{'__wrapped__': <function tune at 0x7fd826b67d08>}
['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__wrapped__']
{}
Traceback (most recent call last):
  File "aaj/asr/attempt.py", line 59, in <module>
    tune()
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/nemo/core/config/hydra_runner.py", line 103, in wrapper
    strict=None,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 347, in _run_hydra
    lambda: hydra.run(
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/utils.py", line 350, in <lambda>
    overrides=args.overrides,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/_internal/hydra.py", line 112, in run
    configure_logging=with_log_configuration,
  File "/home/x/anaconda3/envs/x/lib/python3.6/site-packages/hydra/core/utils.py", line 125, in run_job
    ret.return_value = task_function(task_cfg)
  File "x/asr/attempt.py", line 53, in tune
    "lr": tune.loguniform(1e-4, 1e-1),
AttributeError: 'function' object has no attribute 'loguniform'

richardliaw commented 3 years ago

Hey! I posted a new function to run in the previous message. Could you try that again?


Amels404 commented 3 years ago

Sure! I only omitted the config part. Thanks!

<module 'ray.tune' from '/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/__init__.py'>
2021-07-27 17:20:49,778 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-07-27 17:20:51,639 WARNING function_runner.py:546 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
2021-07-27 17:20:51,783 WARNING tune.py:494 -- Tune detects GPUs, but no trials are using GPUs. To enable trials to use GPUs, set tune.run(resources_per_trial={'gpu': 1}...) which allows Tune to expose 1 GPU to each trial. You can also override `Trainable.default_resource_request` if using the Trainable API.
== Status ==
Memory usage on this node: 3.4/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/8.05 GiB heap, 0.0/4.02 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/amel/ray_results/train_2021-07-27_17-20-51
Number of trials: 1/1 (1 PENDING)
+-------------------+----------+-------+--------------+------------+
| Trial name        | status   | loc   |   batch_size |         lr |
|-------------------+----------+-------+--------------+------------|
| train_9c4c3_00000 | PENDING  |       |          128 | 0.00179969 |
+-------------------+----------+-------+--------------+------------+
(pid=19669) [NeMo W 2021-07-27 17:20:54 experimental:28] Module <class 'nemo.collections.asr.data.audio_to_text_dali.AudioToCharDALIDataset'> is experimental, not ready for production and is not fully supported. Use at your own risk.
(pid=19669) ################################################################################
(pid=19669) ### WARNING, path does not exist: KALDI_ROOT=/mnt/matylda5/iveselyk/Tools/kaldi-trunk
(pid=19669) ### (please add 'export KALDI_ROOT=' in your $HOME/.profile)
(pid=19669) ### (or run as: KALDI_ROOT= python .py)
(pid=19669) ################################################################################
(pid=19669)
(pid=19669) [NeMo W 2021-07-27 17:20:54 nemo_logging:349] /home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/torchaudio/backend/utils.py:54: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to https://github.com/pytorch/audio/issues/903 for the detail.
(pid=19669) '"sox" backend is being deprecated. '
(pid=19669)
(pid=19669) Hydra config: name:

(pid=19669) 2021-07-27 17:20:54,823 ERROR function_runner.py:254 -- Runner Thread raised error.
(pid=19669) Traceback (most recent call last):
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=19669)     self._entrypoint()
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=19669)     self._status_reporter.get_checkpoint())
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 581, in _trainable_func
(pid=19669)     output = fn()
(pid=19669)   File "aaj/asr/attempt.py", line 41, in train
(pid=19669)     model = create_nemo_model(config["hydraconfig"])
(pid=19669)   File "aaj/asr/attempt.py", line 36, in create_nemo_model
(pid=19669)     callbacks=callbacks)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
(pid=19669)     return fn(self, **kwargs)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 359, in __init__
(pid=19669)     deterministic,
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 101, in on_trainer_init
(pid=19669)     self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 78, in parse_gpu_ids
(pid=19669)     gpus = _sanitize_gpu_ids(gpus)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=19669)     """)
(pid=19669) pytorch_lightning.utilities.exceptions.MisconfigurationException:
(pid=19669)     You requested GPUs: [0]
(pid=19669)     But your machine only has: []
(pid=19669)
(pid=19669) Exception in thread Thread-2:
(pid=19669) Traceback (most recent call last):
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/threading.py", line 916, in _bootstrap_inner
(pid=19669)     self.run()
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 267, in run
(pid=19669)     raise e
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 248, in run
(pid=19669)     self._entrypoint()
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
(pid=19669)     self._status_reporter.get_checkpoint())
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 581, in _trainable_func
(pid=19669)     output = fn()
(pid=19669)   File "aaj/asr/attempt.py", line 41, in train
(pid=19669)     model = create_nemo_model(config["hydraconfig"])
(pid=19669)   File "aaj/asr/attempt.py", line 36, in create_nemo_model
(pid=19669)     callbacks=callbacks)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
(pid=19669)     return fn(self, **kwargs)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 359, in __init__
(pid=19669)     deterministic,
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 101, in on_trainer_init
(pid=19669)     self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 78, in parse_gpu_ids
(pid=19669)     gpus = _sanitize_gpu_ids(gpus)
(pid=19669)   File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
(pid=19669)     """)
(pid=19669) pytorch_lightning.utilities.exceptions.MisconfigurationException:
(pid=19669)     You requested GPUs: [0]
(pid=19669)     But your machine only has: []
(pid=19669)
(pid=19669) 2021-07-27 17:20:54,957 ERROR trial_runner.py:748 -- Trial train_9c4c3_00000: Error processing event.
Traceback (most recent call last):
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/trial_runner.py", line 718, in _process_trial
    results = self.trial_executor.fetch_result(trial)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/ray_trial_executor.py", line 688, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/worker.py", line 1494, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(TuneError): ray::ImplicitFunc.train_buffered() (pid=19669, ip=192.168.88.54)
  File "python/ray/_raylet.pyx", line 501, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 451, in ray._raylet.execute_task.function_executor
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/_private/function_manager.py", line 563, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/trainable.py", line 173, in train_buffered
    result = self.train()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/trainable.py", line 232, in train
    result = self.step()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 366, in step
    self._report_thread_runner_error(block=True)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 513, in _report_thread_runner_error
    ("Trial raised an exception. Traceback:\n{}".format(err_tb_str)
ray.tune.error.TuneError: Trial raised an exception.
Traceback: ray::ImplicitFunc.train_buffered() (pid=19669, ip=192.168.88.54)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 248, in run
    self._entrypoint()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 316, in entrypoint
    self._status_reporter.get_checkpoint())
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/function_runner.py", line 581, in _trainable_func
    output = fn()
  File "aaj/asr/attempt.py", line 41, in train
    model = create_nemo_model(config["hydraconfig"])
  File "aaj/asr/attempt.py", line 36, in create_nemo_model
    callbacks=callbacks)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/connectors/env_vars_connector.py", line 41, in overwrite_by_env_vars
    return fn(self, **kwargs)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 359, in __init__
    deterministic,
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator_connector.py", line 101, in on_trainer_init
    self.trainer.data_parallel_device_ids = device_parser.parse_gpu_ids(self.trainer.gpus)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 78, in parse_gpu_ids
    gpus = _sanitize_gpu_ids(gpus)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/pytorch_lightning/utilities/device_parser.py", line 142, in _sanitize_gpu_ids
    """)
pytorch_lightning.utilities.exceptions.MisconfigurationException:
You requested GPUs: [0]
But your machine only has: []
Result for train_9c4c3_00000: {}
== Status ==
Memory usage on this node: 3.7/15.6 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/12 CPUs, 0/2 GPUs, 0.0/8.05 GiB heap, 0.0/4.02 GiB objects (0.0/1.0 accelerator_type:GTX)
Result logdir: /home/amel/ray_results/train_2021-07-27_17-20-51
Number of trials: 1/1 (1 ERROR)
+-------------------+----------+-------+--------------+------------+
| Trial name        | status   | loc   |   batch_size |         lr |
|-------------------+----------+-------+--------------+------------|
| train_9c4c3_00000 | ERROR    |       |          128 | 0.00179969 |
+-------------------+----------+-------+--------------+------------+
Number of errored trials: 1
+-------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------+
| Trial name        |   # failures | error file                                                                                                                      |
|-------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------|
| train_9c4c3_00000 |            1 | /home/amel/ray_results/train_2021-07-27_17-20-51/train_9c4c3_00000_0_batch_size=128,lr=0.0017997_2021-07-27_17-20-51/error.txt |
+-------------------+--------------+---------------------------------------------------------------------------------------------------------------------------------+

Traceback (most recent call last):
  File "aaj/asr/attempt.py", line 55, in <module>
    tune_function()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/nemo/core/config/hydra_runner.py", line 103, in wrapper
    strict=None,
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/utils.py", line 347, in _run_hydra
    lambda: hydra.run(
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/utils.py", line 201, in run_and_report
    raise ex
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/utils.py", line 198, in run_and_report
    return func()
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/utils.py", line 350, in <lambda>
    overrides=args.overrides,
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/_internal/hydra.py", line 112, in run
    configure_logging=with_log_configuration,
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/hydra/core/utils.py", line 125, in run_job
    ret.return_value = task_function(task_cfg)
  File "aaj/asr/attempt.py", line 53, in tune_function
    ray.tune.run(train, config=config)
  File "/home/amel/anaconda3/envs/aaj/lib/python3.6/site-packages/ray/tune/tune.py", line 543, in run
    raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [train_1ce37_00000])

richardliaw commented 3 years ago

Looks like great progress! We're quite close:

def create_nemo_model(cfg):
    logging.info(f"Hydra config: {OmegaConf.to_yaml(cfg)}")
    callbacks = [
        TuneReportCallback(
            {"wer": "val_wer"}, on="validation_end"
        )
    ]
    trainer = pl.Trainer(**cfg.trainer,
        callbacks=callbacks)
    asr_model = EncDecCTCModel(cfg=cfg.model, trainer=trainer)
    return trainer.fit(asr_model)

def train(config):
    model = create_nemo_model(config["hydraconfig"])
    return model.train()

@hydra_runner(config_path="configs", config_name=CONFIG_NAME)
def tune_function(cfg):
    # cfg is an OmegaConf object?
    config = {
        "lr": ray.tune.loguniform(1e-4, 1e-1),
        "batch_size": ray.tune.choice([32, 64, 128]),
    }
    config["hydraconfig"] = cfg
    ray.tune.run(train, config=config, resources_per_trial={"gpu": 1})

Try the above? I added a GPU resource request (resources_per_trial) at the bottom.

Amels404 commented 3 years ago

It's working now! I just needed to return the model instead of model.train(). Thank you very much!

richardliaw commented 3 years ago

Closing this, as it looks like the workload is good.