ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

Profiling for Ray - error #4052

Closed Anya28 closed 4 years ago

Anya28 commented 5 years ago

Describe the problem

I have been trying to profile Ray code, for which I installed line_profiler. If I run the code as specified in https://ray.readthedocs.io/en/latest/user-profiling.html, I keep getting `NameError: name 'profile' is not defined`. However, if I profile something outside of Ray, line_profiler works perfectly. Has anyone come across this issue?

Source code / logs

Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-02-14_12-08-13_10898/logs.
Waiting for redis server at 127.0.0.1:13695 to respond...
Waiting for redis server at 127.0.0.1:19959 to respond...
Starting Redis shard with 10.0 GB max memory.
Warning: Capping object memory store to 20.0GB. To increase this further, specify `object_store_memory` when calling ray.init() or ray start.
Starting the Plasma object store with 20.0 GB memory using /dev/shm.

======================================================================
View the web UI at http://localhost:8889/notebooks/ray_ui.ipynb?token=07cb5b439dd96c934725214697e0ef02de1872ad3d26fc2b
======================================================================

2019-02-14 12:08:15,340 INFO tune.py:134 -- Tip: to resume incomplete experiments, pass resume='prompt' or resume=True to run_experiments()
2019-02-14 12:08:15,342 INFO tune.py:144 -- Starting a new experiment.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0/0 GPUs
Memory usage on this node: 66.2/137.1 GB

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 3/20 CPUs, 0/0 GPUs
Memory usage on this node: 66.3/137.1 GB
Result logdir: /home/am9114/ray_results/default
RUNNING trials:
 - A3C_SpaceInvadersNoFrameskip-v4_0:   RUNNING

Traceback (most recent call last):
  File "/home/am9114/ray/python/ray/function_manager.py", line 635, in fetch_and_register_actor
    unpickled_class = pickle.loads(pickled_class)
  File "/home/am9114/ray/python/ray/rllib/__init__.py", line 44, in <module>
    _register_all()
  File "/home/am9114/ray/python/ray/rllib/__init__.py", line 35, in _register_all
    from ray.rllib.agents.registry import ALGORITHMS
  File "/home/am9114/ray/python/ray/rllib/agents/__init__.py", line 1, in <module>
    from ray.rllib.agents.agent import Agent, with_common_config
  File "/home/am9114/ray/python/ray/rllib/agents/agent.py", line 20, in <module>
    from ray.rllib.optimizers.policy_optimizer import PolicyOptimizer
  File "/home/am9114/ray/python/ray/rllib/optimizers/__init__.py", line 4, in <module>
    from ray.rllib.optimizers.async_gradients_optimizer import \
  File "/home/am9114/ray/python/ray/rllib/optimizers/async_gradients_optimizer.py", line 11, in <module>
    class AsyncGradientsOptimizer(PolicyOptimizer):
  File "/home/am9114/ray/python/ray/rllib/optimizers/async_gradients_optimizer.py", line 30, in AsyncGradientsOptimizer
    @profile
NameError: name 'profile' is not defined

2019-02-14 12:08:20,539 ERROR trial_runner.py:412 -- Error processing event.
Traceback (most recent call last):
  File "/home/am9114/ray/python/ray/tune/trial_runner.py", line 378, in _process_events
    result = self.trial_executor.fetch_result(trial)
  File "/home/am9114/ray/python/ray/tune/ray_trial_executor.py", line 228, in fetch_result
    result = ray.get(trial_future[0])
  File "/home/am9114/ray/python/ray/worker.py", line 2210, in get
    raise value
ray.worker.RayTaskError: ray_TemporaryActor:train() (pid=10989, host=MSRC-4079441)
Exception: The actor with name A3CAgent failed to be imported, and so cannot execute this method

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0/0 GPUs
Memory usage on this node: 66.5/137.1 GB
Result logdir: /home/am9114/ray_results/default
ERROR trials:
 - A3C_SpaceInvadersNoFrameskip-v4_0:   ERROR, 1 failures: /home/am9114/ray_results/default/A3C_SpaceInvadersNoFrameskip-v4_0_2019-02-14_12-08-159_2rvtap/error_2019-02-14_12-08-20.txt

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/20 CPUs, 0/0 GPUs
Memory usage on this node: 66.5/137.1 GB
Result logdir: /home/am9114/ray_results/default
ERROR trials:
 - A3C_SpaceInvadersNoFrameskip-v4_0:   ERROR, 1 failures: /home/am9114/ray_results/default/A3C_SpaceInvadersNoFrameskip-v4_0_2019-02-14_12-08-159_2rvtap/error_2019-02-14_12-08-20.txt

Wrote profile results to train.py.lprof
Traceback (most recent call last):
  File "/home/am9114/.local/bin/kernprof", line 11, in <module>
    load_entry_point('line-profiler==2.1.1', 'console_scripts', 'kernprof')()
  File "/home/am9114/.local/lib/python3.6/site-packages/kernprof.py", line 222, in main
    execfile(script_file, ns, ns)
  File "/home/am9114/.local/lib/python3.6/site-packages/kernprof.py", line 35, in execfile
    exec_(compile(f.read(), filename, 'exec'), globals, locals)
  File "train.py", line 158, in <module>
    run(args, parser)
  File "train.py", line 152, in run
    resume=args.resume)
  File "/home/am9114/ray/python/ray/tune/tune.py", line 185, in run_experiments
    raise TuneError("Trials did not complete", errored_trials)
ray.tune.error.TuneError: ('Trials did not complete', [A3C_SpaceInvadersNoFrameskip-v4_0])
PROFILE: interrupts/evictions/bytes = 0/0/64
stephenoffer commented 5 years ago

Since you didn't post the source code, I'm assuming that you tried something like this?

import ray
import ray.rllib.agents.a3c as a3c
from ray.tune.logger import pretty_print
import line_profiler  # optional here; kernprof injects `profile` at runtime

@profile
def main():
    ray.init()
    config = a3c.DEFAULT_CONFIG.copy()
    config["num_gpus"] = 0
    config["num_workers"] = 1
    agent = a3c.A3CAgent(config=config, env="SpaceInvaders-v4")

    for i in range(1000):
       result = agent.train()
       print(pretty_print(result))

       if i % 100 == 0:
           checkpoint = agent.save()
           print("checkpoint saved at", checkpoint)

if __name__ == "__main__":
    main()

And then you called it with:

kernprof -l rllib_profiler.py

For only 10 epochs in this example, it worked fine:

$ python -m line_profiler rllib_profiler.py.lprof 
Timer unit: 1e-06 s

Total time: 89.1993 s
File: rllib_profiler.py
Function: main at line 6

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     6                                           @profile
     7                                           def main():
     8         1     538095.0 538095.0      0.6      ray.init()
     9         1         55.0     55.0      0.0      config = a3c.DEFAULT_CONFIG.copy()
    10         1          2.0      2.0      0.0      config["num_gpus"] = 0
    11         1          1.0      1.0      0.0      config["num_workers"] = 1
    12         1    2102656.0 2102656.0      2.4     agent = a3c.A3CAgent(config=config, env="SpaceInvaders-v4")
    13                                           
    14        11         17.0      1.5      0.0      for i in range(10):
    15        10   86348015.0 8634801.5     96.8        result = agent.train()
    16        10     182952.0  18295.2      0.2         print(pretty_print(result))
    17                                           
    18        10         27.0      2.7      0.0         if i % 100 == 0:
    19         1      27420.0  27420.0      0.0             checkpoint = agent.save()
    20         1         37.0     37.0      0.0             print("checkpoint saved at", checkpoint)
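One detail worth noting (an inference, not stated verbatim in the thread): kernprof injects `profile` into builtins only in the process it launches, while Ray workers import the decorated module in separate processes where `profile` does not exist, which matches the NameError in the traceback above. A common guard is a no-op fallback:

```python
try:
    profile  # noqa: F821 -- defined by kernprof under `kernprof -l script.py`
except NameError:
    def profile(func):
        # no-op decorator so the module also imports cleanly without kernprof
        return func

@profile
def step():
    return sum(range(10))

print(step())  # works whether or not kernprof is active
```

With this guard, worker processes that never see kernprof's injection still import the module without error; only the driver process gets real line timings.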
heeryerate commented 4 years ago

Hi @gndctrl2mjrtm Thanks for pointing out this example for profiling. In my case, I have a customized env and NN model, run through PPO. Is there a way to profile the code to understand how much time is spent on env interaction versus how much is spent on NN weight updates? I tried applying @profile to the step/reset/... functions inside my custom env class, but it doesn't go through.

custom_env example at: https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py
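For reference, one decorator-free way to get that env-vs-training breakdown is to wrap the env and accumulate wall-clock time in `step()`; the sketch below uses illustrative names (this is not an RLlib API):

```python
import time

class TimedEnv:
    """Hypothetical wrapper that accumulates wall-clock time spent
    inside the wrapped env's step(), for comparison against total
    training time."""
    def __init__(self, env):
        self.env = env
        self.step_time = 0.0  # seconds spent inside env.step()

    def step(self, action):
        t0 = time.perf_counter()
        result = self.env.step(action)
        self.step_time += time.perf_counter() - t0
        return result

class DummyEnv:
    # toy stand-in for a real gym.Env
    def step(self, action):
        return action * 2

env = TimedEnv(DummyEnv())
obs = env.step(3)
print(obs, env.step_time)
```

Because the wrapper runs inside whichever worker process owns the env, it avoids the cross-process `profile` injection problem entirely.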

richardliaw commented 4 years ago

Consider using the cProfile Python API for manual profiling instead of the decorator.
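A minimal sketch of that suggestion, using only the standard library (`train_step` here is a hypothetical stand-in for `agent.train()`):

```python
import cProfile
import io
import pstats

def train_step():
    # stand-in for agent.train(); any expensive call is profiled the same way
    return sum(i * i for i in range(100_000))

pr = cProfile.Profile()
pr.enable()
for _ in range(3):
    train_step()
pr.disable()

# render the top entries sorted by cumulative time
buf = io.StringIO()
pstats.Stats(pr, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print(report)
```

Because `cProfile.Profile()` is created and enabled explicitly in your own code, nothing needs to be injected into builtins, so it works the same inside and outside Ray.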