ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[rllib] Ape-X multiple VMs training issue #2647

Closed: whikwon closed this issue 6 years ago

whikwon commented 6 years ago

System information

Describe the problem

I made a cluster using 3 Azure VMs and ran the basic Ape-X DDPG code. As soon as I execute the code, all CPUs start up and then stop immediately, with no errors.

When I ran the same code on a single VM, it worked.

Source code / logs

import ray
import ray.rllib.agents.ddpg as ddpg

# Connect to the existing cluster; the Ray 0.5-era API takes the head node's Redis address.
ray.init(redis_address='<head-ip>:<port>')
agent = ddpg.ApexDDPGAgent(config={'num_workers': 145}, env='Pendulum-v0')

for i in range(1000):
    result = agent.train()
    print('result: {}'.format(result))

    if i % 100 == 0:
        checkpoint = agent.save()
richardliaw commented 6 years ago

Hi @whikwon,

Can you verify that the ray cluster has actually been created? You can do this by posting the output of ray.global_state.client_table().
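
For example, something like this (a minimal sketch, assuming a Ray 0.5-era cluster where ray.global_state is available; the head address is a placeholder):

import ray

# Connect to the running cluster.
ray.init(redis_address='<head-ip>:<port>')

# Every machine that successfully joined the cluster should appear in this table.
print(ray.global_state.client_table())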

Also, can you clarify what you mean by the CPUs starting and stopping?

This would help debug the issue. Thanks!

whikwon commented 6 years ago

@richardliaw Thanks for the quick response.

I think the cluster has been created. (screenshot below)

[image]

I'm monitoring CPU usage with top in the console. When I execute the code above, all CPUs spike to 100% within a second and then go dead.

After that, none of the CPUs do any work, but the code I executed doesn't produce any error.

richardliaw commented 6 years ago

No problem - does the code simply return? Or does it hang?

It's possible that it's starting and waiting for actors (which may take some time initially).

Also, it would be good to set OMP_NUM_THREADS as an environment variable, or else you could hit some OS thread limit.

One place to check for errors is to tail /tmp/raylogs/worker*.
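
If it's more convenient from Python, here is a small sketch that prints the tail of each worker log (log path taken from the comment above):

import glob

# Show the last few lines of every worker log, similar to running tail /tmp/raylogs/worker*.
for path in sorted(glob.glob('/tmp/raylogs/worker*')):
    with open(path) as f:
        tail = f.readlines()[-10:]
    print('==> {} <=='.format(path))
    print(''.join(tail))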

whikwon commented 6 years ago

It just hangs. I thought it might take a while, but the status doesn't change.

[image: https://user-images.githubusercontent.com/30768596/44072525-4c76660a-9fca-11e8-9e70-bb0af21dd9f2.png]

Where should I set OMP_NUM_THREADS?

richardliaw commented 6 years ago

Perhaps run something like export OMP_NUM_THREADS=2 on each machine before initializing Ray (or run ray stop, set the variable, and then initialize Ray again).
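
The same idea from the driver side, as a minimal sketch (assumption: setting the variable in the driver process only affects processes started on that machine, so the other VMs would still need it exported in their shells before ray start):

import os

# Cap the number of OpenMP threads per process before Ray and TensorFlow start,
# so a large number of workers doesn't hit the OS thread limit.
os.environ['OMP_NUM_THREADS'] = '2'

import ray
ray.init(redis_address='<head-ip>:<port>')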

whikwon commented 6 years ago

Thanks. I tried that, but it still doesn't work. OMG

richardliaw commented 6 years ago

Sorry to hear that. What are the contents of the worker logs? tail /tmp/raylogs/worker*

ericl commented 6 years ago

It is possible you are running into this issue: https://github.com/ray-project/ray/issues/2541

One workaround is to avoid launching that many workers: if you have N virtual CPUs, use N/2 workers so you don't completely fill up the cluster. I usually find no performance benefit from using more than N/2 workers anyway, since model inference is very floating-point intensive and doesn't benefit from hyperthreading.
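
For concreteness, a sketch of that sizing rule (the machine and vCPU counts below are hypothetical, and the head address is a placeholder):

import ray
import ray.rllib.agents.ddpg as ddpg

ray.init(redis_address='<head-ip>:<port>')

# Hypothetical cluster: 3 VMs with 48 virtual CPUs each.
num_machines = 3
vcpus_per_machine = 48

# Use roughly half the virtual CPUs as workers so the cluster isn't completely filled.
num_workers = (num_machines * vcpus_per_machine) // 2

agent = ddpg.ApexDDPGAgent(config={'num_workers': num_workers}, env='Pendulum-v0')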

whikwon commented 6 years ago

@richardliaw Nothing special, just TensorFlow and gym warnings, that kind of thing.

whikwon commented 6 years ago

@ericl I had no problem using all N virtual CPUs on a single Azure VM.

Now I'm using M virtual machines and tried M*N/2 workers, but the same thing happens. (No warnings, no CPU usage, just hanging.)

ericl commented 6 years ago

Do you get the hang with a very small number of workers, similar to or fewer than what you used on a single machine?

I'm trying to figure out whether this is a multi-node issue.

ericl commented 6 years ago

Oh, by the way, I would also switch to running the agent with Tune (see the RLlib training API docs for usage). That usually takes care of subtle resource allocation issues, and it will print out the current resource usage as well.
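
Roughly along these lines (a sketch, assuming the Ray 0.5-era tune.run_experiments API and the registered APEX_DDPG trainable; the experiment name and worker count are illustrative):

import ray
from ray import tune

ray.init(redis_address='<head-ip>:<port>')

tune.run_experiments({
    'apex-ddpg-pendulum': {
        'run': 'APEX_DDPG',
        'env': 'Pendulum-v0',
        'config': {'num_workers': 72},  # illustrative worker count
    },
})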

ericl commented 6 years ago

https://github.com/ray-project/ray/pull/2661

ericl commented 6 years ago

This should be fixed now -- feel free to reopen if it still happens.