Closed whikwon closed 6 years ago
Hi @whikwon,
Can you verify that the Ray cluster has actually been created? You can do this by posting the output of ray.global_state.client_table().
Also, can you clarify what you mean by the CPUs starting and stopping?
This would help debug the issue. Thanks!
@richardliaw Thanks for the quick response.
I think the cluster has been created (picture below).
I'm monitoring CPU usage with top in the console. When I execute the code above, all CPUs go to 100% within a second and then drop to idle.
After that, none of the CPUs do any work, but the code I ran doesn't produce any errors.
No problem - does the code simply return? Or does it hang?
It's possible that it's starting and waiting for actors (which may take some time initially).
Also, it would be good to set OMP_NUM_THREADS
as an environment variable, or else you could hit some OS thread limit.
One place to check for errors is to tail /tmp/raylogs/worker*.
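If it helps, the same check can be scripted from Python; this is a small self-contained sketch of the tail command above (the glob pattern is the log path already mentioned in this thread):

```python
import glob

def tail_file(path, n=10):
    """Return the last n lines of a file, like `tail -n`."""
    with open(path) as f:
        return f.readlines()[-n:]

# Print the tail of every worker log, as `tail /tmp/raylogs/worker*` does.
for log_path in sorted(glob.glob("/tmp/raylogs/worker*")):
    print("==>", log_path, "<==")
    print("".join(tail_file(log_path)), end="")
```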
It just hangs. I thought it might take a while, but the status doesn't change.
Where do I have to set OMP_NUM_THREADS?
Perhaps run something like export OMP_NUM_THREADS=2 on each machine before initializing Ray (or run ray stop and then initialize Ray again).
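The same cap can also be applied from Python instead of the shell, as long as it happens before Ray and the numerical libraries are imported, so that worker processes inherit it; the value 2 here just mirrors the suggestion above:

```python
import os

# Cap the number of OpenMP threads per process. This must run before
# importing ray / tensorflow / numpy for those libraries to pick it up.
os.environ["OMP_NUM_THREADS"] = "2"

# ...import and initialize Ray afterwards...
print(os.environ["OMP_NUM_THREADS"])
```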
[image: image] https://user-images.githubusercontent.com/30768596/44072525-4c76660a-9fca-11e8-9e70-bb0af21dd9f2.png
Thanks. I tried that, but it doesn't work. OMG
Sorry to hear. What are the contents of the worker logs? tail /tmp/raylogs/worker*
It is possible you are running into this issue: https://github.com/ray-project/ray/issues/2541
One workaround is to avoid launching that many workers: if you have N virtual CPUs, use N/2 workers to avoid completely filling up the cluster. I usually find no performance benefit from using more than N/2 workers anyway, since model inference is very floating-point intensive and doesn't benefit from hyperthreading.
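A tiny sketch of that heuristic (the function name is mine, not part of Ray's API):

```python
import multiprocessing

def suggested_num_workers(num_cpus=None):
    """Use half the virtual CPUs, but always at least one worker."""
    if num_cpus is None:
        num_cpus = multiprocessing.cpu_count()
    return max(1, num_cpus // 2)

print(suggested_num_workers(8))  # on an 8-vCPU machine, suggests 4 workers
```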
@richardliaw Nothing special, just TensorFlow and gym warnings, things like that.
@ericl I had no problem running N virtual CPUs in one Azure virtual machine.
Now I'm using M virtual machines and tried M*N/2 workers, but the same thing happens. (No warning, no CPU usage, just hanging.)
Do you get the hang with a very small number of workers, similar to or fewer than what you used for a single machine?
I'm trying to figure out whether this is a multi-node issue.
Oh, by the way, I would also switch to running the agent with Tune (see the RLlib training API docs for usage). That usually takes care of subtle resource-allocation issues, and it will print out the current usage as well.
Should be fixed -- feel free to reopen if it happens still.
System information
Describe the problem
I made a cluster using 3 Azure VMs and ran the basic Ape-X DDPG code. As soon as I execute it, all CPUs start and then stop immediately, with no errors.
When I ran the same code on a single VM, it worked.
Source code / logs