"java.lang.OutOfMemoryError: Unable to create new native thread" using many nodes per host

lpellegr commented 9 years ago

While running benchmarks on EC2 it was noticed that having too many ProActive nodes running tasks may cause java.lang.OutOfMemoryError: Unable to create new native thread on nodes.

Each machine running ProActive nodes had the following configuration:

- CPU with 36 cores
- 180 PA nodes
- Nodes deployed using property
- Property `-Dproactive.node.workers=180` for using single JVM feature

When getting the error, each node was in the following state:

- Running 180 tasks
- Each task was a Java task
- Java task content is `Thread.sleep(600000)` (10 minutes)

It was possible to reproduce the issue with a single machine hosting 180 PA nodes. However, when the issue occurs all machine resources are exhausted and it not even possible to connect to it in order to retrieve logs, analyze load, count number of processes, etc.:

lpellegr@oops:benchmarks-ec2$ ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i ~/.ssh/activeeon-ec2.pem ec2-user@52.29.86.126
Warning: Permanently added '52.29.86.126' (ECDSA) to the list of known hosts.
Last login: Wed Oct 21 12:12:12 2015 from activeeon.pck.nerim.net
-bash: fork: retry: Aucun processus enfant
-bash: fork: retry: Aucun processus enfant
...

The issue is related to the number of threads which are created by nodes. It is important to notice that the exact same configuration as before using a sleep timeout of 25 seconds for Java tasks prevents the issue to occur.

The previous observation let me think the issue could be related with pings which are performed periodically (each 20 seconds). Although the single JVM feature spawns only one JVM for multiple PA nodes, then one ping is done for each PA node running on the host (more exactly each TaskLauncher active object per node)

Below are some pointers to related classes:

Regarding the item TaskLauncher#getProgress, the method is defined with @ImmediateService which implies that a new thread will be created for each invokation, which may explain threads exhaustion since there is no upper bound for immediate services.

I think it could be interesting to investigate deeper the issue. One possibility is to increase the ping period to see if the issue comes from that. If it is the case, a solution or even general improvement could be to see if the scheduler behaviour could be changed in order to send a single ping request to an host even if it contains several ProActive nodes.

fviale commented 9 years ago

Usually this error is due to the linux configuration limitations:

You can try to set the following values in /etc/security/limits.conf (example config for RedHat): username hard nofile 65535 username soft nofile 65535 username soft nproc 65535 username hard nproc 65535 If you see files in /etc/security/limits.d/ folder, you might need to port the settings to these files too.

Though what you experience in the machine after the error occurs is a bit weird, the limitation should affect only one user, but it may make sense, if you are using the same user for the scheduler and for bash session. Did you try to connect with another user to see if there is a difference ?

lpellegr commented 9 years ago

@fviale Yes, something similar was used:

/etc/security/limits.conf
ec2-user         soft     nofile          81920
ec2-user         hard    nofile          81920
ec2-user         soft     nproc         81920
ec2-user         hard    nproc         81920

However, I think it's more a workaround since the same configuration (I mean number of PA nodes and tasks submitted) but a different sleep timeout value in task code (reduced to 25 seconds) does not make exceed the default limit. This last observation let me think it is more an application problem than a system configuration issue.

fviale commented 9 years ago

ok, in that case and if you experience a global problem on the system afterwards, yes it's a different scenario. At the same time, I find it weird to deploy 180 workers on a 36 core machine. It's kind of logical we hit the ceiling there.

With what you last said, it would seem that the number of thread grows with the tasks life, which could be related to the getProgress threads as you said (which I naively thought were short-lived).

The number of transfer Threads is set to 5 (so it will be 5 * 180), maybe it should be configurable.

Do you use forked tasks (this adds another multiplier) ?

tobwiens commented 9 years ago

Why do you think is the thread limit with a sleep 25 not exhausted? Is it possible, that not all task were started before the first sleep 25 returned? Meaning that the scheduling time was longer than the actual execution time? Therefore threads might have been destroyed at some point and therefore the issue did not appear with 25 seconds timeout?

If not? How is it possible to crash or not crash with the same amount of threads and same amount of memory used?

Maybe we reached a limit there, which is noteworthy as a limit but not necessarily needed to be increased.

lpellegr commented 9 years ago

@fviale yes, tasks are forked. I agree that deploying 180 PA nodes is really weird but I think this is another problem. As you said, we should investigate if the issue is really related to getProgress or not.

@tobias I will try to answer all your questions below:

Why do you think is the thread limit with a sleep 25 not exhausted?

We have run a benchmark with the exact same configuration as described before and with sleep 25 without getting any error.

Is it possible, that not all task were started before the first sleep 25 returned?

No, PA nodes usage has grown up to 100%.

How is it possible to crash or not crash with the same amount of threads and same amount of memory used?

That's what I tried to explain before. Tasks running on PA nodes are monitored by calling getProgress each 20 seconds. Since this remote call is a ProActive method call annotated with @ImmediateService, a new thread is created at each invocation. Although these calls are supposed short-lived, they may take time to complete if the machine is overloaded. Thus, if a getProgress invocation takes more than 20 seconds to complete, new calls will spawn new threads, which may explain the limit that is reached.

fviale commented 9 years ago

@lpellegr , I think you analysed correctly the root of the problem. I suppose the getProgress period is equal to the pa.scheduler.core.nodepingfrequency configuration. Which means it could be increased for long running tasks (and overloaded machine).

Additionnally, the getProgress period could be dynamic and adjusted to the time needed by the worker to answer. For example, if the time needed to answer a getProgress call was 10s, it seems not reasonable to send another request 10s after. I think a min of (20x(response_time), nodepingfrequency) could be okay.

ow2-proactive / scheduling

"java.lang.OutOfMemoryError: Unable to create new native thread" using many nodes per host #2364