neo4j / neo4j

Graphs for Everyone
http://neo4j.com
GNU General Public License v3.0

Neo4j Docker container hangs after some amount of time #8821

Open quentinsch opened 7 years ago

quentinsch commented 7 years ago

On our Docker swarm cluster, Neo4j hangs once in a while and stops functioning. The Java process on the Docker host hangs and cannot be killed, either by Docker or manually on the shell. The Docker host is a physical server running CentOS 7.3.1611 with Docker 1.12.5 on top of it. Neo4j runs in an Alpine Linux 3.5.1 container with OpenJDK 8. The actual database directory is located on a GlusterFS 3.8.5 FUSE mount. I attached the log output which is sent by the container to the syslog server for reference.

Neo4j Version: 3.1.1
Operating System: Alpine Linux 3.5.1
API: Docker 1.12.5 --> now updated to 1.13-1

Steps to reproduce

  1. The image runs fine for a couple of days
  2. The container hangs and stops functioning
  3. Docker swarm tries to relocate the container, but it is 'stuck' --> doesn't happen anymore on Ceph
  4. Docker cannot stop the container --> doesn't happen anymore on Ceph
  5. Docker cannot kill the container --> doesn't happen anymore on Ceph
  6. The OS cannot kill the container --> doesn't happen anymore on Ceph
  7. After restarting, the container works again for a while.

Expected behavior

The container keeps running and stays functional.

Actual behavior

The container stalls/hangs after some amount of time (it varies, but most of the time a couple of days), and the Java process on the host can no longer be stopped or killed.

neo-crashlog.zip

chrisvest commented 7 years ago

Can other programs within the container access files in the store directory? Since you say you cannot even kill (I assume kill -9) the database anymore, my suspicion is that it gets stuck in kernel land, and probably somewhere to do with that FUSE mounted GlusterFS.
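One way to check that suspicion (a minimal sketch, not from this thread, assuming a Linux host with procps) is to look for processes in uninterruptible sleep; a `D` in the `STAT` column means the thread is blocked inside the kernel, typically on I/O, and will ignore `kill -9`:

```shell
# List processes stuck in uninterruptible sleep ("D" state).
# A D-state Java process points at kernel-level I/O, e.g. a hung
# FUSE/GlusterFS request, rather than a problem inside the JVM.
ps -eo pid,stat,cmd | awk 'NR == 1 || $2 ~ /^D/'
```

If the Java threads show up in `D` state here, the hang is below the JVM, in the kernel's view of the filesystem.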

quentinsch commented 7 years ago

The container has a mapped volume, so in theory yes, other containers/programs could access the store directory, but the current container should be the only one accessing it (of course, otherwise you get locking issues). And indeed, kill -9 doesn't help anymore; the process appears to be in some kind of deadlock. I'll look into the FUSE mount, but can it be assumed that Gluster mounts are not supported? Then another question arises: how do you run Neo4j in a Docker swarm (since Docker is supported) but be sure it can access the store?
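One commonly used pattern for that swarm question (a hypothetical sketch assuming Docker 1.13+ swarm mode; the label name and paths are illustrative, not from this thread) is to pin the service to one node with a placement constraint, so it can use a local bind mount instead of a networked filesystem:

```
# docker-stack.yml -- hypothetical sketch; "neo4j-data" is a node label
# you would first set with: docker node update --label-add neo4j-data=true <node>
version: "3"
services:
  neo4j:
    image: neo4j:3.1.1
    volumes:
      - /var/lib/neo4j/data:/data
    deploy:
      placement:
        constraints:
          - node.labels.neo4j-data == true
```

This trades automatic rescheduling for storage safety: if the labelled node dies, the service stays down until that node (or its disk) comes back.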

chrisvest commented 7 years ago

Our documentation only mentions local volumes. We can't test all possible combinations of OS, Docker and file system setups. And networked file systems are not optimal for databases in general anyway.

quentinsch commented 7 years ago

I totally understand, but is there a supported way to run Neo4j in Docker swarm?

chrisvest commented 7 years ago

I don't really know what Docker swarm is, and Docker itself is already in "adventurous territories" as far as I understand, so I don't think we can say that there is a supported way, but there might be a way… somewhere.

spacecowboy commented 7 years ago

There shouldn't be any issues running with Docker Swarm. You need to use a Docker Network to make use of HA but that's probably what you do with Swarm anyway.

Running with GlusterFS is a totally different story though and unrelated to Docker per se.

About the container hanging @quentinsch , have you verified that the machine didn't run out of RAM and start using swap? That could have exactly that effect.
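A quick way to check for that (a minimal sketch, assuming a Linux host) is to compare total and free swap in `/proc/meminfo`; a large gap between the two means the host has been paging:

```shell
# Print swap totals from /proc/meminfo (present on any Linux host).
# If SwapFree is much lower than SwapTotal, the JVM may be getting
# paged out, which can make the whole container appear to hang.
grep -E '^Swap(Total|Free)' /proc/meminfo
```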

quentinsch commented 7 years ago

Thanks for the feedback so far! If you use Docker Swarm with the Docker 'service' way of starting containers, you need storage that moves along with the container when it is rescheduled to another node. In this case that is done with GlusterFS, but that seems to cause some trouble. I'm investigating whether another type of distributed storage can be used.

Regarding the memory @spacecowboy, I don't think the container runs out of memory: in this case there are no restrictions applied to this container and the host has a lot of memory (100+ GB). But thanks for mentioning it!

spacecowboy commented 7 years ago

@quentinsch if the machine really has 100+ GB of RAM you need to configure the memory settings of Neo4j. Do NOT let it allocate the default amount of heap space and page cache.
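For Neo4j 3.1 that sizing goes in `conf/neo4j.conf`; the values below are a hedged example for a shared 100 GB host, not a recommendation from this thread:

```
# conf/neo4j.conf -- illustrative sizes only; tune to your store size
# and to whatever else shares the host. Left unset, Neo4j 3.x derives
# its heap and page cache from the machine's total RAM.
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
dbms.memory.pagecache.size=16g
```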

Are you running Neo4j in HA mode? Or are you using CausalClustering?

quentinsch commented 7 years ago

@spacecowboy Thanks for pointing that out, I'll limit the memory usage and see if that improves the stability. Neo4j is running in HA mode, by the way.

quentinsch commented 7 years ago

OK, I capped the Neo4j container to a limited amount of memory; this didn't help. I also replaced the distributed storage, moving from Gluster to Ceph, which didn't help either. Docker has also been updated to 1.13-1. I'm still investigating this issue, because Neo4j keeps crashing at random times.

chrisvest commented 7 years ago

Do you also cap the virtual memory in the container? The JVM is known to have bugs (or problematic assumptions) about how much memory is available to it when it runs in a Docker container.
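One mitigation for that JVM behaviour (hedged; flag availability depends on the exact JDK 8 build in the image) is to stop the JVM from sizing itself off the host's RAM:

```
# JDK 8u131+ experimental flags: derive the default max heap from the
# cgroup memory limit instead of the host's physical RAM.
-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap
# On older JDK 8 builds, set explicit bounds instead:
-Xms8g -Xmx8g
```

Either way, also setting Neo4j's own heap and page cache sizes explicitly in neo4j.conf means the JVM's guess never comes into play.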