ivangfr / keycloak-clustered

Keycloak-Clustered extends the official quay.io/keycloak/keycloak Keycloak Docker image by adding the JDBC_PING discovery protocol.

Container won't start after initial run #3

maflynn closed this issue 4 years ago

maflynn commented 4 years ago

A little info on my setup: I have 3 CentOS 7 hosts running the latest Docker engine from the official Docker repository. A MariaDB Galera cluster runs in containers across all 3 hosts as the shared database for the Keycloak cluster. I'm using the latest Keycloak image from jboss/keycloak:latest and the JDBC_PING mod to cluster.

My docker run syntax (IPs and passwords removed):

docker run -d -p 8443:8443 -p 7600:7600 \
  -e KEYCLOAK_USER=admin \
  -e KEYCLOAK_PASSWORD=$KC_PASS \
  -e DB_VENDOR=mariadb \
  -e DB_ADDR=$DB_IP \
  -e DB_PORT=32775 \
  -e DB_USER=keycloak \
  -e DB_PASSWORD=$DB_PASS \
  -e DB_DATABASE=keycloak \
  -e JGROUPS_DISCOVERY_EXTERNAL_IP=$EXTERNAL_IP \
  -e JGROUPS_DISCOVERY_PROTOCOL=JDBC_PING \
  -e JGROUPS_DISCOVERY_PROPERTIES=datasource_jndi_name=java:jboss/datasources/KeycloakDS \
  -v /etc/x509/https/tls.crt:/etc/x509/https/tls.crt \
  -v /etc/x509/https/tls.key:/etc/x509/https/tls.key \
  --name keycloak ivanfranchin/keycloak-clustered:latest

When I do the initial docker run command they all start fine. They connect to the database and register themselves in the JGROUPSPING table in the database. I can log in to each one individually and am able to see 3 different sessions being shared between all of them. Everything appears to be working correctly.
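For reference, each node's registration can also be inspected directly in the shared database. A minimal check, assuming the column names of the usual JGROUPSPING table layout:

```sql
-- Sketch: list the cluster members currently registered by JDBC_PING
-- (column names assumed from the typical JGROUPSPING table definition)
SELECT own_addr, cluster_name FROM JGROUPSPING;
```

Each running node should appear as one row; a row that lingers after a node is stopped usually just means the cleanup has not run yet.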

If I stop a container (docker stop keycloak) and try to restart it, it will not come back up.

The error I see in the docker logs is:

The batch failed with the following error:
WFLYCTL0062: Composite operation failed and was rolled back. Steps that failed:
Step: step-9
Operation: /subsystem=datasources/jdbc-driver=mariadb:add(driver-name=mariadb, driver-module-name=org.mariadb.jdbc, driver-xa-datasource-class-name=org.mariadb.jdbc.MySQLDataSource)
Failure: WFLYCTL0212: Duplicate resource [("subsystem" => "datasources"), ("jdbc-driver" => "mariadb")]

It looks like WildFly is building the mariadb datasource again, but I don't know how that could even persist in the container after it's stopped.

If I delete the container (docker rm keycloak) and re-run it, it starts and re-joins the cluster.

keithpl commented 4 years ago

I'm having the exact same issue; I tried a few different tags as well.

laszlomiklosik commented 4 years ago

I also encountered this WARNING (using Postgres, but a similar message was logged). The problem was something else; this was just the last WARNING logged. It is also logged when everything works correctly, even when you are not running Keycloak in a cluster.

It happens because the official Keycloak image recipe does not check whether the driver is already registered; it simply tries to register it on each startup. See https://github.com/keycloak/keycloak-containers/blob/ccde71f8931ac0c1c216c9b1e61dfe326018a53b/server/tools/cli/databases/mariadb/change-database.cli
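A hedged sketch of how that registration could be made idempotent (the read-resource guard is my addition, not something the official image does; the add command is the one from the error above):

```
# Sketch: only register the mariadb driver if it is not present yet
if (outcome != success) of /subsystem=datasources/jdbc-driver=mariadb:read-resource()
    /subsystem=datasources/jdbc-driver=mariadb:add(driver-name=mariadb, driver-module-name=org.mariadb.jdbc, driver-xa-datasource-class-name=org.mariadb.jdbc.MySQLDataSource)
end-if
```

On a fresh container the add runs; on a restart the guard sees the existing resource and skips it, avoiding the WFLYCTL0212 duplicate-resource failure.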

The startup issue might be due to a custom startup script, for example (if you have one). If you configure a logger from a startup script and create it without checking whether it already exists, you can get into a similar situation.

laszlomiklosik commented 4 years ago

Correction to my previous comment: indeed, the JDBC_PING.cli script contains some commands which fail when re-run (and they actually run on each restart). If you remove the lines:

/subsystem=jgroups/stack=udp:remove()
/socket-binding-group=standard-sockets/socket-binding=jgroups-mping:remove()

it will also start up correctly on subsequent runs.

I tried to execute them conditionally using conditions like

if (outcome == success) of /subsystem=jgroups/stack=udp:read-resource()
  /subsystem=jgroups/stack=udp:remove()
end-if

but this did not work in this context. Another attempt was wrapping these 2 lines in a try-catch-end-try block, but that did not work either. Removing the 2 lines does not seem to cause harm.

ivangfr commented 4 years ago

Thanks @maflynn , @kplantjr and @laszlomiklosik for your comments.

I will try to have a look at it as soon as I get some free time.

ivangfr commented 4 years ago

Hi, I was able to reproduce the issue. Indeed, once we have, for instance, 2 running Keycloak instances using the JDBC_PING discovery protocol and one of them restarts, that instance cannot join the cluster again.

After several tests using MySQL, MariaDB and Postgres, I came to the conclusion (based on @laszlomiklosik's suggestion, thanks for that) that we don't need /subsystem=jgroups/stack=udp:remove() and /socket-binding-group=standard-sockets/socket-binding=jgroups-mping:remove(). I removed them and now everything looks OK. No problem on restarting anymore.

Besides, I realized that the command for creating the JGROUPSPING table was correct for MySQL and MariaDB but didn't work for Postgres. Because of that, I've created (in version 11.0.2) different JDBC_PING cli files for each database.
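The vendor difference is mainly the binary column type. A sketch of the two DDL variants (the exact MySQL/MariaDB column types are my assumption; only the BYTEA form for Postgres is confirmed in this thread):

```sql
-- MySQL/MariaDB variant (varbinary assumed for the ping payload)
CREATE TABLE IF NOT EXISTS JGROUPSPING (
    own_addr     varchar(200) NOT NULL,
    cluster_name varchar(200) NOT NULL,
    ping_data    varbinary(5000) DEFAULT NULL,
    CONSTRAINT PK_JGROUPSPING PRIMARY KEY (own_addr, cluster_name)
);

-- Postgres variant: no varbinary type, so BYTEA is used instead
CREATE TABLE IF NOT EXISTS JGROUPSPING (
    own_addr     varchar(200) NOT NULL,
    cluster_name varchar(200) NOT NULL,
    ping_data    BYTEA,
    CONSTRAINT PK_JGROUPSPING PRIMARY KEY (own_addr, cluster_name)
);
```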

laszlomiklosik commented 4 years ago

Thanks for the update. The solution I came up with meanwhile:

embed-server --server-config=standalone-ha.xml --std-out=echo

if (outcome != success) of /subsystem=logging/logger=org.infinispan.CLUSTER:read-resource()
    /subsystem=logging/logger=org.infinispan.CLUSTER:add(level=INFO)
end-if

batch

/subsystem=infinispan/cache-container=keycloak/distributed-cache=sessions:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=infinispan/cache-container=keycloak/distributed-cache=authenticationSessions:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=infinispan/cache-container=keycloak/distributed-cache=offlineSessions:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=infinispan/cache-container=keycloak/distributed-cache=loginFailures:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=infinispan/cache-container=keycloak/distributed-cache=actionTokens:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=infinispan/cache-container=keycloak/distributed-cache=clientSessions:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})
/subsystem=infinispan/cache-container=keycloak/distributed-cache=offlineClientSessions:write-attribute(name=owners, value=${env.CACHE_OWNERS:2})

/subsystem=jgroups/stack=tcp:remove()
/subsystem=jgroups/stack=tcp:add()
/subsystem=jgroups/stack=tcp/transport=TCP:add(socket-binding="jgroups-tcp")

/subsystem=jgroups/stack=tcp/protocol=JDBC_PING:add(add-index=0, properties=$keycloak_jgroups_discovery_protocol_properties)

/subsystem=jgroups/stack=tcp/protocol=MERGE3:add()
/subsystem=jgroups/stack=tcp/protocol=FD_SOCK:add(socket-binding="jgroups-tcp-fd")
/subsystem=jgroups/stack=tcp/protocol=FD:add()
/subsystem=jgroups/stack=tcp/protocol=VERIFY_SUSPECT:add()
/subsystem=jgroups/stack=tcp/protocol=pbcast.NAKACK2:add()
/subsystem=jgroups/stack=tcp/protocol=UNICAST3:add()
/subsystem=jgroups/stack=tcp/protocol=pbcast.STABLE:add()
/subsystem=jgroups/stack=tcp/protocol=pbcast.GMS:add()
/subsystem=jgroups/stack=tcp/protocol=pbcast.GMS/property=max_join_attempts:add(value=5)
/subsystem=jgroups/stack=tcp/protocol=MFC:add()
/subsystem=jgroups/stack=tcp/protocol=FRAG3:add()

/subsystem=jgroups/channel=ee:write-attribute(name=stack, value=tcp)

run-batch

if (outcome == success) of /subsystem=jgroups/stack=udp/protocol=PING:read-resource()
    /subsystem=jgroups/stack=udp/protocol=PING:remove()
end-if

if (outcome == success) of /subsystem=jgroups/stack=tcp/protocol=MPING:read-resource()
    /subsystem=jgroups/stack=tcp/protocol=MPING:remove()
end-if

try
    :resolve-expression(expression=${env.JGROUPS_DISCOVERY_EXTERNAL_IP})
    /subsystem=jgroups/stack=tcp/transport=TCP/property=external_addr/:add(value=${env.JGROUPS_DISCOVERY_EXTERNAL_IP})
catch
    echo "JGROUPS_DISCOVERY_EXTERNAL_IP maybe not set."
end-try

stop-embedded-server

This is the complete JDBC_PING.cli file which I mounted into the official Keycloak image. I took some inspiration from the original keycloak-containers/server/tools/cli/jgroups/discovery/default.cli (and server/tools/jgroups.sh) files, which in theory support JDBC_PING as well, but unfortunately only with local IPs: they don't support registering external IPs. That doesn't make much sense, since the default multicast ping would also work when everything runs on the same machine/docker network.

In docker I specify the following environment variables:

- JGROUPS_DISCOVERY_EXTERNAL_IP=external_ip_goes_here
- JGROUPS_DISCOVERY_PROTOCOL=JDBC_PING
- JGROUPS_DISCOVERY_PROPERTIES=datasource_jndi_name=java:jboss/datasources/KeycloakDS,info_writer_sleep_time=500,initialize_sql="CREATE TABLE IF NOT EXISTS JGROUPSPING ( own_addr varchar(200) NOT NULL, cluster_name varchar(200) NOT NULL, created timestamp default current_timestamp, ping_data BYTEA, constraint PK_JGROUPSPING PRIMARY KEY (own_addr, cluster_name))"

This makes the table creation query configurable, so I expect the same cli script to work well for all DB vendors.

ivangfr commented 4 years ago

Hey @laszlomiklosik, your solution is brilliant! I will adapt mine.

Btw, I am thinking about reducing the complexity a bit by hiding the JGROUPSPING create-table SQL, so that the final user doesn't need to provide it. My initial solution was a lazy one: I basically created scripts for (at least 3) DB vendors (oracle and mssql are still missing).

However, I believe it's possible to get the DB vendor (set as one of the parameters of the docker container, DB_VENDOR) inside the JDBC_PING script.

When Keycloak starts, it checks the selected DB (https://github.com/keycloak/keycloak-containers/blob/master/server/tools/docker-entrypoint.sh#L231) and runs the change-database.sh script, passing it the DB vendor. Depending on the DB, it then runs the /cli/databases/<DB>/change-database.cli script.

For instance, this is the script for MySQL: https://github.com/keycloak/keycloak-containers/blob/master/server/tools/cli/databases/mysql/change-database.cli#L2 Maybe, by using read-resource() on the KeycloakDS datasource, it's possible to get the driver-name.
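A hedged sketch of what that lookup could look like in jboss-cli (I'm assuming the attribute is named driver-name, as on standard WildFly datasource resources):

```
# Sketch: read which JDBC driver the KeycloakDS datasource is configured with
/subsystem=datasources/data-source=KeycloakDS:read-attribute(name=driver-name)
```

The returned value (e.g. mysql, mariadb, postgresql) could then drive which create-table SQL the script picks, without the user having to pass it in.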

ivangfr commented 4 years ago

In the JDBC_PING script, I've implemented the creation of the JGROUPSPING table as described above.

I am closing this issue. Please, feel free to reopen it in case I can help with something.