Failed node cannot join back

kosztyua commented 8 years ago

Hi, I have just set up a simple 5-node cluster and am testing how recovery could work... Whenever a node fails (kill -9, docker restart, etc.), that node will not join back and fails with this error:

2016-06-28 08:14:16 1899 [ERROR] WSREP: gcs/src/gcs_group.cpp:group_post_state_exchange():321: Reversing history: 935 -> 934, this member has applied 1 more events than the primary component.Data loss is possible. Aborting.

It would seem something commits an extra transaction while the environment is setting up. I am using the default current Docker Hub build; I ran at least a dozen scenarios and got the same result with each.
After this, the only way to recover is to delete grastate.dat, but that defeats the purpose if I want to build an HA cluster. Any guess what the issue could be?
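
For anyone who wants to look at the saved position before deleting the file, here is a minimal sketch that reads the `seqno:` line out of grastate.dat. The helper name `grastate_seqno` is made up for illustration, and the default path assumes the data directory is /var/lib/mysql.

```shell
# Hypothetical helper: print the seqno recorded in a Galera grastate.dat file.
# Assumes the usual "key: value" layout of that file; the default path is an
# assumption based on a /var/lib/mysql data directory.
grastate_seqno() {
  awk '$1 == "seqno:" { print $2 }' "${1:-/var/lib/mysql/grastate.dat}"
}

# Usage on the failed node:
#   grastate_seqno
# A value of -1 usually means the node did not shut down cleanly.
```

Comparing this value with the seqno the rest of the cluster reports should show whether the node really is one event ahead, as the error message claims.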

marcovc commented 8 years ago

I had exactly the same issue. I don't remember exactly what I did to solve it, but it involved changing the scripts in the /bin directory. If you're interested I can try to diff them with the current version.

kosztyua commented 8 years ago

I would really appreciate that :) I was considering that FLUSH PRIVILEGES may have that effect, as it is the only statement that does not have wsrep turned off.

marcovc commented 8 years ago

It appears the current version has changed significantly from the version I have. Anyway, my problem was in bin/functions/init_database(). For some reason, starting mysql twice (the first time just to set up passwords) for an existing database was triggering the error you mentioned. I removed that functionality, since I don't really need the scripts to reset passwords dynamically, and it has been running great ever since. My version is the following (watch out for the SRC_PATH env variable, which is mine).

```
function init_database() {
  chown -R mysql:mysql /var/lib/mysql
  touch /var/log/mysql/error.log
  tail ---disable-inotify -F /var/log/mysql/error.log &
  if [[ ! -d /var/lib/mysql/mysql ]]; then
    echo "==> An empty or uninitialized database is detected in /var/lib/mysql"
    echo "-----> Creating database..."
    mysql_install_db > /dev/null 2>&1
    echo "==> starting mysql in order to set up passwords"
    mysqld_safe --skip-syslog --verbose &
    echo "-----> sleeping for 20 seconds, then testing if DB is up"
    sleep 20
    while [[ -z $(netstat -lnt | awk "\$6 == \"LISTEN\" && \$4 ~ \".$PUBLISH\" && \$1 ~ \"$PROTO.?\"") ]] ; do sleep 1; done
    [[ -z $HOST ]] && mysql_creds || ${SRC_PATH}/bin/database_creds
    echo "==> stopping mysql after setting up passwords"
    mysqladmin shutdown
    echo "-----> Done!"
  else
    echo "-----> Using an existing database"
  fi
}
```

kosztyua commented 8 years ago

Thank you, I'll test it and report back!

kosztyua commented 8 years ago

Tested both removing init_database() and adding SET wsrep_on=OFF to FLUSH PRIVILEGES, and both worked fine! Creating a pull request for the latter, as it can be a general solution :) Thank you for the help!
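
For reference, a rough sketch of the second approach. The helper name `local_only_sql` is hypothetical and not part of the repository's scripts; it only shows the idea of putting `SET wsrep_on=OFF` in front of the statement in the same session. The actual change in the pull request may look different.

```shell
# Hypothetical helper (not from the repository): emit one SQL statement
# preceded by SET wsrep_on=OFF, so both run in the same client session.
local_only_sql() {
  printf 'SET wsrep_on=OFF;\n%s\n' "$1"
}

# Usage, assuming the mysql client can already connect to the local node:
#   local_only_sql "FLUSH PRIVILEGES;" | mysql
```

Both statements have to go through the same connection, which is why the sketch prints them together and pipes them into a single mysql invocation.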

paulczar / docker-percona_galera

Failed node cannot join back #15