Closed: tivalat closed this issue 11 years ago
I have no idea why PostgreSQL can't start. The RA only starts PostgreSQL as the postgres (default) user, like this:
postgres$ /usr/lib/postgresql/9.2/bin/pg_ctl (options such as -o "-p 5432") start
Does Ubuntu's PostgreSQL have a postgres user? Did you see any logs when the "Time Out" occurred?
Yes, Ubuntu has a postgres user. I started Postgres successfully as that user:
postgres@pm01:~$ /usr/lib/postgresql/9.2/bin/pg_ctl start -D /etc/postgresql/9.2/hapg -o "-p 5432"
server starting
postgres@pm01:~$ 2012-12-12 14:27:04 ICT LOG: database system was shut down at 2012-12-12 14:17:29 ICT
2012-12-12 14:27:05 ICT LOG: database system is ready to accept connections
2012-12-12 14:27:05 ICT LOG: autovacuum launcher started
Here is syslog:
cat /var/log/syslog | grep crmd| grep "Dec 12"
http://paste.ubuntu.com/1426992/
I saw an error in the log.
Here is more of the log:
root@pm01:~# cat /var/log/syslog | grep crmd| grep "Dec 12"
My problem is:
I could start Postgres by:
/usr/lib/postgresql/9.2/bin/pg_ctl start -D /etc/postgresql/9.2/main/
In your code, postgresql.conf and PG_VERSION are in the same dir.
Thank you for your information. I'm very busy right now. Could you wait a few days?
Of course, no problem. Please review it when you have time.
I have just modified pgsql RA file:
+OCF_RESKEY_config_default=/etc/postgresql/9.2/main/postgresql.conf
-: ${OCF_RESKEY_config=${OCF_RESKEY_pgdata}/postgresql.conf}
+: ${OCF_RESKEY_config=${OCF_RESKEY_config_default}}
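The RA lines above rely on the shell's `${VAR=default}` expansion, which assigns the default only when the variable is unset. A minimal sketch of that behaviour follows; `resolve_config` is a hypothetical helper written for illustration and is not part of the pgsql RA.

```shell
#!/bin/sh
# Sketch: ": ${VAR=default}" assigns a default only when VAR is unset.
# resolve_config is a hypothetical helper, not a function from the RA.
resolve_config() {
    # $1 = pgdata directory, $2 = optional explicit config path
    (
        OCF_RESKEY_pgdata=$1
        unset OCF_RESKEY_config
        [ -n "$2" ] && OCF_RESKEY_config=$2
        # Same idiom as the RA line: fall back to pgdata/postgresql.conf
        : "${OCF_RESKEY_config=${OCF_RESKEY_pgdata}/postgresql.conf}"
        printf '%s\n' "$OCF_RESKEY_config"
    )
}
```

With only a data directory given, the helper prints the path under pgdata; when a second argument is supplied (as the "config" parameter would do), that value wins and the default is ignored.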
I ran into another error:
Dec 12 18:04:26 pm01 crmd: [10448]: WARN: status_from_rc: Action 5 (pgsql_start_0) on pm01 failed (target: 0 vs. rc: 6): Error
Dec 12 18:04:26 pm01 crmd: [10448]: WARN: update_failcount: Updating failcount for pgsql on pm01 after failed start: rc=6 (update=INFINITY, time=1355310266)
Full log here: http://paste.ubuntu.com/1428903/
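The `rc: 6` in the crmd lines above is an OCF return code; 6 is "not configured" in the OCF resource agent convention. The sketch below maps the standard codes to their usual names and pulls the code out of a `status_from_rc` line. Both function names are hypothetical helpers for reading logs, not Pacemaker commands.

```shell
#!/bin/sh
# Sketch: translate OCF resource agent return codes into their usual names.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo UNKNOWN ;;
    esac
}

# Extract N from the "(target: 0 vs. rc: N)" part of a crmd log line.
rc_from_log_line() {
    printf '%s\n' "$1" | sed -n 's/.*rc: \([0-9][0-9]*\)).*/\1/p'
}
```

A "not configured" result usually points at the resource definition rather than at PostgreSQL itself, which matches how this thread continues.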
Here is my config:
crm(live)configure# show
node pm01
node pm02
primitive pgsql ocf:heartbeat:pgsql \
params pgctl="/usr/lib/postgresql/9.2/bin/pg_ctl" psql="/usr/lib/postgresql/9.2/bin/psql" pgdata="/var/lib/postgresql/9.2/main" start_opt="-p 5432" rep_mode="sync" node_list="pm01 pm02" restore_command="cp /var/lib/postgresql/9.2/main/pg_archive/%f %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="192.168.3.200" stop_escalate="0" \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="7s" timeout="60s" on-fail="restart" \
op monitor interval="2s" role="Master" timeout="60s" on-fail="restart" \
op promote interval="0s" timeout="60s" on-fail="restart" \
op demote interval="0s" timeout="60s" on-fail="block" \
op stop interval="0s" timeout="60s" on-fail="block" \
op notify interval="0s" timeout="60s"
property $id="cib-bootstrap-options" \
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
crmd-transition-delay="0s"
rsc_defaults $id="rsc-options" \
resource-stickiness="INFINITY" \
migration-threshold="1"
Hi
My pgdata folder is /var/lib/postgresql/9.2/main/. PG_VERSION file is in this folder.
Why does your PG_VERSION have a "."? Are you using a customized PostgreSQL? I haven't considered that case.
My config folder is /etc/postgresql/9.2/main. postgresql.conf file is in this folder.
You don't need a patch. You can specify the path of postgresql.conf with the "config" parameter, for example:
params config="/etc/postgresql/9.2/main/postgresql.conf" pgctl="/usr/lib/postgresql/9.2/bin/pg_ctl" psql="/usr/lib/postgresql/9.2/bin/psql" pgdata="/var/lib/postgresql/9.2/main" start_opt="-p 5432" rep_mode="sync" node_list="pm01 pm02" restore_command="cp /var/lib/postgresql/9.2/main/pg_archive/%f %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="192.168.3.200" stop_escalate="0" \
My sample configuration is for RedHat.
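On Debian and Ubuntu the data directory and the configuration directory are separate, so it can help to confirm both paths before handing them to the RA. The following is a small sketch under that assumption; `check_pg_layout` is a hypothetical helper, and the paths in the usage line are the ones quoted in this thread.

```shell
#!/bin/sh
# Sketch: verify that pgdata holds PG_VERSION and that the config file exists.
# check_pg_layout is a hypothetical helper, not part of the pgsql RA.
check_pg_layout() {
    pgdata=$1
    config=$2
    status=0
    if [ ! -f "$pgdata/PG_VERSION" ]; then
        echo "missing: $pgdata/PG_VERSION" >&2
        status=1
    fi
    if [ ! -f "$config" ]; then
        echo "missing: $config" >&2
        status=1
    fi
    return "$status"
}

# Example usage with the Ubuntu paths from this thread:
#   check_pg_layout /var/lib/postgresql/9.2/main /etc/postgresql/9.2/main/postgresql.conf
```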
Sorry, I meant:
My pgdata folder is /var/lib/postgresql/9.2/main/.
PG_VERSION file is in this folder.
I did not customize Postgres yet.
I used your suggestion:
crm(live)configure# show
node pm01
node pm02
primitive pgsql ocf:heartbeat:pgsql \
params pgctl="/usr/lib/postgresql/9.2/bin/pg_ctl" psql="/usr/lib/postgresql/9.2/bin/psql" pgdata="/var/lib/postgresql/9.2/main" config="/etc/postgresql/9.2/main/postgresql.conf" start_opt="-p 5432" rep_mode="sync" node_list="pm01 pm02" restore_command="cp /var/lib/postgresql/9.2/main/pg_archive/%f %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="192.168.3.200" stop_escalate="0" \
op start interval="0s" timeout="60s" on-fail="restart" \
op monitor interval="7s" timeout="60s" on-fail="restart" \
op monitor interval="2s" role="Master" timeout="60s" on-fail="restart" \
op promote interval="0s" timeout="60s" on-fail="restart" \
op demote interval="0s" timeout="60s" on-fail="block" \
op stop interval="0s" timeout="60s" on-fail="block" \
op notify interval="0s" timeout="60s"
property $id="cib-bootstrap-options" \
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
no-quorum-policy="ignore" \
stonith-enabled="false" \
crmd-transition-delay="0s"
rsc_defaults $id="rsc-options" \
resource-stickiness="INFINITY" \
migration-threshold="1"
and your original RA.
But the error was still displayed:
Dec 13 12:23:08 pm01 crmd: [32163]: info: match_graph_event: Action pgsql_monitor_0 (4) confirmed on pm01 (rc=0)
Dec 13 12:23:08 pm01 crmd: [32163]: info: te_rsc_command: Initiating action 3: probe_complete probe_complete on pm01 (local) - no waiting
Dec 13 12:23:08 pm01 crmd: [32163]: info: te_pseudo_action: Pseudo action 2 fired and confirmed
Dec 13 12:23:08 pm01 crmd: [32163]: info: te_rsc_command: Initiating action 5: start pgsql_start_0 on pm01 (local)
Dec 13 12:23:08 pm01 crmd: [32163]: info: do_lrm_rsc_op: Performing key=5:0:0:283e0f3c-db1e-4a3b-bdda-e688b195c4eb op=pgsql_start_0 )
Dec 13 12:23:08 pm01 crmd: [32163]: info: process_lrm_event: LRM operation pgsql_start_0 (call=3, rc=6, cib-update=27, confirmed=true) not configured
Dec 13 12:23:08 pm01 crmd: [32163]: WARN: status_from_rc: Action 5 (pgsql_start_0) on pm01 failed (target: 0 vs. rc: 6): Error
Dec 13 12:23:08 pm01 crmd: [32163]: WARN: update_failcount: Updating failcount for pgsql on pm01 after failed start: rc=6 (update=INFINITY, time=1355376188)
Full log here
It seems that your log doesn't include the RA's log.
The pgsql RA outputs some useful logs. Please capture the RA's log too.
I'm afraid I don't know how to set that up in Pacemaker 1.1.6 on Ubuntu. Maybe you can capture it in debug mode.
I installed Pacemaker from Ubuntu's repository.
apt-get install pacemaker corosync
No other actions are required.
Oops, why do you grep for "crmd"? That's not the full log.
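Filtering on "crmd" alone hides the lines the RA writes under its own tag (`pgsql[PID]:` in the logs later in this thread). One possible filter that keeps both is sketched below; `cluster_log` is a hypothetical helper, and including `lrmd` is an assumption about which other daemon tags are worth keeping.

```shell
#!/bin/sh
# Sketch: keep crmd, lrmd and pgsql RA lines for one day from a syslog file.
# cluster_log is a hypothetical helper; the tag list is an assumption.
cluster_log() {
    # $1 = log file, $2 = date prefix such as "Dec 14"
    grep "^$2 " "$1" | grep -E ' (crmd|lrmd|pgsql)(\[[0-9]+\])?:'
}

# Example usage:
#   cluster_log /var/log/syslog "Dec 14"
```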
Sorry, this is the full log
I saw this error in the log:
Dec 14 11:58:53 pm01 pgsql[18254]: ERROR: Replication requires Master/Slave configuration.
Dec 14 11:58:53 pm01 pgsql[18254]: INFO: Changing pgsql-status on pm01 : ->UNKNOWN.
But I have checked my Postgres Replication config:
 pid   | usesysid | usename  | application_name | client_addr   | client_hostname | client_port | backend_start                 | state     | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+----------+------------------+---------------+-----------------+-------------+-------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 26014 | 10       | postgres | walreceiver      | 192.168.25.80 |                 | 35950       | 2012-12-14 12:07:08.264074+07 | streaming | 0/E000080     | 0/E000080      | 0/E000080      | 0/E000080       | 1             | sync
(1 row)
Oh, your configuration has no Master/Slave. Replication mode needs it.
Even if you add the Master/Slave configuration, that alone won't give you HA clustering of PostgreSQL replication. Replication mode needs at least a replication IP (vip-rep) plus location and colocation constraints.
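As a rough illustration of what such a configuration could look like, the sketch below writes a crm fragment to a file. The resource names (msPostgresql, master-group, clnPingCheck) come from the crm_mon output in this thread; the constraint IDs, scores, and meta values are assumptions, the primitives for the VIPs and ping check are assumed to be defined elsewhere, and the location rule is left out because it depends on the ping setup. `write_ms_fragment` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch only: write a Master/Slave fragment for the pgsql primitive to a file.
# Constraint IDs, scores and meta values below are assumptions, not the
# configuration used by anyone in this thread.
write_ms_fragment() {
    cat > "$1" <<'EOF'
ms msPostgresql pgsql \
    meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
colocation rsc_colocation-1 inf: msPostgresql clnPingCheck
colocation rsc_colocation-2 inf: master-group msPostgresql:Master
order rsc_order-1 0: clnPingCheck msPostgresql
order rsc_order-2 0: msPostgresql:promote master-group:start symmetrical=false
order rsc_order-3 0: msPostgresql:demote master-group:stop symmetrical=false
EOF
}

# Example usage (review the file before loading it into the cluster):
#   write_ms_fragment /tmp/ms.crm
#   crm configure load update /tmp/ms.crm
```

The second colocation ties the group holding vip-master and vip-rep to whichever node is currently Master, which is why the replication IP is listed as a requirement.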
I have added Master/Slave and other configuration.
The error is resolved. But Postgres on the Master started in recovery mode, then went down.
Here is the log:
http://paste.ubuntu.com/1445042/
crm_mon output:
Last updated: Mon Dec 17 17:58:34 2012
Last change: Mon Dec 17 17:57:25 2012 via crm_attribute on pm01
Stack: openais
Current DC: pm01 - partition WITHOUT quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
Online: [ pm01 ]
OFFLINE: [ pm02 ]
Full list of resources:
 Clone Set: clnPingCheck [pingCheck]
     Started: [ pm01 ]
     Stopped: [ pingCheck:1 ]
 vip-slave (ocf::heartbeat:IPaddr2): Stopped
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Stopped
     vip-rep (ocf::heartbeat:IPaddr2): Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Stopped: [ pgsql:0 pgsql:1 ]
Node Attributes:
Migration summary:
Failed actions:
    pgsql:0_start_0 (node=pm01, call=16, rc=-2, status=Timed Out): unknown exec error
I have read this issue
https://github.com/t-matsuo/resource-agents/issues/15
I have configured Replication manually successfully.
Here is my configuration
The error is resolved. But Postgres on the Master started in recovery mode, then went down.
It's normal to start in recovery mode. Can you start PostgreSQL in recovery mode manually? And can you run "SELECT pg_is_in_recovery()"?
If you can't run the SELECT, please set up PostgreSQL appropriately.
BTW, PostgreSQL can't start in recovery mode unless it has connected to the Master manually at least once.
Also, I heard that the current PostgreSQL 9.2.x has a bug in recovery.
Yes, I have configured Postgres in recovery mode successfully.
postgres=# select pg_is_in_recovery();
 t
(1 row)
I still cannot start Master node.
Last updated: Wed Dec 19 19:06:38 2012
Last change: Wed Dec 19 15:33:40 2012 via crm_attribute on pm01
Stack: openais
Current DC: pm01 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
Online: [ pm01 pm02 ]
Full list of resources:
 Clone Set: clnPingCheck [pingCheck]
     Started: [ pm02 pm01 ]
 vip-slave (ocf::heartbeat:IPaddr2): Stopped
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Stopped
     vip-rep (ocf::heartbeat:IPaddr2): Stopped
 Master/Slave Set: msPostgresql [pgsql]
     Slaves: [ pm02 ]
     Stopped: [ pgsql:1 ]
Node Attributes:
Migration summary:
Failed actions:
    pgsql:1_start_0 (node=pm01, call=17, rc=-2, status=Timed Out): unknown exec error
My configuration
no-quorum-policy="ignore" \
stonith-enabled="false" \
crmd-transition-delay="0s"
rsc_defaults $id="rsc-options" \
resource-stickiness="INFINITY" \
migration-threshold="1"
Yes, I have configured Postgres in recovery mode successfully.
On pm01? crm_mon says that Pacemaker can't start PostgreSQL on pm01.
pm01 is Postgres master, pm02 is Postgres slave.
So I configured pm02 in recovery mode.
The RA creates recovery.conf automatically. You don't need to create it.
On 12/26/2012 08:58 AM, Takatoshi MATSUO wrote:
RA makes recovery.conf automatically. You don't need to make it.
I tested Replication manually successfully.
I deleted the recovery.conf files and stopped Postgres before starting Pacemaker. I also removed "synchronous_standby_names" as instructed.
I also checked with Postgres 9.1. The problem is identical.
We can't choose the Master ourselves. The RA selects it.
The RA creates recovery.conf automatically when starting, and it promotes the PostgreSQL instance that has the newest data. So you don't need to create recovery.conf.
Please paste the newest logs (from both pm01 and pm02).
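To make "promotes the instance that has the newest data" concrete, here is a simplified sketch that compares xlog locations in the 16-digit hexadecimal form that the RA prints in its log (for example `0000000006000078`). This is not the RA's actual selection code, which also relies on node attributes and data status; `newest_node` is a hypothetical helper.

```shell
#!/bin/sh
# Sketch: pick the node with the highest xlog location.
# Arguments are pairs: node name, then hex xlog location.
# newest_node is a hypothetical helper, not the RA's real logic.
newest_node() {
    best_node=
    best_loc=-1
    while [ "$#" -ge 2 ]; do
        loc=$((0x$2))          # hex string to integer
        if [ "$loc" -gt "$best_loc" ]; then
            best_node=$1
            best_loc=$loc
        fi
        shift 2
    done
    printf '%s\n' "$best_node"
}

# Example usage:
#   newest_node pm01 0000000006000000 pm02 0000000006000078
```

On a tie the first node listed is kept, which is an arbitrary choice made for this sketch.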
Last updated: Tue Dec 25 18:18:30 2012
Last change: Tue Dec 25 18:14:42 2012 via crm_attribute on pm02
Stack: openais
Current DC: pm01 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
Online: [ pm01 pm02 ]
Full list of resources:
 vip-slave (ocf::heartbeat:IPaddr2): Started pm02
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started pm02
     vip-rep (ocf::heartbeat:IPaddr2): Started pm02
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ pm02 ]
     Stopped: [ pgsql:0 ]
 Clone Set: clnPingCheck [pingCheck]
     Started: [ pm01 pm02 ]
Node Attributes:
Migration summary:
Failed actions:
    pgsql:0_start_0 (node=pm01, call=10, rc=-2, status=Timed Out): unknown exec error
The log is here:
Dec 26 16:11:24 pm02 pgsql[4709]: INFO: Don't check /var/lib/postgresql/9.1/main/ during probe
Dec 26 16:11:24 pm02 pgsql[4709]: INFO: PostgreSQL is down
Dec 26 16:11:32 pm02 pgsql[4985]: INFO: Changing pgsql-status on pm02 : ->STOP.
Dec 26 16:11:32 pm02 pgsql[4985]: INFO: Set all nodes into async mode.
Dec 26 16:11:32 pm02 pgsql[4985]: INFO: server starting
Dec 26 16:11:32 pm02 pgsql[4985]: INFO: PostgreSQL start command sent.
Dec 26 16:11:32 pm02 pgsql[4985]: WARNING: PostgreSQL template1 isn't running
Dec 26 16:11:32 pm02 pgsql[4985]: WARNING: Connection error (connection to the server went bad and the session was not interactive) occurred while executing the psql command.
Dec 26 16:11:33 pm02 pgsql[4985]: INFO: PostgreSQL is started.
Dec 26 16:11:33 pm02 pgsql[4985]: INFO: Changing pgsql-status on pm02 : STOP->HS:alone.
Dec 26 16:11:34 pm02 pgsql[5220]: INFO: Master does not exist.
Dec 26 16:11:34 pm02 pgsql[5220]: INFO: My data status=.
Dec 26 16:11:34 pm02 pgsql[5220]: WARNING: Can't get pm01 xlog location.
Dec 26 16:11:34 pm02 pgsql[5220]: INFO: pm02 xlog location : 0000000006000078
Dec 26 16:11:41 pm02 pgsql[5451]: INFO: Master does not exist.
Dec 26 16:11:41 pm02 pgsql[5451]: INFO: My data status=.
Dec 26 16:11:41 pm02 pgsql[5451]: WARNING: Can't get pm01 xlog location.
Dec 26 16:11:41 pm02 pgsql[5451]: INFO: pm02 xlog location : 0000000006000078
Dec 26 16:11:48 pm02 pgsql[5696]: INFO: Master does not exist.
Dec 26 16:11:48 pm02 pgsql[5696]: INFO: My data status=.
Dec 26 16:11:48 pm02 pgsql[5696]: WARNING: Can't get pm01 xlog location.
Dec 26 16:11:48 pm02 pgsql[5696]: INFO: pm02 xlog location : 0000000006000078
Dec 26 16:11:56 pm02 pgsql[5946]: INFO: Master does not exist.
Dec 26 16:11:56 pm02 pgsql[5946]: INFO: My data status=.
Dec 26 16:11:56 pm02 pgsql[5946]: WARNING: Can't get pm01 xlog location.
Dec 26 16:11:56 pm02 pgsql[5946]: INFO: pm02 xlog location : 0000000006000078
Dec 26 16:11:56 pm02 pgsql[5946]: INFO: I have a master right.
Dec 26 16:11:56 pm02 pgsql[6125]: INFO: Changing pgsql-data-status on pm01 : ->DISCONNECT.
Dec 26 16:11:57 pm02 pgsql[6125]: INFO: Creating /var/lib/pgsql/tmp/PGSQL.lock.
Dec 26 16:11:57 pm02 pgsql[6125]: INFO: My master baseline : 0000000006000078.
Dec 26 16:11:57 pm02 pgsql[6125]: INFO: server promoting
Dec 26 16:11:57 pm02 pgsql[6125]: INFO: PostgreSQL promote command sent.
Dec 26 16:11:58 pm02 pgsql[6125]: INFO: PostgreSQL is promoted.
Dec 26 16:11:59 pm02 pgsql[6125]: INFO: Changing pgsql-data-status on pm02 : ->LATEST.
Dec 26 16:12:00 pm02 pgsql[6125]: INFO: Changing pgsql-status on pm02 : HS:alone->PRI.
Dec 26 16:31:44 pm02 pgsql[21705]: INFO: Don't check /var/lib/postgresql/9.1/main/ during probe
Dec 26 16:31:44 pm02 pgsql[21705]: INFO: Changing pgsql-data-status on pm02 : ->LATEST.
Dec 26 16:31:45 pm02 pgsql[21705]: INFO: Changing pgsql-data-status on pm01 : ->DISCONNECT.
Can you start PostgreSQL and run a SELECT on pm01 manually?
Create /var/lib/postgresql/9.2/main/recovery.conf with the following content:
standby_mode = 'on'
primary_conninfo = 'host=192.168.3.200 port=5432 user=postgres application_name=pm01 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'cp /var/lib/postgresql/9.2/main/pg_archive/%f %p'
recovery_target_timeline = 'latest'
Then start PostgreSQL and run the SELECT:
pm01# chown postgres:postgres /var/lib/postgresql/9.2/main/recovery.conf
pm01# su postgres -c "cd /var/lib/postgresql/9.2/main; /usr/lib/postgresql/9.2/bin/pg_ctl -D /var/lib/postgresql/9.2/main -o '-c config_file=/etc/postgresql/9.2/main/postgresql.conf -p 5432' stop"
pm01# su postgres -c "cd /var/lib/postgresql/9.2/main; /usr/lib/postgresql/9.2/bin/pg_ctl -D /var/lib/postgresql/9.2/main -o '-c config_file=/etc/postgresql/9.2/main/postgresql.conf -p 5432' start"
pm01# su postgres -c 'cd /var/lib/postgresql/9.2/main; /usr/lib/postgresql/9.2/bin/psql -c "SELECT pg_is_in_recovery()"'
 pg_is_in_recovery
-------------------
 t
(1 row)
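For repeating this manual test on either node, the recovery.conf shown above can be generated from a few parameters. This is a sketch under the assumptions of this thread (port 5432, user postgres, archive directory under pgdata); `write_recovery_conf` is a hypothetical helper and is not how the RA itself writes the file.

```shell
#!/bin/sh
# Sketch: generate a recovery.conf like the one above for a manual standby test.
# write_recovery_conf is a hypothetical helper; port, user and archive path
# are the values quoted in this thread.
write_recovery_conf() {
    # $1 = pgdata directory, $2 = master IP, $3 = this node's name
    cat > "$1/recovery.conf" <<EOF
standby_mode = 'on'
primary_conninfo = 'host=$2 port=5432 user=postgres application_name=$3 keepalives_idle=60 keepalives_interval=5 keepalives_count=5'
restore_command = 'cp $1/pg_archive/%f %p'
recovery_target_timeline = 'latest'
EOF
}

# Example usage:
#   write_recovery_conf /var/lib/postgresql/9.2/main 192.168.3.200 pm01
```

Remember to remove the file again before letting Pacemaker manage PostgreSQL, since the RA is described above as creating recovery.conf on its own.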
Your log suggests that the last SELECT operation failed.
I can do HA with Postgres 9.1 now.
There were some problems with my Replication.
Last updated: Sat Dec 29 12:24:02 2012
Last change: Sat Dec 29 12:21:32 2012 via crm_attribute on pm02
Stack: openais
Current DC: pm02 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
Online: [ pm01 pm02 ]
Full list of resources:
 vip-slave (ocf::heartbeat:IPaddr2): Started pm01
 Resource Group: master-group
     vip-master (ocf::heartbeat:IPaddr2): Started pm02
     vip-rep (ocf::heartbeat:IPaddr2): Started pm02
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ pm02 ]
     Slaves: [ pm01 ]
 Clone Set: clnPingCheck [pingCheck]
     Started: [ pm01 pm02 ]
Node Attributes:
Migration summary:
I will try with 9.2 and handle the case where the old primary comes back online. I want it to become the new slave automatically.
Thanks for your full support.
P.S. I raised another issue :)
Hi,
I used your RA on Ubuntu 12.04, Pacemaker 1.1.6, Postgres 9.2.
I edited the paths as follows:
And there was an error:
I have not configured anything else yet. I also note that Pacemaker 1.1.6 uses time parameters without "s" (e.g. timeout="60").
I have configured Postgres Replication successfully, and checked it on both master and slave:
I tried to start Postgres manually by:
(hapg is my Postgres cluster)
Here is crm_mon output:
Do you have any idea?