habitat-sh / habitat


Census not present on all nodes #2976

Closed: bodymindarts closed this issue 7 years ago

bodymindarts commented 7 years ago

I have a docker-compose file with one standalone postgres node (group postgresql.default) and a 3-node cluster (group postgresql.cluster) that gets bootstrapped by peering with the standalone node. The containers that get brought up are built with the latest Habitat, 0.29.1.
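
For reference, here's a minimal sketch of what such a compose file might look like. The original file isn't attached to this issue, so the image names and exact supervisor arguments are assumptions based on the 0.29-era CLI (9638 and 9631 are the default gossip and HTTP gateway ports):

    version: '2'
    services:
      standalone:
        image: my-postgresql-habitat    # hypothetical image built with habitat 0.29.1
        command: start core/postgresql --group default
      pg1:
        image: my-postgresql-habitat
        command: start core/postgresql --group cluster --peer standalone
      pg2:
        image: my-postgresql-habitat
        command: start core/postgresql --group cluster --peer standalone
      pg3:
        image: my-postgresql-habitat
        command: start core/postgresql --group cluster --peer standalone
      tests:
        image: my-tests                 # hypothetical; only needs curl and jq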

The bootstrapping and clustering work, but when querying the standalone node, the cluster nodes aren't visible in the returned census data.

To reproduce, run the following in this folder:

$ docker-compose up -d
$ docker-compose run tests curl -s standalone:9631/census | jq -r '.census_groups | keys'
Starting tests_standalone_1 ... done
[
  "postgresql.default"
]
$ docker-compose run tests curl -s pg1:9631/census | jq -r '.census_groups | keys'
Starting tests_standalone_1 ... done
[
  "postgresql.cluster",
  "postgresql.default"
]
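
If gossip were propagating correctly, the first query against the standalone node would be expected to return the same two keys that pg1 reports:

[
  "postgresql.cluster",
  "postgresql.default"
]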
smartb-pair commented 7 years ago

I think I'm seeing this as well on a standard VM.

reset commented 7 years ago

This looks like a network connectivity issue. Can you verify that TCP/UDP ports are open and communication is flowing between all nodes over the port you've configured (default: 9638)?

bodymindarts commented 7 years ago

This is running locally on Docker for Mac! There's no way it's a connection issue.

bodymindarts commented 7 years ago

Just made sure and can confirm that I am able to telnet from every node to every other node on port 9638.
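
A check along those lines, assuming nc is available inside the containers and using the service names from the compose sketch above, might look like this (a UDP "success" from nc is only best-effort, since UDP is connectionless):

$ docker-compose exec pg1 nc -z standalone 9638 && echo tcp ok
$ docker-compose exec pg1 nc -zu standalone 9638 && echo udp ok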

smartb-pair commented 7 years ago

(Note: I'm assuming the behavior I'm seeing is the same issue that @bodymindarts is reporting.) I'm able to talk to my permanent peer's supervisor over HTTP, and I can hit the UDP port with netcat. The census and butterfly data look like group members are missing:

root@dev-postgres-0:~# curl -s localhost:9631/census | jq .[]
true
{
  "postgresql.dev": {
    "service_group": "postgresql.dev",
    "election_status": "ElectionNoQuorum",
    "update_election_status": "None",
    "leader_id": null,
    "service_config": null,
    "local_member_id": "a529235923ab4ce38b3abf9f5413ff46",
    "population": {
      "a529235923ab4ce38b3abf9f5413ff46": {
        "member_id": "a529235923ab4ce38b3abf9f5413ff46",
        "pkg": {
          "origin": "core",
          "name": "postgresql",
          "version": "9.6.3",
          "release": "20170727171300"
        },
        "application": null,
        "environment": null,
        "service": "postgresql",
        "group": "dev",
        "org": null,
        "initialized": false,
        "persistent": false,
        "leader": false,
        "follower": false,
        "update_leader": false,
        "update_follower": false,
        "election_is_running": false,
        "election_is_no_quorum": true,
        "election_is_finished": false,
        "update_election_is_running": false,
        "update_election_is_no_quorum": false,
        "update_election_is_finished": false,
        "alive": true,
        "suspect": false,
        "confirmed": false,
        "departed": false,
        "sys": {
          "ip": "10.224.74.58",
          "hostname": "dev-postgres-0",
          "gossip_ip": "0.0.0.0",
          "gossip_port": 9638,
          "http_gateway_ip": "0.0.0.0",
          "http_gateway_port": 9631
        },
        "cfg": {
          "port": "5432",
          "superuser_name": "admin",
          "superuser_password": "admin"
        }
      }
    },
    "update_leader_id": null,
    "changed_service_files": [],
    "service_files": {}
  }
}
"a529235923ab4ce38b3abf9f5413ff46"
1
1
0
0
0
0

root@dev-postgres-0:~# curl -s localhost:9631/butterfly | jq .[]
{
  "members": {},
  "health": {},
  "update_counter": 0
}
{
  "list": {
    "postgresql.dev": {
      "a529235923ab4ce38b3abf9f5413ff46": {
        "type": 2,
        "tag": [],
        "from_id": "a529235923ab4ce38b3abf9f5413ff46",
        "service": {
          "member_id": "a529235923ab4ce38b3abf9f5413ff46",
          "service_group": "postgresql.dev",
          "package": "core/postgresql/9.6.3/20170727171300",
          "incarnation": 1,
          "cfg": {
            "port": "5432",
            "superuser_name": "admin",
            "superuser_password": "admin"
          },
          "sys": {
            "ip": "10.224.74.58",
            "hostname": "dev-postgres-0",
            "gossip_ip": "0.0.0.0",
            "gossip_port": 9638,
            "http_gateway_ip": "0.0.0.0",
            "http_gateway_port": 9631
          },
          "initialized": false
        }
      }
    }
  },
  "update_counter": 1
}
{
  "list": {},
  "update_counter": 0
}
{
  "list": {},
  "update_counter": 0
}
{
  "list": {
    "postgresql.dev": {
      "election": {
        "type": 3,
        "tag": [],
        "from_id": "a529235923ab4ce38b3abf9f5413ff46",
        "election": {
          "member_id": "a529235923ab4ce38b3abf9f5413ff46",
          "service_group": "postgresql.dev",
          "term": 0,
          "suitability": 0,
          "status": 2,
          "votes": [
            "a529235923ab4ce38b3abf9f5413ff46"
          ]
        }
      }
    }
  },
  "update_counter": 1
}
{
  "list": {},
  "update_counter": 0
}
{
  "list": {},
  "update_counter": 0
}
reset commented 7 years ago

I'm pretty sure this is because the postgresql plan in master is broken: it blocks the supervisor's main thread by calling the hab binary from within its post-run hook.

There are two rules for hooks, and this breaks both: a hook must not block the supervisor, and it must not call the hab binary from within itself.
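
Purely as a hypothetical illustration (the actual core/postgresql hook isn't reproduced here), a post-run hook shaped like the following would trigger the problem, because the supervisor waits on the hook and meanwhile can't process gossip:

    #!/bin/sh
    # hooks/post-run (hypothetical sketch of the anti-pattern described above)
    # Shelling out to the hab binary from inside a hook blocks the
    # supervisor's main thread until the command returns.
    hab config apply postgresql.cluster 1 /tmp/initial-config.toml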

bodymindarts commented 7 years ago

Okay, that might be the issue. This used to work on an earlier version. I'll see if I can fix PG to take those 'rules' into account.

reset commented 7 years ago

Gonna close this since the problem is in the postgresql plan. This is the ticket we're working on to enable multiple services per package: https://github.com/habitat-sh/habitat/issues/2902

bodymindarts commented 7 years ago

I can confirm that this issue doesn't come up when no sidecar process is being run.