kytos-ng / kytos-end-to-end-tests

Kytos End-to-End Tests

gitlab_rs_init can stay forever waiting for first node to become primary #311

Open italovalcy opened 4 weeks ago

italovalcy commented 4 weeks ago

Hi,

I've been observing that some CI/CD jobs never finish: they get stuck in the MongoDB init script, printing the following output:

Configuring replica set
Wrote 172.17.0.4:27017,172.17.0.5:27018,172.17.0.6:27019 to /tmp/host_seeds.txt
Waiting for node mongo1 to become primary
Waiting for the first node to be PRIMARY, current: None
Waiting for the first node to be PRIMARY, current: SECONDARY
Waiting for the first node to be PRIMARY, current: SECONDARY
Waiting for the first node to be PRIMARY, current: SECONDARY
Waiting for the first node to be PRIMARY, current: SECONDARY
Waiting for the first node to be PRIMARY, current: SECONDARY
....

The log above keeps repeating until the pipeline's configured timeout is reached.

Running the CI/CD job manually to troubleshoot, I was able to identify two things:

  1. We were supposed to configure the "first node" as the PRIMARY (as far as I could understand from the gitlab_rs_init.py script); however, we set its priority with i * 10, which yields 10 for the first node, while the MongoDB replica set configuration docs suggest using higher values to increase the likelihood of a node becoming primary (https://www.mongodb.com/docs/manual/reference/replica-configuration/); see the first sketch at the end of this comment:

A number that indicates the relative likelihood of a replica set member to become the primary. To increase the likelihood that a member becomes the primary, specify a higher priority value for that member. To decrease the likelihood that a member becomes the primary, specify a lower priority value for that member.

  2. Sometimes the primary node won't be the first member in the output of the replSetGetStatus command (probably because of issue 1 above?); see the second sketch at the end of this comment:
    >>> from pymongo import MongoClient
    >>> client = MongoClient(hosts[first_node]["host_port"], directConnection=True)
    >>> response = client.admin.command("replSetGetStatus")
    >>> response["members"][0]
    {'_id': 1, 'name': '172.17.0.4:27017', 'health': 1.0, 'state': 2, 'stateStr': 'SECONDARY', 'uptime': 603, 'optime': {'ts': Timestamp(1718093647, 1), 't': 2}, 'optimeDate': datetime.datetime(2024, 6, 11, 8, 14, 7), 'lastAppliedWallTime': datetime.datetime(2024, 6, 11, 8, 14, 7, 339000), 'lastDurableWallTime': datetime.datetime(2024, 6, 11, 8, 14, 7, 339000), 'syncSourceHost': '172.17.0.5:27018', 'syncSourceId': 2, 'infoMessage': '', 'configVersion': 1, 'configTerm': 2, 'self': True, 'lastHeartbeatMessage': ''}
    >>> response["members"][1]
    {'_id': 2, 'name': '172.17.0.5:27018', 'health': 1.0, 'state': 2, 'stateStr': 'SECONDARY', 'uptime': 541, 'optime': {'ts': Timestamp(1718093647, 1), 't': 2}, 'optimeDurable': {'ts': Timestamp(1718093647, 1), 't': 2}, 'optimeDate': datetime.datetime(2024, 6, 11, 8, 14, 7), 'optimeDurableDate': datetime.datetime(2024, 6, 11, 8, 14, 7), 'lastAppliedWallTime': datetime.datetime(2024, 6, 11, 8, 14, 7, 339000), 'lastDurableWallTime': datetime.datetime(2024, 6, 11, 8, 14, 7, 339000), 'lastHeartbeat': datetime.datetime(2024, 6, 11, 8, 14, 15, 555000), 'lastHeartbeatRecv': datetime.datetime(2024, 6, 11, 8, 14, 16, 101000), 'pingMs': 0, 'lastHeartbeatMessage': '', 'syncSourceHost': '172.17.0.6:27019', 'syncSourceId': 3, 'infoMessage': '', 'configVersion': 1, 'configTerm': 2}
    >>> response["members"][2]
    {'_id': 3, 'name': '172.17.0.6:27019', 'health': 1.0, 'state': 1, 'stateStr': 'PRIMARY', 'uptime': 541, 'optime': {'ts': Timestamp(1718093647, 1), 't': 2}, 'optimeDurable': {'ts': Timestamp(1718093647, 1), 't': 2}, 'optimeDate': datetime.datetime(2024, 6, 11, 8, 14, 7), 'optimeDurableDate': datetime.datetime(2024, 6, 11, 8, 14, 7), 'lastAppliedWallTime': datetime.datetime(2024, 6, 11, 8, 14, 7, 339000), 'lastDurableWallTime': datetime.datetime(2024, 6, 11, 8, 14, 7, 339000), 'lastHeartbeat': datetime.datetime(2024, 6, 11, 8, 14, 15, 559000), 'lastHeartbeatRecv': datetime.datetime(2024, 6, 11, 8, 14, 15, 561000), 'pingMs': 0, 'lastHeartbeatMessage': '', 'syncSourceHost': '', 'syncSourceId': -1, 'infoMessage': '', 'electionTime': Timestamp(1718093137, 1), 'electionDate': datetime.datetime(2024, 6, 11, 8, 5, 37), 'configVersion': 1, 'configTerm': 2}
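
For the first point above, here is a minimal sketch of what I mean (the variable names and the replica set name "rs0" are just illustrative, this is not the actual gitlab_rs_init.py code): if we want the first seed to be the most likely PRIMARY, it should get the highest priority value, not the lowest:

    from pymongo import MongoClient

    host_seeds = ["172.17.0.4:27017", "172.17.0.5:27018", "172.17.0.6:27019"]

    members = [
        {
            "_id": i,
            "host": host,
            # Higher priority = more likely to be elected PRIMARY, so the
            # first seed gets the highest value (30, then 20, then 10).
            "priority": (len(host_seeds) - i) * 10,
        }
        for i, host in enumerate(host_seeds)
    ]

    # "rs0" is only a placeholder replica set name for this sketch.
    client = MongoClient(host_seeds[0], directConnection=True)
    client.admin.command("replSetInitiate", {"_id": "rs0", "members": members})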
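
For the second point, the wait loop could look for the PRIMARY among all members instead of only checking members[0]. A rough sketch (again, the function name and timeout are just illustrative, not the actual script):

    import time

    from pymongo import MongoClient

    def wait_for_primary(host_port, timeout=120):
        """Return the name of whichever member is PRIMARY, scanning them all."""
        client = MongoClient(host_port, directConnection=True)
        deadline = time.time() + timeout
        while time.time() < deadline:
            status = client.admin.command("replSetGetStatus")
            states = [m["stateStr"] for m in status["members"]]
            for member in status["members"]:
                if member["stateStr"] == "PRIMARY":
                    return member["name"]
            print(f"Waiting for a PRIMARY, current states: {states}")
            time.sleep(5)
        raise TimeoutError("no PRIMARY elected within the timeout")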