hep-gc / cloudscheduler

Apache License 2.0

Improve condor command poller to mark/skip problem VMs #287

Closed colsond closed 4 years ago

colsond commented 4 years ago

On production we had an issue where one of the clouds became unreachable while a number of VMs were still active. csv2 was unable to terminate these machines, and the poller kept hitting timeouts, which greatly increased the cycle time and slowed down operations for other VMs.

We need a strategy that identifies unreachable VMs (or perhaps clouds) and temporarily skips them, instead of accumulating hundreds of timeouts and leaving other VM operations waiting for potentially hours.

One solution would be to keep a local record of VM hosts that time out or fail, and only issue commands to these problem VMs every 10th cycle. Another would be to detect when a cloud is unreachable and maintain a list in the csv2 database, accessible to all the pollers, so they know which clouds to skip.
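The first strategy (a local record of failing hosts, retried every Nth cycle) could be sketched roughly as follows. This is illustrative only; the class and method names are hypothetical, not part of cloudscheduler:

```python
class ProblemHostTracker:
    """Hypothetical sketch: track hosts whose commands time out or fail,
    so the poller can skip them and only retry every `skip_cycles` cycles
    (e.g. every 10th cycle, as proposed above)."""

    def __init__(self, skip_cycles=10):
        self.skip_cycles = skip_cycles
        # Map host -> poller cycle number at which it last failed.
        self.failures = {}

    def record_failure(self, host, cycle):
        self.failures[host] = cycle

    def record_success(self, host):
        # A successful command clears the host's problem status.
        self.failures.pop(host, None)

    def should_skip(self, host, cycle):
        last_failed = self.failures.get(host)
        if last_failed is None:
            return False
        # Retry only when a multiple of skip_cycles has elapsed
        # since the last recorded failure; skip otherwise.
        return (cycle - last_failed) % self.skip_cycles != 0
```

The poller loop would call `should_skip()` before issuing condor commands to a host, and `record_failure()`/`record_success()` afterwards. The cloud-level variant would be the same idea keyed on cloud name, with the failure list stored in the csv2 database instead of process memory.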

rseuster commented 4 years ago

Once we re-established the connection to hephy, we saw this crash:

2020-02-11 09:19:29,612 - VM Poller    - 976254 - INFO - VM updates committed: 1
2020-02-11 09:19:32,317 - VM Poller    - 976254 - ERROR - VM poller cycle while loop exception, process terminating...
Traceback (most recent call last):
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
    cursor, statement, parameters, context
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib64/python3.6/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/usr/lib64/python3.6/site-packages/MySQLdb/cursors.py", line 312, in _query
    db.query(q)
  File "/usr/lib64/python3.6/site-packages/MySQLdb/connections.py", line 224, in query
    _mysql.connection.query(self, query)
MySQLdb._exceptions.IntegrityError: (1062, "Duplicate entry 'atlas-cern-hephy-984aebac-54f3-4c1f-987f-f3d1f2c1b782' for key 'PRIMARY'")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/cloudscheduler/data_collectors/openstack/openstackPoller.py", line 1461, in vm_poller
    db_session.commit()
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1026, in commit
    self.transaction.commit()
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 493, in commit
    self._prepare_impl()
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 472, in _prepare_impl
    self.session.flush()
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2451, in flush
    self._flush(objects)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2589, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
    raise value
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2549, in _flush
    flush_context.execute()
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
    uow,
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
    insert,
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 1066, in _emit_insert_statements
    c = cached_connections[connection].execute(statement, multiparams)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 988, in execute
    return meth(self, multiparams, params)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
    distilled_params,
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
    util.raise_from_cause(sqlalchemy_exception, exc_info)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb, cause=cause)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
    raise value.with_traceback(tb)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
    cursor, statement, parameters, context
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib64/python3.6/site-packages/MySQLdb/cursors.py", line 206, in execute
    res = self._query(query)
  File "/usr/lib64/python3.6/site-packages/MySQLdb/cursors.py", line 312, in _query
    db.query(q)
  File "/usr/lib64/python3.6/site-packages/MySQLdb/connections.py", line 224, in query
    _mysql.connection.query(self, query)
sqlalchemy.exc.IntegrityError: (MySQLdb._exceptions.IntegrityError) (1062, "Duplicate entry 'atlas-cern-hephy-984aebac-54f3-4c1f-987f-f3d1f2c1b782' for key 'PRIMARY'")
[SQL: INSERT INTO csv2_vms (group_name, cloud_name, target_alias, region, vmid, instance_id, cloud_type, vm_ips, vm_floating_ips, auth_url, project, hostname, status, flavor_id, image_id, task, power_status, htcondor_startd_errors, htcondor_startd_time, htcondor_partitionable_slots, htcondor_dynamic_slots, htcondor_slots_timestamp, retire_time, terminate_time, status_changed_time, last_updated, updater) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)]
[parameters: ('atlas-cern', 'hephy', None, 'RegionOne', '984aebac-54f3-4c1f-987f-f3d1f2c1b782', None, 'openstack', "['192.168.1.5']", '[]', 'https://balrog-ctl.uibk.ac.at:5000/v3', 'atlas', 'atlas-cern--hephy--1782--119318012926408', 'ACTIVE', '461aa9c8-5de2-48f8-a6ff-939596b22fa7', 'eff5a8e7-3408-4684-b623-9fd7ed48dd31', 'deleting', 1, None, None, None, None, None, None, None, None, 1581441564, None)]
(Background on this error at: http://sqlalche.me/e/gkpj)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/cloudscheduler/data_collectors/openstack/openstackPoller.py", line 1464, in vm_poller
    logging.exception("Failed to commit VM updates for %s::%s:%s, using group %s's credentials aborting cycle..." % (cloud_obj.authurl, cloud_obj.project, cloud_obj.region, cloud_obj.group_name))
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/attributes.py", line 276, in __get__
    return self.impl.get(instance_state(instance), dict_)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/attributes.py", line 677, in get
    value = state._load_expired(state, passive)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/state.py", line 660, in _load_expired
    self.manager.deferred_scalar_loader(self, toload)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/loading.py", line 979, in load_scalar_attributes
    only_load_props=attribute_names,
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/loading.py", line 208, in load_on_ident
    identity_token=identity_token,
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/loading.py", line 282, in load_on_pk_identity
    return q.one()
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3275, in one
    ret = self.one_or_none()
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3244, in one_or_none
    ret = list(self)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3317, in __iter__
    return self._execute_and_instances(context)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3339, in _execute_and_instances
    querycontext, self._connection_from_session, close_with_result=True
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3354, in _get_bind_args
    mapper=self._bind_mapper(), clause=querycontext.statement, **kw
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3332, in _connection_from_session
    conn = self.session.connection(**kw)
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1123, in connection
    execution_options=execution_options,
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1129, in _connection_for_bind
    engine, execution_options
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 407, in _connection_for_bind
    self._assert_active()
  File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 294, in _assert_active
    % self._rollback_exception
sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (MySQLdb._exceptions.IntegrityError) (1062, "Duplicate entry 'atlas-cern-hephy-984aebac-54f3-4c1f-987f-f3d1f2c1b782' for key 'PRIMARY'")
[SQL: INSERT INTO csv2_vms (group_name, cloud_name, target_alias, region, vmid, instance_id, cloud_type, vm_ips, vm_floating_ips, auth_url, project, hostname, status, flavor_id, image_id, task, power_status, htcondor_startd_errors, htcondor_startd_time, htcondor_partitionable_slots, htcondor_dynamic_slots, htcondor_slots_timestamp, retire_time, terminate_time, status_changed_time, last_updated, updater) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)]
[parameters: ('atlas-cern', 'hephy', None, 'RegionOne', '984aebac-54f3-4c1f-987f-f3d1f2c1b782', None, 'openstack', "['192.168.1.5']", '[]', 'https://balrog-ctl.uibk.ac.at:5000/v3', 'atlas', 'atlas-cern--hephy--1782--119318012926408', 'ACTIVE', '461aa9c8-5de2-48f8-a6ff-939596b22fa7', 'eff5a8e7-3408-4684-b623-9fd7ed48dd31', 'deleting', 1, None, None, None, None, None, None, None, None, 1581441564, None)]
(Background on this error at: http://sqlalche.me/e/gkpj)
2020-02-11 09:19:32,324 - VM Poller    - 976254 - ERROR - This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (MySQLdb._exceptions.IntegrityError) (1062, "Duplicate entry 'atlas-cern-hephy-984aebac-54f3-4c1f-987f-f3d1f2c1b782' for key 'PRIMARY'")
[SQL: INSERT INTO csv2_vms (group_name, cloud_name, target_alias, region, vmid, instance_id, cloud_type, vm_ips, vm_floating_ips, auth_url, project, hostname, status, flavor_id, image_id, task, power_status, htcondor_startd_errors, htcondor_startd_time, htcondor_partitionable_slots, htcondor_dynamic_slots, htcondor_slots_timestamp, retire_time, terminate_time, status_changed_time, last_updated, updater) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)]
[parameters: ('atlas-cern', 'hephy', None, 'RegionOne', '984aebac-54f3-4c1f-987f-f3d1f2c1b782', None, 'openstack', "['192.168.1.5']", '[]', 'https://balrog-ctl.uibk.ac.at:5000/v3', 'atlas', 'atlas-cern--hephy--1782--119318012926408', 'ACTIVE', '461aa9c8-5de2-48f8-a6ff-939596b22fa7', 'eff5a8e7-3408-4684-b623-9fd7ed48dd31', 'deleting', 1, None, None, None, None, None, None, None, None, 1581441564, None)]
(Background on this error at: http://sqlalche.me/e/gkpj)
2020-02-11 09:19:35,234 - MainProcess  - 502884 - ERROR - vm process died, restarting...

We triggered this error ourselves by deleting the hephy VMs from the database, but we should still wrap the insertion in a try/except.
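A sketch of what that try/except could look like (function and variable names here are illustrative, not the actual `vm_poller` code). Note that the second traceback above shows the crash actually came from the `logging.exception()` call accessing expired ORM attributes after the failed flush, so the session should be rolled back before logging, and the log message should avoid touching ORM objects:

```python
from sqlalchemy import exc


def commit_vm_updates(db_session, logger, cloud_label):
    """Hypothetical sketch: commit pending csv2_vms rows, rolling back on
    IntegrityError (e.g. MySQL 1062 duplicate primary key) so the session
    stays usable for the next poller cycle.

    `cloud_label` is a plain string built *before* the commit, so logging
    it cannot trigger a lazy ORM load on a rolled-back session.
    """
    try:
        db_session.commit()
        return True
    except exc.IntegrityError as ex:
        # Roll back first: after a failed flush the session refuses all
        # further SQL (including attribute refreshes) until rollback.
        db_session.rollback()
        logger.error("Failed to commit VM updates for %s, aborting cycle: %s",
                     cloud_label, ex)
        return False
```

In the poller, the label would be something like `"%s::%s:%s" % (cloud_obj.authurl, cloud_obj.project, cloud_obj.region)` computed up front; the caller skips to the next cloud when the function returns `False`.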

rseuster commented 4 years ago

closing