Once we re-established the connection to hephy, we saw this crash:
2020-02-11 09:19:29,612 - VM Poller - 976254 - INFO - VM updates committed: 1
2020-02-11 09:19:32,317 - VM Poller - 976254 - ERROR - VM poller cycle while loop exception, process terminating...
Traceback (most recent call last):
File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
cursor, statement, parameters, context
File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
File "/usr/lib64/python3.6/site-packages/MySQLdb/cursors.py", line 206, in execute
res = self._query(query)
File "/usr/lib64/python3.6/site-packages/MySQLdb/cursors.py", line 312, in _query
db.query(q)
File "/usr/lib64/python3.6/site-packages/MySQLdb/connections.py", line 224, in query
_mysql.connection.query(self, query)
MySQLdb._exceptions.IntegrityError: (1062, "Duplicate entry 'atlas-cern-hephy-984aebac-54f3-4c1f-987f-f3d1f2c1b782' for key 'PRIMARY'")
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/cloudscheduler/data_collectors/openstack/openstackPoller.py", line 1461, in vm_poller
db_session.commit()
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1026, in commit
self.transaction.commit()
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 493, in commit
self._prepare_impl()
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 472, in _prepare_impl
self.session.flush()
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2451, in flush
self._flush(objects)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2589, in _flush
transaction.rollback(_capture_exception=True)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
compat.reraise(exc_type, exc_value, exc_tb)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 129, in reraise
raise value
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 2549, in _flush
flush_context.execute()
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
rec.execute(self)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/unitofwork.py", line 589, in execute
uow,
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 245, in save_obj
insert,
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/persistence.py", line 1066, in _emit_insert_statements
c = cached_connections[connection].execute(statement, multiparams)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 988, in execute
return meth(self, multiparams, params)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/sql/elements.py", line 287, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1107, in _execute_clauseelement
distilled_params,
File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1248, in _execute_context
e, statement, parameters, cursor, context
File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1466, in _handle_dbapi_exception
util.raise_from_cause(sqlalchemy_exception, exc_info)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 383, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb, cause=cause)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/util/compat.py", line 128, in reraise
raise value.with_traceback(tb)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/base.py", line 1244, in _execute_context
cursor, statement, parameters, context
File "/usr/lib64/python3.6/site-packages/sqlalchemy/engine/default.py", line 552, in do_execute
cursor.execute(statement, parameters)
File "/usr/lib64/python3.6/site-packages/MySQLdb/cursors.py", line 206, in execute
res = self._query(query)
File "/usr/lib64/python3.6/site-packages/MySQLdb/cursors.py", line 312, in _query
db.query(q)
File "/usr/lib64/python3.6/site-packages/MySQLdb/connections.py", line 224, in query
_mysql.connection.query(self, query)
sqlalchemy.exc.IntegrityError: (MySQLdb._exceptions.IntegrityError) (1062, "Duplicate entry 'atlas-cern-hephy-984aebac-54f3-4c1f-987f-f3d1f2c1b782' for key 'PRIMARY'")
[SQL: INSERT INTO csv2_vms (group_name, cloud_name, target_alias, region, vmid, instance_id, cloud_type, vm_ips, vm_floating_ips, auth_url, project, hostname, status, flavor_id, image_id, task, power_status, htcondor_startd_errors, htcondor_startd_time, htcondor_partitionable_slots, htcondor_dynamic_slots, htcondor_slots_timestamp, retire_time, terminate_time, status_changed_time, last_updated, updater) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)]
[parameters: ('atlas-cern', 'hephy', None, 'RegionOne', '984aebac-54f3-4c1f-987f-f3d1f2c1b782', None, 'openstack', "['192.168.1.5']", '[]', 'https://balrog-ctl.uibk.ac.at:5000/v3', 'atlas', 'atlas-cern--hephy--1782--119318012926408', 'ACTIVE', '461aa9c8-5de2-48f8-a6ff-939596b22fa7', 'eff5a8e7-3408-4684-b623-9fd7ed48dd31', 'deleting', 1, None, None, None, None, None, None, None, None, 1581441564, None)]
(Background on this error at: http://sqlalche.me/e/gkpj)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/cloudscheduler/data_collectors/openstack/openstackPoller.py", line 1464, in vm_poller
logging.exception("Failed to commit VM updates for %s::%s:%s, using group %s's credentials aborting cycle..." % (cloud_obj.authurl, cloud_obj.project, cloud_obj.region, cloud_obj.group_name))
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/attributes.py", line 276, in __get__
return self.impl.get(instance_state(instance), dict_)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/attributes.py", line 677, in get
value = state._load_expired(state, passive)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/state.py", line 660, in _load_expired
self.manager.deferred_scalar_loader(self, toload)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/loading.py", line 979, in load_scalar_attributes
only_load_props=attribute_names,
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/loading.py", line 208, in load_on_ident
identity_token=identity_token,
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/loading.py", line 282, in load_on_pk_identity
return q.one()
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3275, in one
ret = self.one_or_none()
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3244, in one_or_none
ret = list(self)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3317, in __iter__
return self._execute_and_instances(context)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3339, in _execute_and_instances
querycontext, self._connection_from_session, close_with_result=True
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3354, in _get_bind_args
mapper=self._bind_mapper(), clause=querycontext.statement, **kw
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/query.py", line 3332, in _connection_from_session
conn = self.session.connection(**kw)
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1123, in connection
execution_options=execution_options,
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 1129, in _connection_for_bind
engine, execution_options
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 407, in _connection_for_bind
self._assert_active()
File "/usr/lib64/python3.6/site-packages/sqlalchemy/orm/session.py", line 294, in _assert_active
% self._rollback_exception
sqlalchemy.exc.InvalidRequestError: This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (MySQLdb._exceptions.IntegrityError) (1062, "Duplicate entry 'atlas-cern-hephy-984aebac-54f3-4c1f-987f-f3d1f2c1b782' for key 'PRIMARY'")
[SQL: INSERT INTO csv2_vms (group_name, cloud_name, target_alias, region, vmid, instance_id, cloud_type, vm_ips, vm_floating_ips, auth_url, project, hostname, status, flavor_id, image_id, task, power_status, htcondor_startd_errors, htcondor_startd_time, htcondor_partitionable_slots, htcondor_dynamic_slots, htcondor_slots_timestamp, retire_time, terminate_time, status_changed_time, last_updated, updater) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)]
[parameters: ('atlas-cern', 'hephy', None, 'RegionOne', '984aebac-54f3-4c1f-987f-f3d1f2c1b782', None, 'openstack', "['192.168.1.5']", '[]', 'https://balrog-ctl.uibk.ac.at:5000/v3', 'atlas', 'atlas-cern--hephy--1782--119318012926408', 'ACTIVE', '461aa9c8-5de2-48f8-a6ff-939596b22fa7', 'eff5a8e7-3408-4684-b623-9fd7ed48dd31', 'deleting', 1, None, None, None, None, None, None, None, None, 1581441564, None)]
(Background on this error at: http://sqlalche.me/e/gkpj)
2020-02-11 09:19:32,324 - VM Poller - 976254 - ERROR - This Session's transaction has been rolled back due to a previous exception during flush. To begin a new transaction with this Session, first issue Session.rollback(). Original exception was: (MySQLdb._exceptions.IntegrityError) (1062, "Duplicate entry 'atlas-cern-hephy-984aebac-54f3-4c1f-987f-f3d1f2c1b782' for key 'PRIMARY'")
[SQL: INSERT INTO csv2_vms (group_name, cloud_name, target_alias, region, vmid, instance_id, cloud_type, vm_ips, vm_floating_ips, auth_url, project, hostname, status, flavor_id, image_id, task, power_status, htcondor_startd_errors, htcondor_startd_time, htcondor_partitionable_slots, htcondor_dynamic_slots, htcondor_slots_timestamp, retire_time, terminate_time, status_changed_time, last_updated, updater) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)]
[parameters: ('atlas-cern', 'hephy', None, 'RegionOne', '984aebac-54f3-4c1f-987f-f3d1f2c1b782', None, 'openstack', "['192.168.1.5']", '[]', 'https://balrog-ctl.uibk.ac.at:5000/v3', 'atlas', 'atlas-cern--hephy--1782--119318012926408', 'ACTIVE', '461aa9c8-5de2-48f8-a6ff-939596b22fa7', 'eff5a8e7-3408-4684-b623-9fd7ed48dd31', 'deleting', 1, None, None, None, None, None, None, None, None, 1581441564, None)]
(Background on this error at: http://sqlalche.me/e/gkpj)
2020-02-11 09:19:35,234 - MainProcess - 502884 - ERROR - vm process died, restarting...
We triggered this error ourselves by deleting the hephy VMs from the database, but we should still wrap the insertion in a try/except, as sketched below.
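A minimal sketch of what that could look like, assuming SQLAlchemy session handling similar to the traceback above; the function and variable names (commit_vm_updates, group_name, cloud_name) are placeholders, not existing csv2 code:

    import logging
    from sqlalchemy.exc import IntegrityError

    def commit_vm_updates(db_session, group_name, cloud_name):
        # Build the log label from plain values up front: after a rollback,
        # reading attributes of expired ORM objects raises the
        # InvalidRequestError seen in the second traceback above.
        label = "%s::%s" % (group_name, cloud_name)
        try:
            db_session.commit()
        except IntegrityError:
            # Duplicate primary key (or similar constraint violation): roll the
            # session back so it is usable again, log, and let the poller
            # continue instead of terminating the process.
            db_session.rollback()
            logging.exception("Failed to commit VM updates for %s, skipping..." % label)
            return False
        return True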
Closing.
In production we had an issue where one of the clouds became unreachable while a number of VMs were still active. csv2 was unable to terminate these machines, and the poller was getting constant timeouts, greatly increasing the cycle time and slowing down operations for the other VMs.
We need to come up with a strategy that identifies unreachable VMs (or perhaps clouds) and temporarily skips them, instead of accumulating hundreds of timeouts that leave other VM operations waiting for potentially hours.
One solution may be to keep a local record of VM hosts that time out or fail and only issue commands to these problem VMs every 10th cycle (a rough sketch follows below). Another strategy could be to detect when a cloud is unreachable and maintain a list somewhere in the csv2 database, accessible to the pollers, so they know which clouds to skip.
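A rough sketch of the first idea, kept purely in the poller's memory; the names (SKIP_CYCLES, failure_counts, should_poll, record_result) are illustrative assumptions rather than existing csv2 code:

    SKIP_CYCLES = 10
    failure_counts = {}   # vm_hostname -> consecutive timeouts/failures

    def should_poll(vm_hostname, cycle_number):
        # VMs with no recorded failures are polled every cycle; problem VMs
        # are only retried every SKIP_CYCLES-th cycle.
        if failure_counts.get(vm_hostname, 0) == 0:
            return True
        return cycle_number % SKIP_CYCLES == 0

    def record_result(vm_hostname, succeeded):
        # Clear the record on success so the VM returns to normal polling.
        if succeeded:
            failure_counts.pop(vm_hostname, None)
        else:
            failure_counts[vm_hostname] = failure_counts.get(vm_hostname, 0) + 1

The cloud-level variant would be the same bookkeeping keyed on the cloud rather than the VM host, with the list stored in a csv2 table so all pollers can read it.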