ic-hep / pdm

0 stars 0 forks source link

More database problems updating jobs #421

Open sfayer opened 6 years ago

sfayer commented 6 years ago

We got this stack trace which testing the BY_SIZE algorithm (possibly multiple worker instances updating things at the same time or something like that)?

'2018-11-09 16:20:22,015 pdm.utils.db ERROR Error updating job
Traceback (most recent call last):
  File "/home/pdm/pdm/src/pdm/utils/db.py", line 23, in managed_session
    yield request.db.session
  File "/home/pdm/pdm/src/pdm/workqueue/WorkqueueDB.py", line 256, in update
    session.merge(self)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 150, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1604, in merge
    self._autoflush()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1198, in _autoflush
    self.flush()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1919, in flush
    self._flush(objects)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2037, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2001, in _flush
    flush_context.execute()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 372, in execute
    rec.execute(self)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 526, in execute
    uow
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 46, in save_obj
    uowtransaction)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 171, in _organize_states_for_save
    state_str(existing)))
FlushError: New instance <JobElement at 0x7f0cb5f5e690> with identity key (<class 'pdm.workqueue.WorkqueueDB.JobElement'>, (1, 417)) conflicts with persistent instance <JobElement at 0x7f0cb6812a90>
'2018-11-09 16:20:22,092 server/workqueue INFO 80242100-9eec-4e48-be95-7d2568163ef7: 146.179.245.46 PUT https://pdm00.grid.hep.ph.ic.ac.uk:5446/workqueue/api/v1.0/worker/jobs/417/elements/0 500 154
'2018-11-09 16:20:22,019 server/workqueue INFO 3ef40f34-5221-45cd-837f-48e06dcb94a7: 146.179.245.46 POST https://pdm00.grid.hep.ph.ic.ac.uk:5446/workqueue/api/v1.0/worker/jobs 404 131
'2018-11-09 16:20:22,020 pdm.utils.db ERROR Error updating job element
Traceback (most recent call last):
  File "/home/pdm/pdm/src/pdm/utils/db.py", line 23, in managed_session
    yield request.db.session
  File "/home/pdm/pdm/src/pdm/workqueue/WorkqueueDB.py", line 349, in update
    session.merge(self)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 150, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1604, in merge
    self._autoflush()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1208, in _autoflush
    util.raise_from_cause(e)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1198, in _autoflush
    self.flush()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 1919, in flush
    self._flush(objects)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2037, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 60, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/session.py", line 2001, in _flush
    flush_context.execute()
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 372, in execute
    rec.execute(self)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 526, in execute
    uow
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 60, in save_obj
    mapper, table, update)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 518, in _emit_update_statements
    execute(statement, params)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 729, in execute
    return meth(self, multiparams, params)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/sql/elements.py", line 322, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 826, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 958, in _execute_context
    context)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 1159, in _handle_dbapi_exception
    exc_info
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/base.py", line 951, in _execute_context
    context)
  File "/usr/lib64/python2.7/site-packages/sqlalchemy/engine/default.py", line 436, in do_execute
    cursor.execute(statement, parameters)
IntegrityError: (raised as a result of Query-invoked autoflush; consider using a session.no_autoflush block if this flush is occurring prematurely) (IntegrityError) constraint failed u'UPDATE jobelements SET attempts=?, timestamp=? WHERE jobelements.id = ? AND jobelements.job_id = ?' (3, '2018-11-09 16:20:22.020414', 0, 417)
alexanderrichards commented 5 years ago

looks like the db is being updated with a value that fails the constraint, specifically the attempts constraint.

https://github.com/ic-hep/pdm/blob/a98949a6717be2d84e1599918a3a89d94741fb9a/src/pdm/workqueue/WorkqueueDB.py#L287

Where unless specified max_tries defaults to 2: https://github.com/ic-hep/pdm/blob/a98949a6717be2d84e1599918a3a89d94741fb9a/src/pdm/workqueue/WorkqueueDB.py#L298

This could be a race condition... However given we don't see this with the BY_NUMBER algorithm seems that this is to do with the BY_SIZE algorithm. This has been changed extensively in #436 so maybe we should wait and see if this is still an issue after that PR is merged