mutalyzer / mutalyzer2

HGVS variant nomenclature checker
https://mutalyzer.nl

Updating reference checksum might break integrity #352

Open martijnvermaat opened 8 years ago

martijnvermaat commented 8 years ago

Just now, Mutalyzer on our main server tried to update the MD5 checksum for an NM reference. This failed because the database already contained a reference (UD) with the new checksum (indeed, it was the same file, presumably uploaded earlier by hand).

The database error occurred in the _update_db_md5 method of the retriever module. Not sure what the best course of action would be in this case though.

In this case the result was a bit dramatic, since updating the MD5 checksum was triggered by a batch job and therefore the batch processor got stuck on this entry.
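
The failing update can be reproduced in isolation. This is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the real PostgreSQL/SQLAlchemy setup. The two-column table layout, the UD accession, and the old checksum are invented for illustration. Only the unique checksum index and the NM accession/checksum pair come from the error reported in this issue.

```python
import sqlite3


def make_cache():
    """Create an in-memory stand-in for the reference cache (assumed schema)."""
    db = sqlite3.connect(":memory:")
    db.execute('CREATE TABLE "references" (accession TEXT PRIMARY KEY, checksum TEXT)')
    db.execute('CREATE UNIQUE INDEX ix_references_checksum ON "references" (checksum)')
    return db


def update_checksum(db, accession, checksum):
    """Naive checksum update, comparable to what _update_db_md5 attempts."""
    db.execute(
        'UPDATE "references" SET checksum = ? WHERE accession = ?',
        (checksum, accession),
    )


NEW_CHECKSUM = "fb263a5e992d38a549882889d14f5912"
OLD_CHECKSUM = "0" * 32  # placeholder for the outdated NM checksum

db = make_cache()
db.execute('INSERT INTO "references" VALUES (?, ?)', ("NM_000022.2", OLD_CHECKSUM))
# Hypothetical UD accession for the same file, uploaded by hand earlier.
db.execute('INSERT INTO "references" VALUES (?, ?)', ("UD_000000000001", NEW_CHECKSUM))

try:
    update_checksum(db, "NM_000022.2", NEW_CHECKSUM)
    failed = False
except sqlite3.IntegrityError as error:
    # The unique index rejects a second record with the same checksum.
    failed = True
    print("update rejected:", error)
```

The NM record keeps its old checksum after the rejected update, so every later request for the same reference runs into the same error.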

martijnvermaat commented 8 years ago

I just saw a few more occurrences of this. Actually, the batch entry that is being processed when the error occurs has already been removed from the queue, so the batch processor will be able to resume with the next entry. But if the next few entries use the same reference (this happened in the situation above), systemd stops the batch processor service because it crashed too many times in a short period.

Here's an example:

Traceback (most recent call last):
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/Scheduler.py", line 440, in _processNameBatch
    variantchecker.check_variant(cmd, O)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/variantchecker.py", line 1743, in check_variant
    retrieved_record = retriever.loadrecord(record_id)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/Retriever.py", line 771, in loadrecord
    filename = self.fetch(identifier)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/Retriever.py", line 413, in fetch
    return self._update_db_md5(raw_data, name, gi)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/Retriever.py", line 156, in _update_db_md5
    {'checksum': md5sum})
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 3005, in update
    update_op.exec_()
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 1112, in exec_
    self._do_exec()
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 1261, in _do_exec
    mapper=self.mapper)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1034, in execute
    bind, close_with_result=True).execute(clause, params or {})
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
    context)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
    exc_info
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
    context)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
    cursor.execute(statement, parameters)
IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "ix_references_checksum"
DETAIL:  Key (checksum)=(fb263a5e992d38a549882889d14f5912) already exists.
 [SQL: 'UPDATE "references" SET checksum=%(checksum)s WHERE "references".accession = %(accession_1)s'] [parameters: {'checksum': u'fb263a5e992d38a549882889d14f5912', 'accession_1': u'NM_000022.2'}]
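
Independent of how the checksum conflict itself gets resolved, the crash loop could be avoided by isolating failures per batch entry. The following is only a sketch of that idea, not the actual Scheduler code: process_entries, flaky_check, and the variant descriptions are hypothetical names and values made up for this example.

```python
import logging

logger = logging.getLogger("batch")


def process_entries(entries, check):
    """Run check(entry) for every entry and collect failures instead of crashing."""
    results, failures = [], []
    for entry in entries:
        try:
            results.append((entry, check(entry)))
        except Exception as error:  # deliberately broad: one bad entry must not stop the batch
            logger.warning("entry %r failed: %s", entry, error)
            failures.append((entry, error))
    return results, failures


def flaky_check(entry):
    """Hypothetical checker that fails for one reference, like the case above."""
    if entry.startswith("NM_000022.2"):
        raise RuntimeError("duplicate key value violates unique constraint")
    return "ok"


# Illustrative entries: two consecutive ones share the problematic reference.
entries = ["NM_000022.2:c.1A>G", "NM_000022.2:c.2T>C", "NM_003002.2:c.274G>T"]
results, failures = process_entries(entries, flaky_check)
```

With this structure the two failing entries are logged and reported, and the third entry is still processed, so the service does not exit repeatedly.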

martijnvermaat commented 8 years ago

Some large batch name checker jobs have been triggering this error quite often over the last few days. Here's another example:

Traceback (most recent call last):
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/Scheduler.py", line 440, in _processNameBatch
    variantchecker.check_variant(cmd, O)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/variantchecker.py", line 1743, in check_variant
    retrieved_record = retriever.loadrecord(record_id)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/Retriever.py", line 771, in loadrecord
    filename = self.fetch(identifier)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/Retriever.py", line 413, in fetch
    return self._update_db_md5(raw_data, name, gi)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/mutalyzer/Retriever.py", line 156, in _update_db_md5
    {'checksum': md5sum})
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/orm/query.py", line 3005, in update
    update_op.exec_()
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 1112, in exec_
    self._do_exec()
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 1261, in _do_exec
    mapper=self.mapper)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1034, in execute
    bind, close_with_result=True).execute(clause, params or {})
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 914, in execute
    return meth(self, multiparams, params)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 323, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1010, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1146, in _execute_context
    context)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1341, in _handle_dbapi_exception
    exc_info
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 199, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1139, in _execute_context
    context)
  File "/opt/mutalyzer/versions/35c35b8/virtualenv/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 450, in do_execute
    cursor.execute(statement, parameters)
IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "ix_references_checksum"
DETAIL:  Key (checksum)=(fb263a5e992d38a549882889d14f5912) already exists.
 [SQL: 'UPDATE "references" SET checksum=%(checksum)s WHERE "references".accession = %(accession_1)s'] [parameters: {'checksum': u'fb263a5e992d38a549882889d14f5912', 'accession_1': u'NM_000022.2'}]

I guess these are NM references for which an old version was in the cache. The new version was later uploaded manually (and is now a UD entry). When Mutalyzer then tries to update the NM entry to the new version, the update fails because that checksum already exists.

Indeed, for this example the NM was originally added in 2011, while the UD was added in 2015. Same for the other example. The UD entries don't have a download URL or slice info, so they were uploaded.

martijnvermaat commented 8 years ago

We should update the cache with the new reference that is being downloaded (NM in these examples), but we cannot throw away the other record with the same checksum (UD in these examples).

I think the easiest way to solve this is to drop the unique constraint on the checksum. There are two downsides to this:

  1. Conceptually I'd say we should prevent having duplicate content under different names in the cache.
  2. When a reference is uploaded by file or downloaded by URL (via the reference loader), Mutalyzer returns an existing accession number if it has seen it before. It does this by comparing the checksum. With this change, there can be multiple such existing entries, so you wouldn't necessarily get the same UD if you uploaded this file before.
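
To make the second downside concrete, here is a rough sketch of what a checksum lookup looks like once the index is no longer unique. It uses sqlite3 as a stand-in for the real database. The added column, its dates (only the years follow the example above), the UD accession, and the find_by_checksum helper are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    'CREATE TABLE "references" (accession TEXT PRIMARY KEY, checksum TEXT, added TEXT)'
)
# Plain (non-unique) index: lookups stay fast, but duplicates are now allowed.
db.execute('CREATE INDEX ix_references_checksum ON "references" (checksum)')

CHECKSUM = "fb263a5e992d38a549882889d14f5912"
db.executemany(
    'INSERT INTO "references" VALUES (?, ?, ?)',
    [
        ("UD_000000000001", CHECKSUM, "2015-01-01"),  # hypothetical uploaded copy
        ("NM_000022.2", CHECKSUM, "2011-01-01"),      # NM entry after its checksum update
    ],
)


def find_by_checksum(db, checksum):
    """Return all accessions with this checksum, oldest first (assumed tie-break rule)."""
    rows = db.execute(
        'SELECT accession FROM "references" WHERE checksum = ? ORDER BY added, accession',
        (checksum,),
    ).fetchall()
    return [accession for (accession,) in rows]


matches = find_by_checksum(db, CHECKSUM)
print(matches)
```

The lookup now returns a list instead of at most one record, so the reference loader would need an explicit rule for which match to hand back. Ordering by date is just one option.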

@jfjlaros What do you think?

martijnvermaat commented 8 years ago

GitHub doesn't properly understand English; this should not be closed yet.

martijnvermaat commented 8 years ago

Same issue for M61857.1 as reported in #378. There is already a UD in the database with the same checksum. Mutalyzer doesn't see this (as it first queries by accession) and tries to add this reference, causing an integrity error on checksum uniqueness.
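
The lookup order can be illustrated as follows. This is a hedged sketch with sqlite3 standing in for the real database. The schema, the UD accession, the placeholder checksum, and the find_checksum_conflict helper are all invented for the example. It only detects the conflict and leaves open what to do about it.

```python
import sqlite3


def find_checksum_conflict(db, accession, checksum):
    """Return the accession of another record holding this checksum, or None."""
    row = db.execute(
        'SELECT accession FROM "references" WHERE checksum = ? AND accession != ?',
        (checksum, accession),
    ).fetchone()
    return row[0] if row else None


PLACEHOLDER_CHECKSUM = "a" * 32  # not the real checksum of M61857.1

db = sqlite3.connect(":memory:")
db.execute('CREATE TABLE "references" (accession TEXT PRIMARY KEY, checksum TEXT UNIQUE)')
# Hypothetical UD entry that already holds the same content.
db.execute('INSERT INTO "references" VALUES (?, ?)', ("UD_000000000001", PLACEHOLDER_CHECKSUM))

# Querying by accession finds nothing, so the record looks new ...
by_accession = db.execute(
    'SELECT 1 FROM "references" WHERE accession = ?', ("M61857.1",)
).fetchone()

# ... while a checksum query before the insert reveals the existing UD.
conflict = find_checksum_conflict(db, "M61857.1", PLACEHOLDER_CHECKSUM)
print(by_accession, conflict)
```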