mutalyzer / mutalyzer2

HGVS variant nomenclature checker
https://mutalyzer.nl

Fix for unexpected batch processor crash #426

Closed mihailefter closed 7 years ago

mihailefter commented 7 years ago

Problem

It looks like the batch processor crashes in the __alterBatchEntries function. Consider an input file with the following contents:

While processing the first job (which completes without crashing), Mutalyzer fetches the most recent version of NM_024690, which is NM_024690.2. It then tries to update every other entry in the batch_queue_items database table that references NM_024690 without a version number, rewriting it to the most recent version. This is done to speed up the batch process when those jobs are reached. In this case it tries to update the second job. The value stored in the item column of batch_queue_items for the second job is 200 characters long, and adding the version suffix .2 makes it 202 characters. Since this exceeds the maximum allowed length, the query fails with an error:

(psycopg2.DataError) value too long for type character varying(200)

It seems that an input line is automatically truncated to 200 characters when it is added to the database, so no error appears at that point, but the truncation is not applied during the replace operation.
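
The length arithmetic can be reproduced outside the database. The following is only a sketch in Python: the helper names and the over-long input line are hypothetical (only the accession and the 200-character limit come from the report above), and it does not reflect the actual code of __alterBatchEntries.

```python
MAX_ITEM_LENGTH = 200  # width of the item column, taken from the error message


def truncate_on_insert(line, limit=MAX_ITEM_LENGTH):
    """Mimic the truncation that seems to happen when a line is first stored."""
    return line[:limit]


def pin_version(item, accession, version):
    """Naive replacement as described above: no length check afterwards."""
    return item.replace(accession, "%s.%s" % (accession, version))


# Hypothetical over-long input line; only the accession is from the report.
line = "NM_024690:c." + "A" * 300

stored = truncate_on_insert(line)              # fits the column
updated = pin_version(stored, "NM_024690", 2)  # grows by len(".2") == 2

print(len(stored), len(updated))
```

The stored value fits exactly, and the updated value is two characters too long for a character varying(200) column, which matches the error above.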

Possible solutions

  1. Change the type of the item column in the batch_queue_items table to a variable-length type without a limit. PostgreSQL supports this as type text, but I did not check other SQL database management systems.
  2. Skip entries that are longer than 200 characters and do not run the replacement query on them.

mihailefter commented 7 years ago

We went for the second solution in order not to alter the database. We implemented the following:

mihailefter commented 7 years ago

We also discovered that __alterBatchEntries changes the input when one accession is a substring of another accession that appears later in the file, since the first part of the later one will be replaced.

Input file sequence:

NM_001315:c.723_723delinsAC
NM_001315507:c.1826_1831G

During the processing of NM_001315:c.723_723delinsAC, the entry NM_001315507:c.1826_1831G is changed to NM_001315.2507:c.1826_1831G.
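
One possible way to avoid this is to replace an accession only when it appears as a whole token. The sketch below is an illustration in Python, not the fix that was merged: the function names, the regular expression, and the choice to do the replacement in Python rather than in the SQL query are all assumptions. It also leaves an item unchanged when the result would exceed the 200-character column limit mentioned earlier in this issue.

```python
import re

MAX_ITEM_LENGTH = 200  # column width reported in the psycopg2 error


def naive_pin_version(item, accession, version):
    """Plain substring replacement; reproduces the behaviour described above."""
    return item.replace(accession, "%s.%s" % (accession, version))


def pin_version(item, accession, version, limit=MAX_ITEM_LENGTH):
    """Add the version only to whole, unversioned occurrences of accession.

    Hypothetical helper: the lookbehind rejects matches inside a longer
    identifier, and the lookahead rejects matches followed by more digits
    (a longer accession) or by a dot (already versioned).
    """
    pattern = re.compile(
        r"(?<![A-Za-z0-9_])%s(?![0-9.])" % re.escape(accession)
    )
    updated = pattern.sub(lambda m: "%s.%s" % (accession, version), item)
    if len(updated) > limit:
        return item  # skip entries that would no longer fit the column
    return updated


first = "NM_001315:c.723_723delinsAC"
second = "NM_001315507:c.1826_1831G"

print(naive_pin_version(second, "NM_001315", 2))  # corrupted accession
print(pin_version(first, "NM_001315", 2))         # version added
print(pin_version(second, "NM_001315", 2))        # left untouched
```

With the two input lines above, the naive replacement yields NM_001315.2507:c.1826_1831G, while the token-aware version leaves the second entry unchanged and still updates the first.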