openzim / wp1

Wikipedia 1.0 engine & selection tools
https://wp1.openzim.org
GNU General Public License v2.0

Update error, duplicate key issue #101

Open audiodude opened 5 years ago

audiodude commented 5 years ago

When updating project 'Football', we got the following stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 822, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.7/site-packages/rq/job.py", line 605, in perform
    self._result = self._execute()
  File "/usr/local/lib/python3.7/site-packages/rq/job.py", line 611, in _execute
    return self.func(*self.args, **self.kwargs)
  File "./wp1/logic/project.py", line 68, in update_project_by_name
    update_project(wikidb, wp10db, project)
  File "./wp1/logic/project.py", line 468, in update_project
    extra_assessments['extra'])
  File "./wp1/logic/project.py", line 247, in update_project_assessments
    process_unseen_articles(wikidb, wp10db, project, old_ratings, seen)
  File "./wp1/logic/project.py", line 357, in process_unseen_articles
    move_data['timestamp_dt'])
  File "./wp1/logic/page.py", line 53, in update_page_moved
    logic_move.insert(wp10db, new_move)
  File "./wp1/logic/move.py", line 34, in insert
    ''', attr.asdict(move))
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 684, in _read_packet
    packet.check_error()
  File "/usr/local/lib/python3.7/site-packages/pymysql/protocol.py", line 220, in check_error
    err.raise_mysql_exception(self._data)
  File "/usr/local/lib/python3.7/site-packages/pymysql/err.py", line 109, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.IntegrityError: (1062, "Duplicate entry '0-Andrei_Ra\\xC5\\xA3iu-2019-08-31T08:58:55Z' for key 'PRIMARY'")

This looks like a move insert that had already been processed. We're supposed to be doing insert_or_update, but there must be a flaw in the logic.
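
For reference, here is a minimal sketch of what an idempotent move insert could look like. It is not the code in `wp1/logic/move.py`. It uses the standard-library `sqlite3` module and `INSERT OR IGNORE` as a stand-in for MySQL (where `INSERT IGNORE` or `INSERT ... ON DUPLICATE KEY UPDATE` would play a similar role). The table name, column names, and function names are hypothetical; the three-part primary key is only inferred from the shape of the duplicate entry in the error message (namespace, article, timestamp).

```python
import sqlite3


def create_moves_table(conn):
    # Hypothetical schema: the key columns mirror the
    # "<namespace>-<article>-<timestamp>" entry in the error message.
    conn.execute('''
        CREATE TABLE moves (
            old_namespace INTEGER NOT NULL,
            old_article   TEXT    NOT NULL,
            timestamp     TEXT    NOT NULL,
            new_namespace INTEGER NOT NULL,
            new_article   TEXT    NOT NULL,
            PRIMARY KEY (old_namespace, old_article, timestamp)
        )
    ''')


def insert_move_idempotent(conn, move):
    """Insert a move row unless a row with the same key already exists.

    Returns True if a row was written, False if it was already present.
    """
    cur = conn.execute('''
        INSERT OR IGNORE INTO moves
            (old_namespace, old_article, timestamp,
             new_namespace, new_article)
        VALUES
            (:old_namespace, :old_article, :timestamp,
             :new_namespace, :new_article)
    ''', move)
    # rowcount is 1 when the row was inserted and 0 when it was ignored.
    return cur.rowcount == 1
```

With something along these lines, processing the same move twice would be a no-op instead of raising an `IntegrityError`. Whether silently ignoring the duplicate is the right behavior for WP1 is a separate question.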

kelson42 commented 5 years ago

@audiodude Still a bug?

audiodude commented 5 years ago

@kelson42 I haven't seen this one happen recently. We could close it as not reproducible, and wait for it to happen again to re-open.

audiodude commented 5 years ago

And of course, as I look in the logs I find the following (Murphy's Law):

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/rq/worker.py", line 822, in perform_job
    rv = job.perform()
  File "/usr/local/lib/python3.7/site-packages/rq/job.py", line 605, in perform
    self._result = self._execute()
  File "/usr/local/lib/python3.7/site-packages/rq/job.py", line 611, in _execute
    return self.func(*self.args, **self.kwargs)
  File "./wp1/logic/project.py", line 75, in update_project_by_name
    update_project(wikidb, wp10db, project)
  File "./wp1/logic/project.py", line 505, in update_project
    update_project_assessments(wikidb, wp10db, project, extra_assessments)
  File "./wp1/logic/project.py", line 269, in update_project_assessments
    process_unseen_articles(wikidb, wp10db, project, old_ratings, seen)
  File "./wp1/logic/project.py", line 395, in process_unseen_articles
    move_data['timestamp_dt'])
  File "./wp1/logic/page.py", line 53, in update_page_moved
    logic_move.insert(wp10db, new_move)
  File "./wp1/logic/move.py", line 34, in insert
    ''', attr.asdict(move))
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/usr/local/lib/python3.7/site-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/usr/local/lib/python3.7/site-packages/pymysql/connections.py", line 684, in _read_packet
    packet.check_error()
  File "/usr/local/lib/python3.7/site-packages/pymysql/protocol.py", line 220, in check_error
    err.raise_mysql_exception(self._data)
  File "/usr/local/lib/python3.7/site-packages/pymysql/err.py", line 109, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.IntegrityError: (1062, "Duplicate entry '0-Vishal\\xE2\\x80\\x93Shekhar-2019-10-18T16:46:32Z' for key 'PRIMARY'")

audiodude commented 5 years ago

It looks like it has something to do with non-ASCII characters
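
To make the two failing keys readable, here is a small sketch that turns the `\xNN` escapes from the error messages back into text, assuming the escaped bytes are UTF-8. The helper name `decode_escaped_key` is made up for illustration and is not part of WP1 or PyMySQL.

```python
import re


def decode_escaped_key(escaped):
    """Turn literal '\\xNN' escapes in an ASCII string into UTF-8 text.

    Assumes the input is otherwise plain ASCII and that the escaped
    bytes form valid UTF-8.
    """
    raw = re.sub(
        rb'\\x([0-9A-Fa-f]{2})',
        lambda m: bytes([int(m.group(1), 16)]),
        escaped.encode('ascii'),
    )
    return raw.decode('utf-8')


# The two article titles from the stack traces above.
first = decode_escaped_key(r'Andrei_Ra\xC5\xA3iu')
second = decode_escaped_key(r'Vishal\xE2\x80\x93Shekhar')

# U+0163 is LATIN SMALL LETTER T WITH CEDILLA, U+2013 is EN DASH.
assert first == 'Andrei_Ra\u0163iu'
assert second == 'Vishal\u2013Shekhar'
```

So both titles contain a multi-byte UTF-8 character: a t with cedilla in the first and an en dash in the second.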

kelson42 commented 4 years ago

I believe this bug is going to be a blocker if we want to consider using the WP1 engine with a few other Wikipedias. The tests should probably be extended to ensure that the WP1 engine can deal properly with accented characters. With MySQL, a few things usually need to be checked: