ioos / ckanext-ioos-theme

IOOS Catalog as a CKAN extension
GNU Affero General Public License v3.0
7 stars 14 forks source link

Records with duplicate names create race conditions and crash harvesters #95

Closed lukecampbell closed 7 years ago

lukecampbell commented 7 years ago

When CKAN creates a package entry for a dataset, it creates a package name based on the title from the XML document. Duplicate names are kind of ok, in that the harvester will query the database to see if the name already exists, if it doesn't then create it, if it does, append a numeral to the end of the number for the next one.

The problem is that if there are concurrent workers, this creates a race condition of multiple workers are processing datasets the same title and it will ultimately lead to a crash of the worker if the right race condition is met:

2016-11-17 13:44:14,654 DEBUG [ckanext.spatial.harvesters.base] No spatial extent defined for this object
2016-11-17 13:44:14,655 INFO  [ckanext.ioos_theme.harvesters.ioos_harvester] Checking for responsible-organisation

Traceback (most recent call last):
  File "/usr/lib/ckan/default/bin/paster", line 11, in <module>
    sys.exit(run())
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 102, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 141, in invoke
    exit_code = runner.run(args)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/paste/script/command.py", line 236, in run
    result = self.command()
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/commands/harvester.py", line 179, in command
    fetch_callback(consumer, method, header, body)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 386, in fetch_callback
    fetch_and_import_stages(harvester, obj)
  File "/usr/lib/ckan/default/src/ckanext-harvest/ckanext/harvest/queue.py", line 403, in fetch_and_import_stages
    success_import = harvester.import_stage(obj)
  File "/usr/lib/ckan/default/src/ckanext-spatial/ckanext/spatial/harvesters/base.py", line 607, in import_stage
    package_id = p.toolkit.get_action('package_create')(context, package_dict)
  File "/usr/lib/ckan/default/src/ckan/ckan/logic/__init__.py", line 423, in wrapped
    result = _action(context, data_dict, **kw)
  File "/usr/lib/ckan/default/src/ckan/ckan/logic/action/create.py", line 185, in package_create
    pkg = model_save.package_dict_save(data, context)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/dictization/model_save.py", line 303, in package_dict_save
    package_membership_list_save(pkg_dict.get("groups"), pkg, context)
  File "/usr/lib/ckan/default/src/ckan/ckan/lib/dictization/model_save.py", line 225, in package_membership_list_save
    model.Session.flush()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/scoping.py", line 149, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1907, in flush
    self._flush(objects)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 2025, in _flush
    transaction.rollback(_capture_exception=True)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/util/langhelpers.py", line 57, in __exit__
    compat.reraise(exc_type, exc_value, exc_tb)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/session.py", line 1989, in _flush
    flush_context.execute()
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 371, in execute
    rec.execute(self)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/unitofwork.py", line 524, in execute
    uow
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 64, in save_obj
    mapper, table, insert)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/orm/persistence.py", line 568, in _emit_insert_statements
    execute(statement, multiparams)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 727, in execute
    return meth(self, multiparams, params)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/sql/elements.py", line 322, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 824, in _execute_clauseelement
    compiled_sql, distilled_params
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 954, in _execute_context
    context)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 1116, in _handle_dbapi_exception
    exc_info
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/util/compat.py", line 189, in raise_from_cause
    reraise(type(exception), exception, tb=exc_tb)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/base.py", line 947, in _execute_context
    context)
  File "/usr/lib/ckan/default/local/lib/python2.7/site-packages/sqlalchemy/engine/default.py", line 435, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.IntegrityError: (IntegrityError) duplicate key value violates unique constraint "package_name_key"
DETAIL:  Key (name)=(us-national-weather-service-nwstgwmc-forecast-products2) already exists.
 'INSERT INTO package (id, name, title, version, url, author, author_email, maintainer, maintainer_email, notes, license_id, type, owner_org, creator_user_id, metadata_created, metadata_modified, private, state, revision_id) VALUES (%(id)s, %(name)s, %(title)s, %(version)s, %(url)s, %(author)s, %(author_email)s, %(maintainer)s, %(maintainer_email)s, %(notes)s, %(license_id)s, %(type)s, %(owner_org)s, %(creator_user_id)s, %(metadata_created)s, %(metadata_modified)s, %(private)s, %(state)s, %(revision_id)s)' {'maintainer': None, 'name': u'us-national-weather-service-nwstgwmc-forecast-products2', 'metadata_modified': datetime.datetime(2016, 11, 17, 13, 44, 14, 714079), 'title': u'US National Weather Service - NWSTG(WMC) Forecast products', 'url': None, 'notes': u'', 'owner_org': u'2ff3237c-72e4-4929-8dbc-ca00bc2f6b6f', 'private': False, 'maintainer_email': None, 'author_email': None, 'state': u'active', 'version': None, 'author': None, 'creator_user_id': u'f4626c47-1bac-47ed-8b27-3010da7dd7ab', 'license_id': None, 'revision_id': u'38ca87f9-08fb-402f-b569-dd7ef9674649', 'type': u'dataset', 'id': u'f90f3a70-c322-4ae5-bcc4-b88866ca2776', 'metadata_created': datetime.datetime(2016, 11, 17, 13, 44, 14, 714069)}

We need to find a way to ensure titles are unique before the harvesters attempt to create the record.

benjwadams commented 7 years ago

Shouldn't some sort of UUID be enforcing the uniqueness, rather than the title?

lukecampbell commented 7 years ago

I'm not the designer of this software, I don't know what assumptions or why it was done this way. But the issue is that it fails and prevents us from running concurrent harvesters.

benjwadams commented 7 years ago

https://github.com/ckan/ckanext-harvest/blob/master/ckanext/harvest/harvesters/base.py#L59-L84

benjwadams commented 7 years ago

I can't think of an easy way to do this without making some architecture changes, i.e. pushing the name key up to Redis prior to the update. There is, however, a unique key on the name column of the "packages" table.

lukecampbell commented 7 years ago

what about catching the exception and if it's not unique, adding an integer to the title?

Imagine two datasets with the title "Fake Data"

The title key becomes fake-data and when the unique key constraint fails, we would change it to fake-data-1 and try again?

benjwadams commented 7 years ago

There's actually UUID/hex appending logic in the harvesters which should in theory substantially reduce the number of collisions over sequentially incrementing a number (16^5 random vs 999 sequential). I'm going to opt for that path for the time being, but I'd like to also handle the case where a failure does occur by catching the exception and at least logging it.