stadt-karlsruhe / ckanext-extractor

A full text and metadata extractor for CKAN
GNU Affero General Public License v3.0
19 stars 16 forks source link

How do we handle resources uploaded via datastore_create? #13

Open mystycs opened 6 years ago

mystycs commented 6 years ago

How do we handle resources that are uploaded via API to datastore_create? It just says the following error below that it cant find the URL:


2018-09-07 20:15:41,106 INFO  [rq.worker] ckan:default:default: ckanext.extractor.tasks.extract('/etc/ckan/default/production.ini', {u'cache_last_updated': None, u'package_id': u'045626ce-96c5-4eb5-a248-d3b1e5a9eb2f', u'datastore_active': True, u'id': u'ce1b08ed-51e2-4ec8-9eb9-1bb0bec237e9', u'size': None, u'restricted': u'{"allowed_users": "", "level": "public"}', u'state': u'active', u'hash': u'', u'description': u'', u'format': u'data dictionary', u'mimetype_inner': None, u'url_type': None, u'mimetype': None, u'cache_url': None, u'name': u'rees', u'created': '2018-09-07T20:14:21.774267', u'url': u'', u'last_modified': None, u'position': 7, u'revision_id': u'11d499ed-19f0-491c-a2bb-482b74c3cdca', u'tag_string_resource': u'', u'resource_type': u''}) (a92631fc-7197-416e-bd0e-1df1b7d5e421)
2018-09-07 20:15:41,109 INFO  [ckan.lib.jobs] Worker rq:worker:MECALDDMPCKN01.19289 starts job a92631fc-7197-416e-bd0e-1df1b7d5e421 from queue "default"
2018-09-07 20:15:44,209 DEBUG [ckanext.extractor.model] Resource metadata table already defined
2018-09-07 20:15:44,209 DEBUG [ckanext.extractor.model] Resource metadatum table already defined
2018-09-07 20:15:47,618 DEBUG [ckanext.extractor.model] Resource metadata table already defined
2018-09-07 20:15:47,618 DEBUG [ckanext.extractor.model] Resource metadatum table already defined
2018-09-07 20:15:49,236 WARNI [ckanext.extractor.tasks] Failed to download resource data from "": Invalid URL '': No schema supplied. Perhaps you meant http://?
2018-09-07 20:15:49,306 DEBUG [ckanext.extractor.logic.action] extractor_show 53fc14dd-3ffb-4407-88ea-a66feeef87e0
2018-09-07 20:15:49,314 DEBUG [ckanext.extractor.logic.action] extractor_show 990d609a-1231-4ed8-8d95-a2e53311cf6d
2018-09-07 20:15:49,330 DEBUG [ckanext.extractor.logic.action] extractor_show 9de52d49-2e53-45fb-b5be-c287b10d3cb3
2018-09-07 20:15:49,339 DEBUG [ckanext.extractor.logic.action] extractor_show 849c6583-f795-4de6-89b8-b5130fb1e3e9
2018-09-07 20:15:49,346 DEBUG [ckanext.extractor.logic.action] extractor_show 3e67f293-0cbb-417e-949c-7ceb7116829a
2018-09-07 20:15:49,354 DEBUG [ckanext.extractor.logic.action] extractor_show 45b35abb-4f05-4fed-85a5-8b198e58e578
2018-09-07 20:15:49,361 DEBUG [ckanext.extractor.logic.action] extractor_show 207db850-c5e2-4508-887b-4107d8a7684a
2018-09-07 20:15:49,368 DEBUG [ckanext.extractor.logic.action] extractor_show ce1b08ed-51e2-4ec8-9eb9-1bb0bec237e9
torfsen commented 6 years ago

Thanks for your report, @mystycs! As you've just found out, there is currently no support for resources uploaded via datastore_create.

At the very least, that missing support should be documented and the log messages should be more informative.

Obviously, actual support for datastore_create would be even better. Can you tell me what kind of data you're uploading via datastore_create and what kind of metadata you would like to extract from it? That would help me to understand the required functionality better.

torfsen commented 5 years ago

@mystycs I think support for datastore_create would be nice to have, so I'll re-open the ticket to keep it until that support is implemented (or the missing support is documented).

torfsen commented 5 years ago

It seems that as of CKAN 2.8, reliably getting notified about data changes in the DataStore is impossible (see ckan/ckan#4558).

We could still check for the datastore_active flag and for url_type == datastore and fail with an appropriate error message.