okfn / ckanext-lacounts

CKAN extension for the LA Counts project
GNU Affero General Public License v3.0

Normalize socrata harvested datasets #52

Closed roll closed 5 years ago

roll commented 6 years ago

Overview

Harvested metadata will obviously have a different schema than the LA Counts site one. We need a way to match fields in the remote metadata to the ones expected in our site.

Or in this Socrata dataset from this site, the value for response['results']['classification']['domain_metadata']['key'] == 'Data-Freshness_Time-Period' might be our temporal_extent_start / temporal_extent_end fields (need to check this).

Mapping cheat sheet - https://github.com/okfn/ckanext-lacounts/issues/51#issuecomment-418265966

Tasks

brew commented 6 years ago

Added the two Socrata sources named above. I've added publishers but they may not be exactly the ones wanted for production.

brew commented 6 years ago

The Socrata API returns some domain-specific key/value pairs in the classification property, including a domain_metadata key. The Socrata docs describe it as:

an array of domain metadata objects for public custom metadata

Each item is a key/value pair that can be added as custom metadata specific to the domain (https://socratadiscovery.docs.apiary.io/#reference/0/find-by-domain-specific-metadata).

Each domain has the ability to add custom metadata to datasets beyond Socrata’s default metadata. This custom metadata is different for every domain, but within a domain, all assets may be labeled with the metadata. The custom metadata is a named set of key-value pairs (for example, one domain might have a set named 'Publication Metadata' and have keys 'Publication Date' and 'Publication Cycle', while another domain has a set named 'Agency Ownership' having key 'Department'). The caller may restrict the results to a particular custom metadata pair by specifying the parameter name as a combination of the set's name and the key's name and the parameter value as the key's value. To construct the parameter name join the set's name to the key's name with an underscore and replace all spaces with dashes.
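
The naming rule quoted above (join set name and key name with an underscore, then replace spaces with dashes) can be sketched as follows. This is only an illustration: `domain_metadata_param` is a hypothetical helper name, not part of the Socrata API or of this extension.

```python
def domain_metadata_param(set_name, key_name):
    """Build the query parameter name for a custom metadata pair.

    Follows the rule from the Socrata docs: join the set's name to the
    key's name with an underscore, then replace all spaces with dashes.
    """
    joined = "{}_{}".format(set_name, key_name)
    return joined.replace(" ", "-")


# Example taken from the set/key names mentioned in the Socrata docs.
param = domain_metadata_param("Publication Metadata", "Publication Date")
```

Under this rule, a key such as 'Data-Freshness_Time-Period' presumably comes from a set named 'Data Freshness' with a key named 'Time Period', although that has not been confirmed against the source domain.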

So, we can't rely on every Socrata instance having the same custom metadata, which means there is no single mapping from it to LA Counts data fields.

For now, all domain_metadata items are harvested and added as package extras without further processing.
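
A rough sketch of that pass-through, assuming each harvested result carries `classification.domain_metadata` as a list of `{"key": ..., "value": ...}` objects and that extras use CKAN's usual list-of-dicts shape. The function name is hypothetical and this is not the harvester's actual code.

```python
def domain_metadata_to_extras(result):
    """Copy every domain_metadata item into a CKAN-style extras list.

    `result` is assumed to be one entry of the Socrata API results.
    No field mapping or value parsing is attempted.
    """
    classification = result.get("classification") or {}
    extras = []
    for item in classification.get("domain_metadata") or []:
        key = item.get("key")
        if key:  # skip malformed items without a key
            extras.append({"key": key, "value": item.get("value", "")})
    return extras


# Hypothetical result; the value shown is made up for illustration.
sample_result = {
    "classification": {
        "domain_metadata": [
            {"key": "Data-Freshness_Time-Period", "value": "2017"},
            {"value": "item without a key"},
        ]
    }
}
extras = domain_metadata_to_extras(sample_result)
```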

brew commented 6 years ago

Similarly, the license property is available under the metadata key for Socrata datasets. It's not clear whether the provided values come from a controlled vocabulary. By default in CKAN, the license options for a dataset come from the Open Licenses Service's controlled CKAN licenses group (https://licenses.opendefinition.org/licenses/groups/ckan.json). If Socrata's metadata.license is free text, it may be difficult to map between these two properties.

For now, if a metadata.license is present in the Socrata data, it's added as a package extra.
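
If we later want to attempt a mapping, one possible approach is to try an exact (normalized) title match against the CKAN license list and fall back to the package extra when nothing matches. The sketch below is an assumption throughout: `CKAN_LICENSES` is a small illustrative subset that should be checked against the real list at the URL above, and `apply_license` and the extra key name `license` are hypothetical.

```python
import re

# Illustrative subset only; verify ids and titles against the real list.
CKAN_LICENSES = {
    "cc-by": "Creative Commons Attribution",
    "cc-zero": "Creative Commons CCZero",
}


def _normalize(text):
    """Lowercase and collapse punctuation/whitespace for loose comparison."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def apply_license(package, result):
    """Set license_id on a title match, else keep the raw text as an extra."""
    license_text = (result.get("metadata") or {}).get("license")
    if not license_text:
        return package
    wanted = _normalize(license_text)
    for license_id, title in CKAN_LICENSES.items():
        if _normalize(title) == wanted:
            package["license_id"] = license_id
            return package
    # No match: preserve the free-text value rather than guessing.
    package.setdefault("extras", []).append(
        {"key": "license", "value": license_text}
    )
    return package
```

Exact title matching is deliberately conservative: an unrecognized value is preserved as free text instead of being assigned a possibly wrong license.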

amercader commented 6 years ago

Some queries

amercader commented 6 years ago

Some fixes / tweaks on our side:

roll commented 5 years ago

@amercader Please check if we're good now to close this issue: https://lacounts-staging.l3.ckan.io/dataset/calendar-events