Open mattcen opened 4 years ago
I've added an `arcgis_page` field to appropriate datasets that links to a consistently formatted web page. I haven't yet worked out if there's a consistent way to determine what, if any, license a given dataset has from here; work in progress.
Hmm, interesting approach! I'm a little bit skeptical that all the "CKAN" instances around the world expose exactly the same APIs with the same information in them. My past experience, even just with CKAN instances in Australia, turned up plenty of weird edge cases.
While it's true that various CKAN portals use addons to augment the metadata fields and data structures for their data, the license fields are standard to CKAN, so they are likely to be consistent across installations (leaving the possibilities as either "present" or "missing" rather than "in some other obscure field that we don't know the name of"). I acknowledge, though, that metadata on when data has been updated is sometimes put in other odd fields.
How fussy do you want to be about license names?
Below is the variety of information from all CKAN data sources regarding CC licenses, and I'm not sure how best to represent the nuance in them all in a single "license" field.
license_id | license_title | license_url |
---|---|---|
cc-0 | CC-0 | http://creativecommons.org/publicdomain/zero/1.0/deed.nl |
cc-by | Attribution (BY 4.0) | https://www.donneesquebec.ca/fr/licence/#cc-by |
cc-by | CC-BY | http://creativecommons.org/licenses/by/4.0/deed.nl |
cc-by | Creative Commons Attribution 3.0 Australia | http://creativecommons.org/licenses/by/3.0/au/ |
cc-by | Creative Commons Attribution | http://creativecommons.org/licenses/by/4.0 |
cc-by | Creative Commons Attribution | http://www.opendefinition.org/licenses/cc-by |
cc-by-2.5 | Creative Commons Attribution 2.5 Australia | http://creativecommons.org/licenses/by/2.5/au/ |
CC-BY-4.0 | Creative Commons Attribution 4.0 | https://creativecommons.org/licenses/by/4.0/ |
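
One way to collapse the rows above into a single short code is to ignore `license_id` (which is `cc-by` for several different versions) and read the version and jurisdiction out of `license_url` instead. The following is only a sketch under that assumption: `guess_spdx_id` and its regular expression are hypothetical helpers, not part of the repository, and the ids it produces follow SPDX naming but would still need checking against the actual SPDX list.

```python
import re

# Hypothetical pattern for Creative Commons license URLs such as
# http://creativecommons.org/licenses/by/3.0/au/
_CC_URL = re.compile(
    r"creativecommons\.org/licenses/(?P<kind>[a-z-]+)/(?P<version>\d\.\d)"
    r"(?:/(?P<region>[a-z]{2})(?:/|$))?"
)


def guess_spdx_id(license_url):
    """Return an SPDX-style id guessed from a CC license URL, or None."""
    if not license_url:
        return None
    url = license_url.lower()
    # CC0 lives under /publicdomain/ rather than /licenses/.
    if "creativecommons.org/publicdomain/zero/1.0" in url:
        return "CC0-1.0"
    match = _CC_URL.search(url)
    if match is None:
        # Not a versioned creativecommons.org URL; leave it for manual review.
        return None
    parts = ["CC", match.group("kind").upper(), match.group("version")]
    if match.group("region"):
        parts.append(match.group("region").upper())
    return "-".join(parts)
```

Rows whose URL points elsewhere (the donneesquebec.ca and opendefinition.org entries) return `None`, which matches the cases where no version is stated and a bare `CC-BY` would have to be assigned by hand.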
I wrote this script, which got me 95% of the way to putting the license information into the source JSON file using the CKAN field names. (I ran it through `jq .` afterwards to format it to look like the original file, and needed to hand-hold `santiago` and `koeln`.)

    #!/usr/bin/env python3
    import json
    import ssl
    import urllib.request

    # Don't do SSL verification because it didn't work right away and I
    # can't be bothered debugging it
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE

    with open('../sources-out.json') as json_data:
        d = json.load(json_data)

    for x in d:
        if 'ckan_api' not in x:
            continue
        try:
            # Fetch the dataset's JSON API object from its CKAN endpoint
            with urllib.request.urlopen(x['ckan_api'], context=ctx) as url:
                data = json.loads(url.read().decode())
            if 'result' not in data:
                continue
            result = data['result']
            print(x['id'])
            # Copy the standard CKAN license fields across under the same names
            for field in ('license_id', 'license_title', 'license_url'):
                if field in result:
                    x[field] = result[field]
        except Exception:
            # Network error, invalid JSON, or an unexpected response shape
            print('BROKEN', x['id'])
            continue

    with open('data.json', 'w') as outfile:
        json.dump(d, outfile)

> Below is the variety of information from all CKAN data sources regarding CC licenses, and I'm not sure how best to represent the nuance in them all in a single "license" field.

Yeah, agreed. I've already started using a `licenseUrl` field too, so I'd suggest:

- `licence`: a short code, preferably SPDX
- `licenseUrl`: link to full text
- `licenseName`: a longer name. I'd primarily use this when there just isn't anything that would work as an ID.
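
As a rough illustration of that layout, the sketch below maps one CKAN `result` object onto the three suggested fields. The function name `to_license_fields` and its `spdx_id` parameter are made up for this example; the key names are copied from the suggestion above, and treating `licenseName` strictly as a fallback when no code exists is an assumption about how "primarily" should be read.

```python
def to_license_fields(ckan_result, spdx_id=None):
    """Build the suggested license fields from a CKAN result dict.

    spdx_id is an optional, already-normalised short code (however it was
    obtained); when absent, the raw CKAN license_id is used as the code.
    """
    code = spdx_id or ckan_result.get("license_id")
    fields = {}
    if code:
        fields["licence"] = code
    if ckan_result.get("license_url"):
        fields["licenseUrl"] = ckan_result["license_url"]
    # Assumption: only fall back to the long name when there is no usable code.
    if not code and ckan_result.get("license_title"):
        fields["licenseName"] = ckan_result["license_title"]
    return fields
```

For example, a result carrying only a `license_title` would end up with just `licenseName`, while one with a usable code gets `licence` plus `licenseUrl` when a URL is present.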
Easy. Will make some tweaks.
Done. Have just made up license names for licenses that aren't listed in SPDX. Ref: `OGL-Surrey`, `OGL-Toronto`, `other-open`, and `CC-BY` (where no version is listed).
I had this grand plan to programmatically determine which data sources were from CKAN portals, and then pull license information from the API, but I don't know enough JavaScript to do that. Here's where I got to in my attempts:
In lieu of that, I used a mixture of programmatic and manual methods to find CKAN API endpoints for all the datasets I could, and added a new field to each dataset called `ckan_api`, which can be used to retrieve the JSON API object that should contain the `license_title`, `license_id`, and `license_url`, as well as other information if needed in future.

Perhaps you're able to fairly trivially write a script to populate each dataset's license with the content from its JSON API information? This should partially address #34.
Given the API endpoints, these may also point to the last updated date for each dataset, thereby also potentially addressing #14.
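
A possible starting point for the update date, assuming the `result` object follows the usual CKAN package layout with a `metadata_modified` timestamp on the dataset and an optional `last_modified` on each resource. The helper name `last_updated` is hypothetical, and portals that keep their real update date in a custom field would not be covered by it.

```python
from datetime import datetime


def last_updated(result):
    """Return the newest timestamp found in a CKAN result dict, or None."""
    candidates = [result.get("metadata_modified")]
    for resource in result.get("resources") or []:
        candidates.append(resource.get("last_modified"))

    parsed = []
    for value in candidates:
        if not value:
            continue
        try:
            stamp = datetime.fromisoformat(value)
        except (ValueError, TypeError):
            # Unrecognised date format; skip rather than guess.
            continue
        # Crude: drop any timezone so naive and aware values can be compared.
        parsed.append(stamp.replace(tzinfo=None))
    return max(parsed) if parsed else None
```

Note that `metadata_modified` changes when the catalogue entry is edited, which is not necessarily when the underlying data changed, so the value is best treated as a hint.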
I'm trying to see if I can find something similar for ArcGIS, but I think it's less consistent with its API endpoints. Will see how I go.