stevage / OpenTrees

Front end for opentrees.org, a data visualisation of millions of publicly maintained trees around the world.

Updates to data source information #36

Open mattcen opened 4 years ago

mattcen commented 4 years ago

I had this grand plan to programmatically determine which data sources were from CKAN portals, and then pull license information from the API, but I don't know enough JavaScript to do that. Here's where I got to in my attempts:

#!/usr/bin/env node
// Prints the CKAN package_show API URL for every data.gov.au source.

const fs = require('fs');

const sources = JSON.parse(fs.readFileSync('sources-out.json', 'utf8'));

for (const source of Object.values(sources)) {
  const dl = source['download'];
  if (/data\.gov\.au/.test(dl)) {
    // The dataset slug follows either /geoserver/ or /dataset/ in the download URL
    let ds;
    if (/\/geoserver\//.test(dl))
      ds = dl.replace(/.*geoserver\//, '').replace(/\/.*/, '');
    else
      ds = dl.replace(/.*dataset\//, '').replace(/\/.*/, '');

    // SET info HERE (in memory only; this script never writes the file back)
    source['info'] = `https://data.gov.au/data/dataset/${ds}`;

    const api = `https://data.gov.au/api/3/action/package_show?id=${ds}`;
    console.log(api);
    /* Then, in a shell
     ./get_ckan_url.js | while read -r ds; do echo "$ds"; curl -s "$ds" | jq .result.license_title; done
   */
  }
}
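
For comparison, the same slug extraction can be sketched in Python. This is a minimal sketch, not the project's code: the function name `ckan_package_url` and the example URL are hypothetical, and it differs from the script above in two ways. It takes the first `/geoserver/` or `/dataset/` segment rather than the last, and it returns `None` instead of a garbage slug when neither marker is present.

```python
import re
from typing import Optional


def ckan_package_url(download_url: str) -> Optional[str]:
    """Return the data.gov.au package_show URL for a download URL, or None.

    Hypothetical helper: assumes the dataset slug is the path segment that
    immediately follows /geoserver/ or /dataset/.
    """
    if "data.gov.au" not in download_url:
        return None
    match = re.search(r"/(?:geoserver|dataset)/([^/?#]+)", download_url)
    if match is None:
        return None
    slug = match.group(1)
    return f"https://data.gov.au/api/3/action/package_show?id={slug}"


if __name__ == "__main__":
    # Made-up URL for illustration only
    print(ckan_package_url("https://data.gov.au/geoserver/example-trees/wfs?request=GetFeature"))
```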

In lieu of that, I used a mixture of programmatic and manual methods to find CKAN API endpoints for all the datasets I could, and added a new field to each dataset called "ckan_api", which can be used to retrieve the JSON API object that should contain the license_title, license_id, and license_url, as well as other information if needed in future.

Perhaps you're able to fairly trivially write a script to populate each dataset's license with the content from its JSON API information? This should partially address #34.

The API responses may also include each dataset's last-updated date, which could address #14 as well.
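
A rough sketch of how a last-updated date might be read from an already-parsed `package_show` response. It assumes the standard CKAN fields `metadata_modified` on the package and `last_modified` on each resource, and that all timestamps share the same ISO 8601 format so they sort correctly as strings. The function name `last_updated` is hypothetical, and portals that keep update dates in custom fields would need extra handling.

```python
from typing import Optional


def last_updated(package_show_response: dict) -> Optional[str]:
    """Return the most recent timestamp found in a package_show response.

    Assumes same-format ISO 8601 strings, so string comparison orders them.
    """
    result = package_show_response.get("result") or {}
    candidates = [result.get("metadata_modified")]
    for resource in result.get("resources") or []:
        candidates.append(resource.get("last_modified"))
    # Drop missing or empty values before comparing
    timestamps = [c for c in candidates if isinstance(c, str) and c]
    return max(timestamps) if timestamps else None


if __name__ == "__main__":
    # Made-up response for illustration only
    sample = {
        "result": {
            "metadata_modified": "2020-01-02T00:00:00",
            "resources": [{"last_modified": "2020-03-04T05:06:07"}, {"last_modified": None}],
        }
    }
    print(last_updated(sample))
```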

I'm trying to see if I can find something similar for ArcGIS, but I think it's less consistent with its API endpoints. Will see how I go.

mattcen commented 4 years ago

I've added an arcgis_page field to appropriate datasets that links to a consistently formatted web page. I haven't yet worked out if there's a consistent way to determine what, if any, license a given dataset has from here; work in progress.

stevage commented 4 years ago

Hmm, interesting approach! I'm a little bit skeptical that all the "CKAN" instances around the world expose exactly the same APIs with the same information in them. My past experience, even just with CKAN instances in Australia, turned up enough weird edge cases.

mattcen commented 4 years ago

While it's true that various CKAN portals use addons to augment the metadata fields and data structures for their data, the license fields are standard to CKAN, so they are likely to be consistent across installations (leaving the possibilities as either "present" or "missing" rather than "in some other obscure field that we don't know the name of"). I acknowledge, though, that metadata on when data was last updated is sometimes put in other odd fields.

mattcen commented 4 years ago

How fussy do you want to be about license names?

Below is the variety of information from all CKAN data sources regarding CC licenses, and I'm not sure how best to represent the nuance in them all in a single "license" field.

| license_id | license_title | license_url |
| --- | --- | --- |
| cc-0 | CC-0 | http://creativecommons.org/publicdomain/zero/1.0/deed.nl |
| cc-by | Attribution (BY 4.0) | https://www.donneesquebec.ca/fr/licence/#cc-by |
| cc-by | CC-BY | http://creativecommons.org/licenses/by/4.0/deed.nl |
| cc-by | Creative Commons Attribution 3.0 Australia | http://creativecommons.org/licenses/by/3.0/au/ |
| cc-by | Creative Commons Attribution | http://creativecommons.org/licenses/by/4.0 |
| cc-by | Creative Commons Attribution | http://www.opendefinition.org/licenses/cc-by |
| cc-by-2.5 | Creative Commons Attribution 2.5 Australia | http://creativecommons.org/licenses/by/2.5/au/ |
| CC-BY-4.0 | Creative Commons Attribution 4.0 | https://creativecommons.org/licenses/by/4.0/ |

I wrote this script, which got me 95% of the way to putting the license information into the source JSON file using the CKAN field names. (I ran it through jq . afterwards to format it to look like the original file, and needed to hand-hold santiago and koeln.)

#!/usr/bin/env python3

import json
import ssl
import urllib.request

# Don't do SSL verification because it didn't work right away and I can't be bothered debugging it
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

LICENSE_FIELDS = ('license_id', 'license_title', 'license_url')

with open('../sources-out.json') as json_data:
    sources = json.load(json_data)

for source in sources:
    if 'ckan_api' not in source:
        continue
    try:
        with urllib.request.urlopen(source['ckan_api'], context=ctx) as response:
            data = json.loads(response.read().decode())
    except (OSError, ValueError) as err:
        # Network/HTTP failure, or a response that isn't valid JSON
        print('BROKEN', source['id'], err)
        continue
    result = data.get('result') if isinstance(data, dict) else None
    if not isinstance(result, dict):
        continue
    print(source['id'])
    # Copy across whichever of the standard CKAN license fields are present
    for field in LICENSE_FIELDS:
        if field in result:
            source[field] = result[field]

with open('data.json', 'w') as outfile:
    json.dump(sources, outfile)

stevage commented 4 years ago

> Below is the variety of information from all CKAN data sources regarding CC licenses, and I'm not sure how best to represent the nuance in them all in a single "license" field.

Yeah, agreed. I've already started using a licenseUrl field too so, I'd suggest:

licence: a short code, preferably SPDX
licenseUrl: link to full text
licenseName: a longer name. I'd primarily use this when there just isn't anything that would work as an ID.
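
One way this mapping could look, as a sketch only: derive an SPDX-style code from a Creative Commons URL where the URL carries a version, and otherwise fall back to the portal's own `license_id` plus the long name. The function name `normalise_licence` is hypothetical, the jurisdiction-suffixed codes it builds (such as `CC-BY-3.0-AU`) should be checked against the SPDX list before use, and anything that falls through to the fallback still needs a human to pick or invent an ID.

```python
import re
from typing import Optional

# Matches e.g. creativecommons.org/licenses/by/3.0/au/ (two-letter region is optional)
_CC_URL = re.compile(
    r"creativecommons\.org/licenses/(?P<kind>[a-z-]+)/(?P<version>\d\.\d)"
    r"(?:/(?P<region>[a-z]{2})(?=/|$))?"
)


def normalise_licence(license_id: Optional[str],
                      license_title: Optional[str],
                      license_url: Optional[str]) -> dict:
    """Map CKAN license fields to licence / licenseUrl / licenseName."""
    url = license_url or ""
    code = None
    if "creativecommons.org/publicdomain/zero/1.0" in url:
        code = "CC0-1.0"
    else:
        match = _CC_URL.search(url)
        if match:
            code = f"CC-{match.group('kind').upper()}-{match.group('version')}"
            if match.group("region"):
                code += f"-{match.group('region').upper()}"

    out = {}
    if code:
        out["licence"] = code
    else:
        # No version recoverable from the URL: keep the portal's id and long name
        if license_id:
            out["licence"] = license_id
        if license_title:
            out["licenseName"] = license_title
    if license_url:
        out["licenseUrl"] = license_url
    return out


if __name__ == "__main__":
    print(normalise_licence("cc-by", "Creative Commons Attribution 3.0 Australia",
                            "http://creativecommons.org/licenses/by/3.0/au/"))
```

Keeping `licenseName` only in the fallback case follows the suggestion that the long name is mainly for sources with no usable ID; always storing it would be an equally reasonable choice.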

mattcen commented 4 years ago

Easy. Will make some tweaks.

mattcen commented 4 years ago

Done. Have just made up license names for licenses that aren't listed in SPDX. Ref: OGL-Surrey, OGL-Toronto, other-open, and CC-BY (where no version is listed).