rdmpage / australia

Collaborations in Australia

Wikidata and BHL and ALA #7

Open rdmpage opened 5 years ago

rdmpage commented 5 years ago

Do we add BHL and related bibliographic data to Wikidata? This would enable different identifiers to be cross-linked, and we could add citation data and do author reconciliation (e.g., via Ozymandias).

nicolekearney commented 5 years ago

Scope:

To create a linked database of Australian taxonomists: authors of descriptions of Australian biota – zoology (primarily via the AFD database) across time.

• ORCID + other identifiers
• Publications, with DOIs (where applicable)
• Location
• Birth date
• Gender

Purpose:

To feed this data back into Wikidata (and possibly produce a demo to show Australian taxonomy, on the off-chance “they” might be interested)

• To identify gaps in the data
• To create Wikidata entries for those gaps

Sorts of queries we could ask of this database:

• What’s the uptake of ORCID?
• What’s the uptake of ResearchGate?
• How many taxonomists are on Twitter?
• What’s the average age of Australian taxonomists?
• What proportion of Australian species descriptions are published in Australian publications?
• Are Australian taxonomists publishing in their own institutions’ journals?
• Who publishes in museum & herbaria journals?
• Are overseas taxonomists studying Australian biota publishing in Australian journals?
• What proportion of Australian taxonomy is published in open access journals?
• What fraction of articles are open access?
• What proportion of Australian taxonomists are from museums, herbaria, and universities?
• Can we connect taxonomists/people with the species named after them? (see the sketch below)
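
For the last of these, Wikidata can already be queried directly. A minimal sketch (assuming eponymy is recorded with P138 “named after” on items carrying P225 “taxon name”; coverage of P138 is patchy, so this will undercount):

# Taxa (P225 = taxon name) named after people (P31 = Q5, human)
SELECT ?taxonName ?personLabel WHERE {
  ?taxon wdt:P225 ?taxonName ;
         wdt:P138 ?person .
  ?person wdt:P31 wd:Q5 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100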

rdmpage commented 5 years ago

Potential problems:

I think we could end up having to post GitHub issues for individual authors.

dshorthouse commented 5 years ago

FWIW, I was in touch with Kevin Thiele. He and the Taxonomy Australia group would indeed like to have a dynamic list of taxonomists – provided, of course, that such a list is easily assembled and self-sustaining. Many created ORCID accounts to help us explore what this might look like, with mixed results. Linking publications needs to be done to make this work.

A fair number of Australians, like Michael Shackleton and others, do have specimens (some claimed in Bloodhound, some not). A quick and dirty start is to search ORCID using self-assigned keywords like "taxonomy" and country = "Australia". ORCID does not do demographics, so you'd have to find/store these elsewhere, like Wikidata. If you make Wikidata entries for deceased taxonomists with birth & death dates, and if you think it might help, I can add these to Bloodhound & claim some specimens, made available as JSON-LD. These could help create cyclic graphs of person, paper and type specimens.
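
As a rough sketch of that quick and dirty start (this assumes ORCID's public v3.0 search API and its Solr-style keyword field; as far as I know country is not an indexed search field, so affiliations would need checking record by record):

// Sketch: search the public ORCID API for records self-tagged "taxonomy".
function searchOrcidTaxonomists() {
  var url = 'https://pub.orcid.org/v3.0/search/?q=' +
            encodeURIComponent('keyword:taxonomy') + '&rows=50';
  var response = UrlFetchApp.fetch(url, {'headers': {'Accept': 'application/json'}});
  var hits = JSON.parse(response.getContentText())['result'] || [];
  // Each hit carries just the ORCID iD; full records (affiliations, etc.)
  // need a separate /v3.0/{orcid}/record call per person.
  var ids = hits.map(function (hit) {
    return hit['orcid-identifier']['path'];
  });
  Logger.log(ids);
  return ids;
}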

rdmpage commented 5 years ago

@dshorthouse My thoughts so far are to generate a list of people who have ORCIDs and/or Wikidata entries from Ozymandias publication data (say, from publications in the 21st century) and work through those. This can be automated to some extent via Wikispecies and the ORCID API. Build a list for display purposes, then explore how much linking we can do (e.g., to publications, specimens, and taxa).
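
For the Wikispecies side, a minimal sketch of the kind of starting pool this produces (people with both a Wikispecies page and an ORCID, using the standard WDQS sitelink pattern):

# People (P31 = Q5) with a Wikispecies page and an ORCID (P496)
SELECT ?person ?personLabel ?orcid WHERE {
  ?article schema:about ?person ;
           schema:isPartOf <https://species.wikimedia.org/> .
  ?person wdt:P31 wd:Q5 ;
          wdt:P496 ?orcid .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 100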

rdmpage commented 5 years ago

Example of an author without an ORCID but with a Scholia profile: Colette Cusset.

dshorthouse commented 5 years ago

As a corollary to:

What’s the uptake of ORCID?

...it may be interesting to view this by demographics. I'd expect that uptake of ORCID is minimal among baby boomers because many are now retired; there's no incentive to have and maintain one unless there's a significant value-added service.

rdmpage commented 5 years ago

@dshorthouse @nicolekearney OK, a first stab at this is available as CSV https://github.com/rdmpage/australia/blob/master/wikidata/d.csv and (crude) HTML http://bionames.org/~rpage/australia/d.html . This is a list of authors of taxonomic works post-2010 in AFD, linked to ORCIDs, Wikidata, etc. It was all done automatically, so there will be lots of misses and hidden ORCIDs. Wikidata looks to be the most common identifier (not really a surprise).

dshorthouse commented 5 years ago

Cool! Wonder how many of these I already have.

I found an "O F Anderson" in your list. Wikidata: https://www.wikidata.org/wiki/Q22107501. Note the ZooBank Author ID & links to Wikispecies. Scopus: https://www.scopus.com/authid/detail.uri?authorId=7101972403. And ORCID (eventually): https://orcid.org/0000-0002-2475-9375. I imagine there's some way to close all these...perhaps easiest via ZooBank...but damn, that's messy.
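
One way to see what Wikidata already ties together for this person is to list every external identifier on the item (a generic WDQS pattern):

# All external identifiers attached to Q22107501 (O F Anderson)
SELECT ?propLabel ?value WHERE {
  wd:Q22107501 ?claim ?value .
  ?prop wikibase:directClaim ?claim ;
        wikibase:propertyType wikibase:ExternalId .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}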

rdmpage commented 5 years ago

Actually O F Anderson works (in that Ozymandias finds some identifiers); I just didn't have him properly in the list (there are some issues parsing author names to generate URIs for Ozymandias authors). You can see him here: https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/%23creator/o-f-anderson (for some reason the link between work and taxon is missing, which is why the original table missed this name, as I built a query for just works that published names).

Yes, it's messy, but it's partly because Ozymandias is a small subset of all the data. This has the advantage that authors with the same name are more likely to be the same person, but it means we may lack evidence to link people to identifiers (e.g., the Wikispecies page for that person may lack the papers on Australian animals and so we can't get the link automatically). I might investigate harvesting ZooBank to help resolve some of these cases, but yes it's a mess and potentially a black hole into which I'm not convinced I want to dive too deeply...

rdmpage commented 5 years ago

Part of the challenge is where to store this information, and how to make it accessible. Putting it in Wikidata is obviously a (the?) goal, but how do we get it back out in a useful way? For example, how do we query for identifiers for taxonomists who have published on Australian taxa? One way would be if the publications in Wikidata had the AFD publication ID as a property (it's a UUID), so I've proposed exactly that: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Generic#Australian_Faunal_Directory_Publication_ID

rdmpage commented 5 years ago

Another problematic ORCID example: https://orcid.org/0000-0003-3728-3270 (Ivan L. F. Magalhaes) lists several publications, NONE with a DOI.
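
Such cases could be flagged in bulk via ORCID's public works endpoint. A sketch (the JSON paths follow my reading of the v3.0 schema, so treat them as assumptions to verify):

// Sketch: list works on an ORCID record that carry no DOI.
function worksWithoutDoi(orcid) {
  var url = 'https://pub.orcid.org/v3.0/' + orcid + '/works';
  var response = UrlFetchApp.fetch(url, {'headers': {'Accept': 'application/json'}});
  var groups = JSON.parse(response.getContentText())['group'] || [];
  var missing = [];
  groups.forEach(function (group) {
    var ids = (group['external-ids'] || {})['external-id'] || [];
    var hasDoi = ids.some(function (id) { return id['external-id-type'] === 'doi'; });
    if (!hasDoi) {
      // Title of the first summary in this work group
      missing.push(group['work-summary'][0]['title']['title']['value']);
    }
  });
  Logger.log(missing);
  return missing;
}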

rdmpage commented 5 years ago

The proposal for a publication property has gone from “Under discussion” to “Ready”.

rdmpage commented 5 years ago

Note also that there is now an ALA taxon proposal: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Atlas_of_Living_Australia_ID

rdmpage commented 5 years ago

@nicolekearney And we are live! The AFD publication property has been approved: https://www.wikidata.org/wiki/Wikidata:Property_proposal/Australian_Faunal_Directory_publication_ID

rdmpage commented 5 years ago

To add QuickStatements, go to https://tools.wmflabs.org/quickstatements/index_old.html
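
For reference, the old interface takes tab-separated V1 commands, one statement per line. A hypothetical example (placeholder QID and UUID) adding an AFD publication ID (P6982) to an existing item:

Q12345	P6982	"00000000-0000-0000-0000-000000000000"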

rdmpage commented 5 years ago

Query for AFD publications in Wikidata:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{
  # Works with an AFD publication ID (P6982)
  ?work wdt:P6982 ?afd .
  # ...and their authors as Wikidata items (P50)
  ?work wdt:P50 ?author .
  ?author rdfs:label ?name .

  # Add the author's ORCID (P496) where one exists
  OPTIONAL {
    ?author wdt:P496 ?orcid .
  }

  FILTER (lang(?name) = 'en')
}

https://w.wiki/6CN

rdmpage commented 5 years ago

More detailed query:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{
  ?work wdt:P6982 ?afd .
  {
    # Authors that exist as Wikidata items (P50)
    ?work wdt:P50 ?author .
    ?author rdfs:label ?name .
    FILTER (lang(?name) = 'en')

    OPTIONAL {
      ?author wdt:P496 ?orcid .
    }
  }
  UNION
  {
    # Authors recorded only as name strings (P2093)
    ?work wdt:P2093 ?name .
  }
}

rdmpage commented 5 years ago

Query to find authors of works in AFD:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{
  ?work wdt:P6982 ?afd .
  {
    ?work wdt:P50 ?author .
    ?author rdfs:label ?name .
    FILTER (lang(?name) = 'en')

    OPTIONAL {
      ?author wdt:P496 ?orcid .
    }
  }
  UNION
  {
    ?work wdt:P2093 ?name .
  }
}

https://w.wiki/6EH

rdmpage commented 5 years ago

Query to list ORCIDs for AFD authors:

SELECT DISTINCT ?author ?name (IRI(CONCAT("https://orcid.org/", ?orcid)) AS ?orcid_url) WHERE
{
  ?work wdt:P6982 ?afd .
  ?work wdt:P50 ?author .
  ?author rdfs:label ?name .
  FILTER (lang(?name) = 'en')
  ?author wdt:P496 ?orcid .
}
ORDER BY ?name

https://w.wiki/6EP
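
Picking up @dshorthouse's earlier point about demographics, a sketch of ORCID uptake by birth decade among AFD authors (assuming dates of birth, P569, are populated, which for many of these people they won't be):

SELECT ?decade (COUNT(DISTINCT ?author) AS ?authors) (COUNT(DISTINCT ?orcid) AS ?withOrcid) WHERE
{
  ?work wdt:P6982 ?afd .
  ?work wdt:P50 ?author .
  ?author wdt:P569 ?dob .
  OPTIONAL { ?author wdt:P496 ?orcid . }
  BIND (FLOOR(YEAR(?dob) / 10) * 10 AS ?decade)
}
GROUP BY ?decade
ORDER BY ?decade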

rdmpage commented 5 years ago

Spreadsheet where we are mapping: https://docs.google.com/spreadsheets/d/13vslrX695XIM7U9BZrx4pQWOoXVID2ePR5AlxzoDcGM/edit#gid=775778698

rdmpage commented 5 years ago

ImportJSON.gs

/**
 * Retrieves all the rows in the active spreadsheet that contain data and logs the
 * values for each row.
 * For more information on using the Spreadsheet API, see
 * https://developers.google.com/apps-script/service_spreadsheet
 */
function readRows() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var rows = sheet.getDataRange();
  var numRows = rows.getNumRows();
  var values = rows.getValues();

  for (var i = 0; i < numRows; i++) {
    var row = values[i];
    Logger.log(row);
  }
};

/**
 * Adds a custom menu to the active spreadsheet, containing a single menu item
 * for invoking the readRows() function specified above.
 * The onOpen() function, when defined, is automatically invoked whenever the
 * spreadsheet is opened.
 * For more information on using the Spreadsheet API, see
 * https://developers.google.com/apps-script/service_spreadsheet
 */
function onOpen() {
  var sheet = SpreadsheetApp.getActiveSpreadsheet();
  var entries = [{
    name : "Read Data",
    functionName : "readRows"
  }];
  sheet.addMenu("Script Center Menu", entries);
};

/*====================================================================================================================================*
  ImportJSON by Trevor Lohrbeer (@FastFedora)
  ====================================================================================================================================
  Version:      1.1
  Project Page: http://blog.fastfedora.com/projects/import-json
  Copyright:    (c) 2012 by Trevor Lohrbeer
  License:      GNU General Public License, version 3 (GPL-3.0) 
                http://www.opensource.org/licenses/gpl-3.0.html
  ------------------------------------------------------------------------------------------------------------------------------------
  A library for importing JSON feeds into Google spreadsheets. Functions include:

     ImportJSON            For use by end users to import a JSON feed from a URL 
     ImportJSONAdvanced    For use by script developers to easily extend the functionality of this library

  Future enhancements may include:

   - Support for a real XPath like syntax similar to ImportXML for the query parameter
   - Support for OAuth authenticated APIs

  Or feel free to write these and add on to the library yourself!
  ------------------------------------------------------------------------------------------------------------------------------------
  Changelog:

  1.1    Added support for the noHeaders option
  1.0    Initial release
 *====================================================================================================================================*/
/**
 * Imports a JSON feed and returns the results to be inserted into a Google Spreadsheet. The JSON feed is flattened to create 
 * a two-dimensional array. The first row contains the headers, with each column header indicating the path to that data in 
 * the JSON feed. The remaining rows contain the data. 
 * 
 * By default, data gets transformed so it looks more like a normal data import. Specifically:
 *
 *   - Data from parent JSON elements gets inherited to their child elements, so rows representing child elements contain the values 
 *      of the rows representing their parent elements.
 *   - Values longer than 256 characters get truncated.
 *   - Headers have slashes converted to spaces, common prefixes removed and the resulting text converted to title case. 
 *
 * To change this behavior, pass in one of these values in the options parameter:
 *
 *    noInherit:     Don't inherit values from parent elements
 *    noTruncate:    Don't truncate values
 *    rawHeaders:    Don't prettify headers
 *    noHeaders:     Don't include headers, only the data
 *    debugLocation: Prepend each value with the row & column it belongs in
 *
 * For example:
 *
 *   =ImportJSON("http://gdata.youtube.com/feeds/api/standardfeeds/most_popular?v=2&alt=json", "/feed/entry/title,/feed/entry/content",
 *               "noInherit,noTruncate,rawHeaders")
 * 
 * @param {url} the URL to a public JSON feed
 * @param {query} a comma-separated lists of paths to import. Any path starting with one of these paths gets imported.
 * @param {options} a comma-separated list of options that alter processing of the data
 *
 * @return a two-dimensional array containing the data, with the first row containing headers
 * @customfunction
 **/
function ImportJSON(url, query, options) {
  return ImportJSONAdvanced(url, query, options, includeXPath_, defaultTransform_);
}

/**
 * An advanced version of ImportJSON designed to be easily extended by a script. This version cannot be called from within a 
 * spreadsheet.
 *
 * Imports a JSON feed and returns the results to be inserted into a Google Spreadsheet. The JSON feed is flattened to create 
 * a two-dimensional array. The first row contains the headers, with each column header indicating the path to that data in 
 * the JSON feed. The remaining rows contain the data. 
 *
 * Use the include and transformation functions to determine what to include in the import and how to transform the data after it is
 * imported. 
 *
 * For example:
 *
 *   =ImportJSON("http://gdata.youtube.com/feeds/api/standardfeeds/most_popular?v=2&alt=json", 
 *               "/feed/entry",
 *                function (query, path) { return path.indexOf(query) == 0; },
 *                function (data, row, column) { data[row][column] = data[row][column].toString().substr(0, 100); } )
 *
 * In this example, the import function checks to see if the path to the data being imported starts with the query. The transform 
 * function takes the data and truncates it. For more robust versions of these functions, see the internal code of this library.
 *
 * @param {url}           the URL to a public JSON feed
 * @param {query}         the query passed to the include function
 * @param {options}       a comma-separated list of options that may alter processing of the data
 * @param {includeFunc}   a function with the signature func(query, path, options) that returns true if the data element at the given path
 *                        should be included or false otherwise. 
 * @param {transformFunc} a function with the signature func(data, row, column, options) where data is a 2-dimensional array of the data 
 *                        and row & column are the current row and column being processed. Any return value is ignored. Note that row 0 
 *                        contains the headers for the data, so test for row==0 to process headers only.
 *
 * @return a two-dimensional array containing the data, with the first row containing headers
 **/
function ImportJSONAdvanced(url, query, options, includeFunc, transformFunc) {
  var jsondata = UrlFetchApp.fetch(url, {'headers': { 'accept' : 'application/json'}});
  var object   = JSON.parse(jsondata.getContentText());

  return parseJSONObject_(object, query, options, includeFunc, transformFunc);
}

/** 
 * Encodes the given value to use within a URL.
 *
 * @param {value} the value to be encoded
 * 
 * @return the value encoded using URL percent-encoding
 */
function URLEncode(value) {
  return encodeURIComponent(value.toString());  
}

/** 
 * Parses a JSON object and returns a two-dimensional array containing the data of that object.
 */
function parseJSONObject_(object, query, options, includeFunc, transformFunc) {
  var headers = new Array();
  var data    = new Array();

  if (query && !Array.isArray(query) && query.toString().indexOf(",") != -1) {
    query = query.toString().split(",");
  }

  if (options) {
    options = options.toString().split(",");
  }

  parseData_(headers, data, "", 1, object, query, options, includeFunc);
  parseHeaders_(headers, data);
  transformData_(data, options, transformFunc);

  return hasOption_(options, "noHeaders") ? (data.length > 1 ? data.slice(1) : new Array()) : data;
}

/** 
 * Parses the data contained within the given value and inserts it into the data two-dimensional array starting at the rowIndex. 
 * If the data is to be inserted into a new column, a new header is added to the headers array. The value can be an object, 
 * array or scalar value.
 *
 * If the value is an object, its properties are iterated through and passed back into this function with the name of each 
 * property extending the path. For instance, if the object contains the property "entry" and the path passed in was "/feed",
 * this function is called with the value of the entry property and the path "/feed/entry".
 *
 * If the value is an array containing other arrays or objects, each element in the array is passed into this function with 
 * the rowIndex incremented for each element.
 *
 * If the value is an array containing only scalar values, those values are joined together and inserted into the data array as 
 * a single value.
 *
 * If the value is a scalar, the value is inserted directly into the data array.
 */
function parseData_(headers, data, path, rowIndex, value, query, options, includeFunc) {
  var dataInserted = false;

  if (isObject_(value)) {
    for (var key in value) {
      if (parseData_(headers, data, path + "/" + key, rowIndex, value[key], query, options, includeFunc)) {
        dataInserted = true; 
      }
    }
  } else if (Array.isArray(value) && isObjectArray_(value)) {
    for (var i = 0; i < value.length; i++) {
      if (parseData_(headers, data, path, rowIndex, value[i], query, options, includeFunc)) {
        dataInserted = true;
        rowIndex++;
      }
    }
  } else if (!includeFunc || includeFunc(query, path, options)) {
    // Handle arrays containing only scalar values
    if (Array.isArray(value)) {
      value = value.join(); 
    }

    // Insert new row if one doesn't already exist
    if (!data[rowIndex]) {
      data[rowIndex] = new Array();
    }

    // Add a new header if one doesn't exist
    if (!headers[path] && headers[path] != 0) {
      headers[path] = Object.keys(headers).length;
    }

    // Insert the data
    data[rowIndex][headers[path]] = value;
    dataInserted = true;
  }

  return dataInserted;
}

/** 
 * Parses the headers array and inserts it into the first row of the data array.
 */
function parseHeaders_(headers, data) {
  data[0] = new Array();

  for (var key in headers) {
    data[0][headers[key]] = key;
  }
}

/** 
 * Applies the transform function for each element in the data array, going through each column of each row.
 */
function transformData_(data, options, transformFunc) {
  for (var i = 0; i < data.length; i++) {
    for (var j = 0; j < data[i].length; j++) {
      transformFunc(data, i, j, options);
    }
  }
}

/** 
 * Returns true if the given test value is an object; false otherwise.
 */
function isObject_(test) {
  return Object.prototype.toString.call(test) === '[object Object]';
}

/** 
 * Returns true if the given test value is an array containing at least one object; false otherwise.
 */
function isObjectArray_(test) {
  for (var i = 0; i < test.length; i++) {
    if (isObject_(test[i])) {
      return true; 
    }
  }  

  return false;
}

/** 
 * Returns true if the given query applies to the given path. 
 */
function includeXPath_(query, path, options) {
  if (!query) {
    return true; 
  } else if (Array.isArray(query)) {
    for (var i = 0; i < query.length; i++) {
      if (applyXPathRule_(query[i], path, options)) {
        return true; 
      }
    }  
  } else {
    return applyXPathRule_(query, path, options);
  }

  return false; 
}

/** 
 * Returns true if the rule applies to the given path. 
 */
function applyXPathRule_(rule, path, options) {
  return path.indexOf(rule) == 0; 
}

/** 
 * By default, this function transforms the value at the given row & column so it looks more like a normal data import. Specifically:
 *
 *   - Data from parent JSON elements gets inherited to their child elements, so rows representing child elements contain the values 
 *     of the rows representing their parent elements.
 *   - Values longer than 256 characters get truncated.
 *   - Values in row 0 (headers) have slashes converted to spaces, common prefixes removed and the resulting text converted
 *     to title case.
 *
 * To change this behavior, pass in one of these values in the options parameter:
 *
 *    noInherit:     Don't inherit values from parent elements
 *    noTruncate:    Don't truncate values
 *    rawHeaders:    Don't prettify headers
 *    debugLocation: Prepend each value with the row & column it belongs in
 */
function defaultTransform_(data, row, column, options) {
  if (!data[row][column]) {
    if (row < 2 || hasOption_(options, "noInherit")) {
      data[row][column] = "";
    } else {
      data[row][column] = data[row-1][column];
    }
  } 

  if (!hasOption_(options, "rawHeaders") && row == 0) {
    if (column == 0 && data[row].length > 1) {
      removeCommonPrefixes_(data, row);  
    }

    data[row][column] = toTitleCase_(data[row][column].toString().replace(/[\/\_]/g, " "));
  }

  if (!hasOption_(options, "noTruncate") && data[row][column]) {
    data[row][column] = data[row][column].toString().substr(0, 256);
  }

  if (hasOption_(options, "debugLocation")) {
    data[row][column] = "[" + row + "," + column + "]" + data[row][column];
  }
}

/** 
 * If all the values in the given row share the same prefix, remove that prefix.
 */
function removeCommonPrefixes_(data, row) {
  var matchIndex = data[row][0].length;

  for (var i = 1; i < data[row].length; i++) {
    matchIndex = findEqualityEndpoint_(data[row][i-1], data[row][i], matchIndex);

    if (matchIndex == 0) {
      return;
    }
  }

  for (var i = 0; i < data[row].length; i++) {
    data[row][i] = data[row][i].substring(matchIndex, data[row][i].length);
  }
}

/** 
 * Locates the index where the two strings values stop being equal, stopping automatically at the stopAt index.
 */
function findEqualityEndpoint_(string1, string2, stopAt) {
  if (!string1 || !string2) {
    return -1; 
  }

  var maxEndpoint = Math.min(stopAt, string1.length, string2.length);

  for (var i = 0; i < maxEndpoint; i++) {
    if (string1.charAt(i) != string2.charAt(i)) {
      return i;
    }
  }

  return maxEndpoint;
}

/** 
 * Converts the text to title case.
 */
function toTitleCase_(text) {
  if (text == null) {
    return null;
  }

  return text.replace(/\w\S*/g, function(word) { return word.charAt(0).toUpperCase() + word.substr(1).toLowerCase(); });
}

/** 
 * Returns true if the given set of options contains the given option.
 */
function hasOption_(options, option) {
  return options && options.indexOf(option) >= 0;
}
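
Presumably the script above gets pasted into the spreadsheet's script editor so SPARQL results can be pulled straight into the mapping sheet. A sketch of a cell formula (assuming the SPARQL text sits in cell A1; the endpoint, format parameter, and /results/bindings path follow the standard WDQS SPARQL JSON results layout):

=ImportJSON("https://query.wikidata.org/sparql?format=json&query=" & ENCODEURL(A1), "/results/bindings", "noInherit")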