jayvdb / era_data

ERA ( Excellence in Research for Australia ) reference data

Load ABS FORs #6

Open jayvdb opened 8 years ago

jayvdb commented 8 years ago

ERA uses the ABS FOR (2008) research classification vocabulary, which is described at https://en.wikipedia.org/wiki/Australian_and_New_Zealand_Standard_Research_Classification, and the origin is http://www.abs.gov.au/ausstats/abs@.nsf/0/6BB427AB9696C225CA2574180004463E

The ERA key documents include a matrix of the ABS FORs, with other associated business process information, which could be useful to load. It was not included in the techpack, but comes as a separate and quite small (45KB) XLS file, such as http://content.webarchive.nla.gov.au/gov/wayback/20140212022156/http://www.arc.gov.au/xls/era12/ERA_2012_Discipline_Matrix.xls

However it would be good to fetch this reference data from the source (ABS), possibly even in a separate repository so it can be used for non-ERA purposes. The source also includes other tightly related official reference data, such as mapping from the 2008 vocabulary to the ABS' 2003 vocabulary, and other vocabulary.

Doing analysis here, both looking for a simple solution to get this functional asap, and collecting notes for the 'right' solution.

jayvdb commented 8 years ago

Callista Research and other systems I have developed had a non-standard XML format for this data, which was used to create JSON and populate databases. The XML was also used to make sense of ERA SEER XML documents using XSLT. It would be nice to use a standardised XML format.

Searching on github for FOR codes and descriptions...

Mark Gregson produced a non-standard XML version http://files.eprints.org/564/

There are a few systems using a CSV file, or similar static files, for the data:
https://github.com/jcu-eresearch/tdh.metadata/blob/master/tdh/metadata/browser/for_codes.csv
https://github.com/IntersectAustralia/acdata/blob/master/config/FOR_CodeList.csv
https://github.com/IntersectAustralia/metadata-aggregator/blob/master/sydma-install/resource/research_subject_code.csv
https://github.com/rrothwell/nectar_visualisation/blob/master/web/data/for_codes_final_2.json
https://github.com/datagovau/ckanext-agls/blob/master/ckanext/agls/ABS%20Fields%20Of%20Research.csv
https://github.com/au-research/ANDS-Registry-Core/blob/master/etc/misc/vocabularies/anzsrc-for/ANZSRC-FOR-EXPORT.csv
https://github.com/rd-switchboard/RD-Switchboard-Net/blob/master/etc/misc/vocabularies/anzsrc-for/ANZSRC-FOR-EXPORT.csv
https://github.com/NeCTAR-RC/nectar-dashboard/blob/832e99b0ea736ee36adb556cdf3e73c9b1c7a340/nectar_dashboard/rcallocation/migrations/0001_initial.py
https://github.com/NeCTAR-RC/nectar-dashboard/blob/832e99b0ea736ee36adb556cdf3e73c9b1c7a340/nectar_dashboard/rcallocation/for_choices.py
https://github.com/NeCTAR-RC/langstroth/blob/master/nectar_allocations/models/forcode.py
https://github.com/IntersectAustralia/dc2c/blob/master/mecat/subject_codes.py
https://github.com/sprinsloo/Research-Flagship/blob/master/build/reporting/fields-research.html / https://github.com/sprinsloo/Research-Flagship/blob/master/source/reporting/fields-research.html.erb
https://github.com/IntersectAustralia/exsite9/blob/master/exsite9/rootfiles/configuration/fieldsOfResearch.sql
https://github.com/IntersectAustralia/ap11_webapp/blob/master/db/create_research_subject_code.sql
https://github.com/anu-doi/DataCommons/blob/master/DataCommons/extras/sql/20120620_create_select_codes_table.sql
https://github.com/CurtinUniversity/Research-Data-Manager/blob/master/Urdms.Dmp/Urdms.Dmp/Database/Migrations/20110906145800_CreateFieldOfResearchList.cs

With only two columns, it isn't possible to record some of the niggly details about FORs, such as when a non-precise code is usable for classification. There were one or two of these, but maybe they can be inferred algorithmically (for example, codes with no child nodes).
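A rough sketch of that inference, assuming a two-column (code, name) CSV and that the hierarchy is encoded purely in the code prefixes (2-digit division, 4-digit group, 6-digit field). The function names and the truncated sample rows are made up for illustration; none of the linked projects is known to do it this way.

```python
import csv
import io


def parent_code(code):
    """Return the parent of a 4- or 6-digit code, or None for a 2-digit division."""
    return code[:-2] if len(code) > 2 else None


def leaf_codes(rows):
    """Return the codes that never appear as the parent of another code."""
    codes = {code for code, _name in rows}
    parents = {parent_code(code) for code in codes}
    return sorted(codes - parents)


# Illustrative sample only: a few rows, not the full vocabulary.
sample = io.StringIO(
    "01,Mathematical Sciences\n"
    "0101,Pure Mathematics\n"
    "010101,Algebra and Number Theory\n"
    "0102,Applied Mathematics\n"
)
rows = [(row[0], row[1]) for row in csv.reader(sample) if row]

# 0102 shows up as a leaf here only because the sample omits its children.
print(leaf_codes(rows))
```

Against the full list, any 2- or 4-digit code reported as a leaf would be a candidate for the "usable as-is" cases, though that would still need checking against the ABS source.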

https://github.com/anzsrco/anzsrco is described as "Unofficial AusNZ Standard Research Classification Ontology", and has two branches: https://github.com/anzsrco/anzsrco/tree/master https://github.com/anzsrco/anzsrco/tree/gh-pages , which is http://anzsrco.github.io/anzsrco/

ANDS has the FORs available as an XML vocab. https://vocabs.ands.org.au/anzsrc-for ANDS is recommending that datasets include FORs in the RIF-CS data. See http://guides.ands.org.au/rda-cpg/describecpas

RIFCS generated from Java .. https://github.com/eresearchrmit/seaports-pacific/blob/master/src/main/java/edu/rmit/eres/seaports/controller/RIFCSController.java

https://github.com/AustralianAntarcticDataCentre/metadata_xml_convert/search?utf8=%E2%9C%93&q=anzsrc contains a subset of the FORs in a GMX XML standard format, used by XSLT. e-atlas and other systems are using this same system.

https://github.com/mlwbarlow/scripts-as-required/blob/master/python/RDACollectionsSubjectsReport.py loads FORs into a database.

https://github.com/dedickinson/forcsv/ is a neat project that loads FORs and SEOs into a Hyper SQL Database. It doesn't use travis-ci or have tests, but it is definitely worth investigating further.
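For comparison, a minimal sketch of loading a two-column FOR CSV into a database. It uses SQLite from the Python standard library rather than Hyper SQL, and the table name, column names and `load_for_codes` helper are assumptions for illustration, not taken from either project above.

```python
import csv
import io
import sqlite3


def load_for_codes(conn, csv_file):
    """Load (code, name) rows into a for_code table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS for_code ("
        "code TEXT PRIMARY KEY, name TEXT NOT NULL, parent TEXT)"
    )
    rows = []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        code, name = row[0], row[1]
        # Assumed convention: the parent is the code minus its last two digits.
        parent = code[:-2] if len(code) > 2 else None
        rows.append((code, name, parent))
    conn.executemany("INSERT OR REPLACE INTO for_code VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


# Illustrative sample only: a few rows, not the full vocabulary.
sample = io.StringIO(
    "01,Mathematical Sciences\n"
    "0101,Pure Mathematics\n"
    "010101,Algebra and Number Theory\n"
)
conn = sqlite3.connect(":memory:")
loaded = load_for_codes(conn, sample)
children = conn.execute(
    "SELECT code FROM for_code WHERE parent = ? ORDER BY code", ("01",)
).fetchall()
```

Storing the derived parent alongside each code keeps the hierarchy queryable without string manipulation in SQL.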

Some other more structured approaches:
https://github.com/uqlibrary/fez/blob/master/.docker/development/backend/db/seed/cvs.sql
https://github.com/IntersectAustralia/ap11_webapp/blob/master/db/data.yml
https://github.com/anzsrco/anzsrco/blob/39496a380a6ee593dcd9250dfad01dd0320f6e67/versions/0.1/for08.n3
https://github.com/gu-eresearch/VIVO/blob/b1783c0f7486f963821bec9123b7998bd02c537b/productMods/WEB-INF/filegraph/tbox/for08.n3

It looks like the best approach is to upgrade the anzsrco repo to meet the ERA needs.