Definition of link data for TogoID
MIT License

TogoID config

This repository defines the update procedures and describes the link data for TogoID.

Link diagram

Link data

Each line holds a pair of database IDs in tab-separated values (TSV) format.

DB1ID1  DB2IDx
DB1ID2  DB2IDy
DB1ID3  DB2IDz
 :
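
As a minimal sketch (not part of this repository), link data in this format could be read in Ruby as shown below. The helper name `read_link_pairs` and the sample IDs are illustrative assumptions, and skipping malformed lines is a design choice of the sketch rather than documented behavior.

```ruby
require "stringio"

# Hypothetical helper: read tab-separated ID pairs from any IO-like object.
# Lines that do not have exactly two non-empty columns are skipped.
def read_link_pairs(io)
  pairs = []
  io.each_line do |line|
    columns = line.chomp.split("\t", -1)
    next unless columns.size == 2 && columns.none?(&:empty?)
    pairs << columns
  end
  pairs
end

# Illustrative data only; real files would be read with File.open.
sample = StringIO.new("DB1ID1\tDB2IDx\nDB1ID2\tDB2IDy\n\nDB1ID3\tDB2IDz\n")
pairs = read_link_pairs(sample)
pairs.each { |source_id, target_id| puts "#{source_id} -> #{target_id}" }
```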

Config

Rakefile

Resolves dependencies among the update procedures and prepares common input files for each source DB.

dataset.yaml

A list of datasets (source and target datasets used in the 1st and 2nd columns of the TSV link data, respectively).

# Dataset name (in snake_case) used by TogoID; it can be a subset of the original database split by category.
ec:
  # Human readable label of the dataset (intended to be used in a Web UI)
  label: Enzyme nomenclature
  # Database identifier provided by the Integbio Database Catalog https://integbio.jp/dbcatalog/
  catalog: nbdc01883
  # Primary category of the database (category must be defined in the TogoID ontology)
  category: Function
  # Regular expression used for automatic detection of the dataset from identifiers given by users.
  # If only a part of the user input should be recognized as an identifier, use a named capture to indicate the part.
  regex: '^(?:EC:)?(?<id>\d+\.(?:(?:-\.-\.-)|\d+\.(?:(?:-\.-)|\d+\.(?:-|n?\d+))))$'
  # URI prefix (intended to be used as a URI prefix in RDF)
  prefix: http://identifiers.org/ec-code/
  # (Optional) ID formats offered as output options (intended to be used in a Web UI)
  format: ["EC:%s"]
  # Example IDs which are accepted by the TogoID service (thus different types of IDs can be included)
  examples:
    - ["1.6.3.1","2.4.1.353","1.1.1.288","1.5.1.2","3.1.1.71","1.3.1.31","3.5.1.29","1.16.1.1","3.1.3.48","2.3.1.138"]
    - ["EC:1.6.3.1","EC:2.4.1.353","EC:1.1.1.288","EC:1.5.1.2","EC:3.1.1.71","EC:1.3.1.31","EC:3.5.1.29","EC:1.16.1.1","EC:3.1.3.48","EC:2.3.1.138"]
  # (Optional) Command to create an id-label tsv file
  method: sparql_csv2tsv.sh -w $TOGOID_ROOT/bin/sparql/ec_label.rq https://rdfportal.org/sib/sparql
hgnc:
  label: HGNC
  catalog: nbdc01774
  category: Gene
  prefix: http://identifiers.org/hgnc/
pubchem_compound:
  label: PubChem compound
  catalog: nbdc00641
  category: Compound
  prefix: 'https://identifiers.org/pubchem.compound/'
pubchem_substance:
  label: PubChem substance
  catalog: nbdc00642
  category: Compound
  prefix: 'https://identifiers.org/pubchem.substance/'
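
To illustrate how the `regex`, `prefix`, and `format` fields could work together, here is a small Ruby sketch using the `ec` values above. The helper `extract_id` and the variable names are hypothetical; how TogoID itself applies these settings is not shown in this document.

```ruby
# Settings copied from the `ec` entry above.
ec = {
  regex:  /^(?:EC:)?(?<id>\d+\.(?:(?:-\.-\.-)|\d+\.(?:(?:-\.-)|\d+\.(?:-|n?\d+))))$/,
  prefix: "http://identifiers.org/ec-code/",
  format: "EC:%s"
}

# Hypothetical helper: return the part captured by the named group `id`,
# or nil when the input does not match the dataset's regex.
def extract_id(input, regex)
  match = regex.match(input.strip)
  match && match[:id]
end

id = extract_id("EC:1.6.3.1", ec[:regex])
puts id                        # bare identifier without the "EC:" prefix
puts ec[:prefix] + id          # URI built from the prefix
puts format(ec[:format], id)   # identifier rendered in the optional output format
```

The named capture lets both `1.6.3.1` and `EC:1.6.3.1` resolve to the same bare identifier, which is why the two `examples` lists above can coexist.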

config.yaml

Update procedure and metadata for the link data of a dataset pair, including the forward and reverse predicates that describe their relation and are used for RDF generation.

# Relation of the pair of database identifiers (e.g., hgnc-ec)
link:
  # Forward link (source to target), predicate must be defined in the TogoID ontology
  forward: TIO_000028
  # Reverse link (target to source)
  reverse: TIO_000029
  # Example file name(s) of link data (only for testing)
  file: sample.tsv

# Metadata for updating link data
update:
  # How often the source data is updated
  frequency: Bimonthly
  # Update procedure of link data (can be a script name or a command line)
  method: sparql_csv2tsv.sh query.rq "http://sparql.med2rdf.org/sparql"

Using terms from Dublin Core's Frequency Vocabulary (DCFreq) is recommended when specifying the update frequency.
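
The structure above can be checked programmatically. The following Ruby sketch loads the example config and reports keys that are absent; the helper `missing_keys` is hypothetical, and which keys are actually mandatory is an assumption of this sketch, not something the repository's own `test` command is known to enforce.

```ruby
require "yaml"

# Illustrative config text mirroring the example above.
config_text = <<~YAML
  link:
    forward: TIO_000028
    reverse: TIO_000029
    file: sample.tsv
  update:
    frequency: Bimonthly
    method: sparql_csv2tsv.sh query.rq "http://sparql.med2rdf.org/sparql"
YAML

# Hypothetical check: list expected keys that are missing from the config.
# The set of expected keys is an assumption made for this sketch.
def missing_keys(config)
  expected = { "link" => %w[forward file], "update" => %w[frequency method] }
  expected.flat_map do |section, keys|
    keys.reject { |key| config.dig(section, key) }.map { |key| "#{section}.#{key}" }
  end
end

config = YAML.safe_load(config_text)
problems = missing_keys(config)
puts problems.empty? ? "config looks complete" : "missing: #{problems.join(', ')}"
```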

Ontology

Dependencies:

The TogoID ontology (TIO) semantically describes the datasets and the relations between datasets in TogoID.

Usage

Rakefile

Dependencies:

To update and convert all files:

% rake >& `date +%F`.log

To update and convert all files in parallel:

% rake -m -j 4

To update all TSV files:

% rake update

To convert all TSV files into Turtle files:

% rake convert

To update a single 'output/tsv/db1-db2.tsv' file:

% rake output/tsv/db1-db2.tsv

To obtain a single 'output/ttl/db1-db2.ttl' file:

% rake output/ttl/db1-db2.ttl
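
Targets such as these map naturally onto Rake file tasks. The sketch below shows one way such dependencies could be declared; it is an assumption about the general approach, and the repository's actual Rakefile may be organized differently. The dependency of the TSV on its `config.yaml` is likewise assumed for illustration.

```ruby
require "rake"
extend Rake::DSL

# Hypothetical pair name; real pairs correspond to directories under config/.
pair = "db1-db2"
tsv  = "output/tsv/#{pair}.tsv"
ttl  = "output/ttl/#{pair}.ttl"

# The TSV is declared to depend on its config file (an assumption), and the
# Turtle file depends on the TSV, so Rake orders the two steps automatically.
file tsv => "config/#{pair}/config.yaml" do |t|
  puts "would update #{t.name}"
end

file ttl => tsv do |t|
  puts "would convert #{t.source} into #{t.name}"
end

# Inspect the declared dependency chain without running any task.
puts Rake::Task[ttl].prerequisites.inspect
puts Rake::Task[tsv].prerequisites.inspect
```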

Rakefile in Docker

Build locally:

$ git clone https://github.com/dbcls/togoid-config
$ cd togoid-config
$ docker build -t togoid:test .
$ docker run -it --rm --user $(id -u):$(id -g) -v $(pwd)/input:/togoid/input -v $(pwd)/output:/togoid/output -w /togoid togoid:test rake -m -j 16 update

Or use a container hosted on the GitHub Container Registry:

$ git clone https://github.com/dbcls/togoid-config
$ cd togoid-config
$ docker run -it --rm --user $(id -u):$(id -g) -v $(pwd)/input:/togoid/input -v $(pwd)/output:/togoid/output -w /togoid ghcr.io/dbcls/togoid:3455a5a rake -m -j 16 update

togoid-config

To test the syntax of the config YAML file:

% ruby bin/togoid-config config/db1-db2 test

To update link data (output/tsv/db1-db2.tsv) from the data source:

% ruby bin/togoid-config config/db1-db2 update

To generate an RDF/Turtle file (output/ttl/db1-db2.ttl) for the given link data:

% ruby bin/togoid-config config/db1-db2 convert
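
Conceptually, the conversion turns each TSV line into a triple that joins the two dataset prefixes with the forward predicate. The Ruby sketch below illustrates that idea only; `link_to_turtle` is a hypothetical helper, every `http://example.org/` IRI is a placeholder (the real TIO namespace is not given in this document), and the actual output of `togoid-config convert` may differ.

```ruby
# Hypothetical converter: one triple per TSV line, written with full IRIs.
def link_to_turtle(line, source_prefix:, target_prefix:, predicate_iri:)
  source_id, target_id = line.chomp.split("\t")
  "<#{source_prefix}#{source_id}> <#{predicate_iri}> <#{target_prefix}#{target_id}> ."
end

triple = link_to_turtle(
  "DB1ID1\tDB2IDx\n",
  source_prefix: "http://example.org/db1/",            # placeholder dataset prefix
  target_prefix: "http://example.org/db2/",            # placeholder dataset prefix
  predicate_iri: "http://example.org/tio#TIO_000028"   # placeholder namespace
)
puts triple
```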

togoid-config-summary

To summarize all config settings:

% ruby bin/togoid-config-summary config/*/config.yaml > config-summary.tsv
% vd config-summary.tsv

To see the database update frequency:

% ruby bin/togoid-config-summary config/*/config.yaml | cut -f1,16

To see the database update method:

% ruby bin/togoid-config-summary config/*/config.yaml | cut -f1,17

togoid-config-summary-dot

Dependencies:

To visualize config relations:

% ruby bin/togoid-config-summary config/*/config.yaml | ruby bin/togoid-config-summary-dot > togoid.dot
% dot -Kdot -Tpng togoid.dot -otogoid.png
% open togoid.png

The --id option adds node identifiers (dataset IDs) and edge predicates to the graph.

% ruby bin/togoid-config-summary config/*/config.yaml | ruby bin/togoid-config-summary-dot --id > togoid.dot
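
For readers unfamiliar with the DOT format that `dot` consumes, the following Ruby sketch builds a tiny graph from a list of dataset pairs. It is a rough illustration under assumptions: `to_dot` and its `show_ids` flag are hypothetical, the edge reuses the hgnc-ec example from config.yaml, and the real output of togoid-config-summary-dot is likely richer.

```ruby
# Hypothetical edge list: [source dataset, target dataset, forward predicate].
edges = [["hgnc", "ec", "TIO_000028"]]

# Build DOT text; with show_ids the predicate is attached as an edge label.
def to_dot(edges, show_ids: false)
  lines = edges.map do |source, target, predicate|
    attrs = show_ids ? %( [label="#{predicate}"]) : ""
    %(  "#{source}" -> "#{target}"#{attrs};)
  end
  (["digraph togoid {"] + lines + ["}"]).join("\n")
end

puts to_dot(edges, show_ids: true)
```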

Also try some other visualization layouts and options:

% dot -Kcirco -Tpng togoid.dot -otogoid.png
% dot -Kfdp -Tpng togoid.dot -otogoid.png

The figure in this repository is generated by the following commands:

% ruby bin/togoid-config-summary config/*/config.yaml > docs/dot/togoid.sum
% ruby bin/togoid-config-summary-dot --id docs/dot/togoid.sum > docs/dot/togoid.dot
% dot -Nshape=box -Nstyle=filled,rounded -Ecolor=gray -Kdot -Tpng docs/dot/togoid.dot -odocs/dot/togoid.png