larkm: A Lightweight ARK Manager

Overview

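larkm is a simple ARK manager that can resolve ARKs to their target URLs, mint ARKs using v4 UUIDs, validate ARK shoulders, persist ARKs to an SQLite database, update and delete ARKs, serve ERC metadata and shoulder-specific commitment statements, log requests to its resolver endpoint, and support full-text search of ARK metadata.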

ARK resolution is provided via requests to larkm's host followed by an ARK (e.g. https://myhost.net/ark:12345/876543) and the other operations are provided through standard REST requests to larkm's management endpoint (/larkm). This REST interface allows creating, persisting, updating, and deleting ARKs, and can expose a subset of larkm's configuration data to clients. Access to the REST endpoints can be controlled by registering the IP addresses of trusted clients, as explained in the "Configuration" section below.

larkm is considered "lightweight" because it supports only a subset of ARK functionality, focusing on providing ways to manage ARKs locally and on using ARKs as persistent, resolvable identifiers. ARK features such as suffix passthrough and ARK qualifiers are currently out of scope.

Requirements

Usage

Creating the database

larkm ships with an empty SQLite database that you can use, larkm_template.db, in the extras directory.

If you want to create your own, run the following commands:

  1. sqlite3 path/to/mydb.db
  2. within sqlite, run CREATE TABLE arks(date_created TEXT NOT NULL, date_modified TEXT NOT NULL, shoulder TEXT NOT NULL, identifier TEXT NOT NULL, ark_string TEXT NOT NULL, target TEXT NOT NULL, erc_who TEXT NOT NULL, erc_what TEXT NOT NULL, erc_when TEXT NOT NULL, erc_where TEXT NOT NULL, policy TEXT NOT NULL);
  3. .quit
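The same table can also be created from Python using the standard library's sqlite3 module. This is a minimal sketch, not part of larkm: the helper name create_arks_db is hypothetical, the column list is copied from the CREATE TABLE statement above, and IF NOT EXISTS is added so the helper can be rerun safely.

```python
import sqlite3

# Column names copied from the CREATE TABLE statement above.
ARKS_COLUMNS = [
    "date_created", "date_modified", "shoulder", "identifier", "ark_string",
    "target", "erc_who", "erc_what", "erc_when", "erc_where", "policy",
]


def create_arks_db(db_path):
    """Create the arks table in the SQLite file at db_path (hypothetical helper)."""
    columns_sql = ", ".join(f"{name} TEXT NOT NULL" for name in ARKS_COLUMNS)
    conn = sqlite3.connect(db_path)
    try:
        # IF NOT EXISTS is an addition so that rerunning the helper is harmless.
        conn.execute(f"CREATE TABLE IF NOT EXISTS arks({columns_sql})")
        conn.commit()
    finally:
        conn.close()
```

Calling create_arks_db("path/to/mydb.db") should leave you with a database equivalent to the one produced by the sqlite3 steps.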

Configuration

larkm uses a JSON configuration file in the same directory as larkm.py named larkm.json. Copy the sample configuration file, larkm.json.sample, to larkm.json, make any changes you need, and you are good to go.

The config settings are:

{
  "default_naan": "99999",
  "allowed_naans": ["11111", "22222", "33333"],
  "default_shoulder": "s1",
  "allowed_shoulders": ["s8", "s9", "x9", "z1"],
  "committment_statement": {
    "s1": "ACME University commits to maintain ARKs that have 's1' as a shoulder for a long time.",
    "s8": "ACME University commits to maintain ARKs that have 's8' as a shoulder until the end of 2025.",
    "default": "Default committment statement."
  },
  "erc_metadata_defaults": {
    "who": ":at",
    "what": ":at",
    "when": ":at"
  },
  "sqlite_db_path": "fixtures/larkmtest.db",
  "log_file_path": "/tmp/larkm.log",
  "resolver_hosts": {
    "global": "https://n2t.net/",
    "local": "https://resolver.myorg.net"
  },
  "whoosh_index_dir_path": "index_dir",
  "trusted_ips": ["142.58.23.213", "142.59.78.175"],
  "api_keys": ["d9771c6c-b9d0-4dc3-8549-e17ddfc12826", "some__--random--string"]
}
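To illustrate how a client or script might read this file, here is a minimal Python sketch. The helper names load_config and commitment_for are hypothetical, and falling back to the "default" entry when a shoulder has no statement of its own is an assumption about how that key is meant to be used, not a description of larkm's code.

```python
import json


def load_config(path="larkm.json"):
    """Read the JSON configuration file into a dict."""
    with open(path, encoding="utf-8") as config_file:
        return json.load(config_file)


def commitment_for(config, shoulder):
    """Return the statement for a shoulder, else the "default" entry (assumed fallback)."""
    # The key is spelled "committment_statement" in the configuration file.
    statements = config["committment_statement"]
    return statements.get(shoulder, statements["default"])
```

With the sample settings above, commitment_for(config, "x9") would return the "default" statement because only s1 and s8 have their own entries.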

Starting larkm

To start the larkm app with the local Uvicorn web server, in a terminal run python3 -m uvicorn larkm:app

Resolving an ARK

Visit http://127.0.0.1:8000/ark:12345/x9062cdde7-f9d6-48bb-be17-bd3b9f441ec4 using curl -Lv. You will see a redirect to https://example.com/foo.

To see the configured metadata and commitment statement for the ARK instead of resolving to its target, append ?info to the end of the ARK, e.g., http://127.0.0.1:8000/ark:12345/x9062cdde7-f9d6-48bb-be17-bd3b9f441ec4?info.

To comply with the ARK specification, the hyphens in the identifier are optional. Therefore, http://127.0.0.1:8000/ark:12345/x9062cdde7-f9d648bbbe17bd3--b9f441ec4 is equivalent to http://127.0.0.1:8000/ark:12345/x9062cdde7-f9d6-48bb-be17-bd3b9f441ec4. Since hyphens are integral parts of UUIDs, larkm restores the hyphens to their expected location within the UUID to perform its lookups during resolution. Hyphens in UUIDs are optional/ignored only when resolving an ARK. They are required for all other operations described below.
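The hyphen handling described above can be sketched in Python as follows. This is an illustration, not larkm's actual code: the function name restore_uuid_hyphens is hypothetical, and it assumes that the part of the ARK after the NAAN is a shoulder (containing no hyphens) followed by a UUID.

```python
import uuid


def restore_uuid_hyphens(name):
    """Return shoulder + canonical UUID for a name whose hyphens may be missing or misplaced."""
    compact = name.replace("-", "")
    if len(compact) < 32:
        raise ValueError("name is too short to contain a UUID")
    # Assumption: the last 32 characters are the UUID's hex digits,
    # and everything before them is the shoulder.
    shoulder, hex_digits = compact[:-32], compact[-32:]
    # uuid.UUID raises ValueError if hex_digits is not valid hexadecimal.
    return shoulder + str(uuid.UUID(hex=hex_digits))
```

For example, restore_uuid_hyphens("x9062cdde7-f9d648bbbe17bd3--b9f441ec4") yields the canonical form x9062cdde7-f9d6-48bb-be17-bd3b9f441ec4.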

Creating a new ARK

REST clients can provide a naan, a shoulder, and/or an identifier value in the request body.

To add a new ARK (for example, to resolve to https://digital.lib.sfu.ca), issue the following request using curl (the configured default NAAN is 12345):

curl -v -X POST "http://127.0.0.1:8000/larkm" -H 'Content-Type: application/json' -d '{"shoulder": "s1", "identifier": "fde97fb3-634b-4232-b63e-e5128647efe7", "target": "https://digital.lib.sfu.ca"}'

If you now visit http://127.0.0.1:8000/ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7, you will be redirected to https://digital.lib.sfu.ca.

If you omit the shoulder, the configured default shoulder will be used. If you omit the identifier, larkm will mint one using a v4 UUID.

If you provide a NAAN, and it is in the configured list of allowed_naans, it will be used instead of the NAAN configured as the default_naan:

curl -v -X POST "http://127.0.0.1:8000/larkm" -H 'Content-Type: application/json' -d '{"naan": "454545", "shoulder": "s1", "identifier": "fde97fb3-634b-4232-b63e-e5128647efe7", "target": "https://digital.lib.sfu.ca"}'

All responses to a POST will include in their body the values provided in the POST request, plus any default values for missing body fields. The where value will be identical to the provided ark_string and cannot be populated on its own. Metadata values not provided will get the ERC ":at" ("the real value is at the given URL or identifier") value:

{"ark":{"shoulder": "s1", "identifier": "fde97fb3-634b-4232-b63e-e5128647efe7", "ark_string":"ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7","target":"https://digital.lib.sfu.ca", "who":":at", "when":":at", "where":"ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7", "what":":at"}, "urls":{"local":"https://resolver.myorg.net/ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7","global":"https://n2t.net/ark:99999/s1fde97fb3-634b-4232-b63e-e5128647efe7"}}

The response also includes the ARK's local and global resolver URLs under the urls key.
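The defaulting rules for new ARKs can be summarized in a short Python sketch. It is illustrative only: build_ark is a hypothetical name, the config dict mirrors the settings shown in the "Configuration" section, and raising ValueError for a NAAN that is not allowed is an assumption, since the behavior in that case is not described here.

```python
import uuid


def build_ark(body, config):
    """Fill in defaults for a new ARK the way the text above describes."""
    naan = body.get("naan", config["default_naan"])
    if naan != config["default_naan"] and naan not in config["allowed_naans"]:
        # Assumption: a NAAN outside the configured lists is rejected.
        raise ValueError(f"NAAN {naan} is not allowed")
    shoulder = body.get("shoulder", config["default_shoulder"])
    # A missing identifier is minted as a v4 UUID.
    identifier = body.get("identifier") or str(uuid.uuid4())
    ark_string = f"ark:{naan}/{shoulder}{identifier}"
    defaults = config["erc_metadata_defaults"]
    return {
        "shoulder": shoulder,
        "identifier": identifier,
        "ark_string": ark_string,
        "target": body["target"],
        "who": body.get("who", defaults["who"]),
        "what": body.get("what", defaults["what"]),
        "when": body.get("when", defaults["when"]),
        # 'where' always mirrors the ARK string and cannot be set by the client.
        "where": ark_string,
    }
```

Shoulder validation is deliberately left out of this sketch to keep it focused on the default values.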

Updating an ARK's properties

You can update an existing ARK's ERC metadata, policy statement, or target. However, an ARK's naan, shoulder, identifier, and ark_string are immutable and cannot be updated. ark_string is the only required body field, and the ARK NAAN, shoulder, and identifier provided in the PUT request URL must match those in the "ark_string" body field. Properties included in the request body will be updated.

Some sample queries:

curl -v -X PUT "http://127.0.0.1:8000/larkm/ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7" -H 'Content-Type: application/json' -d '{"ark_string": "ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7", "target": "https://summit.sfu.ca"}'

curl -v -X PUT "http://127.0.0.1:8000/larkm/ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7" -H 'Content-Type: application/json' -d '{"ark_string": "ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7", "who": "Jordan, Mark", "when": "2020", "policy": "We will maintain this ARK for a long time."}'

Including where in the request body will result in an HTTP 409 response with the message "'where' is automatically assigned the value of the ark string and cannot be updated."
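A client can check an update request against these rules before sending it. The following Python sketch is a hypothetical client-side helper (update_problems is not part of larkm); it only encodes the two rules stated above and makes no claim about the status codes larkm returns for anything other than the 'where' case.

```python
def update_problems(url_ark, body):
    """Return a list of reasons an update body would be refused; empty if none were found."""
    problems = []
    if "ark_string" not in body:
        problems.append("ark_string is required")
    elif body["ark_string"] != url_ark:
        problems.append("ark_string does not match the ARK in the request URL")
    if "where" in body:
        # larkm answers this case with an HTTP 409 response.
        problems.append(
            "'where' is automatically assigned the value of the ark string "
            "and cannot be updated"
        )
    return problems
```

An empty list means the body satisfies the rules described in this section; it does not guarantee that the ARK exists.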

Deleting an ARK

Delete an ARK using a request like:

curl -v -X DELETE "http://127.0.0.1:8000/larkm/ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7"

If the ARK was deleted, larkm returns a 204 No Content response with no body. If the ARK was not found, larkm returns a 404 response with the body {"detail":"ARK not found"}.

Getting larkm's configuration data

curl -v "http://127.0.0.1:8000/larkm/config"

Note that larkm returns only the subset of configuration data that clients need to create new ARKs, specifically the "default_shoulder", "allowed_shoulders", "committment_statement", and "erc_metadata_defaults" configuration data. Only clients whose IP addresses are listed in the trusted_ips configuration option may request configuration data.

Shoulders

Following ARK best practice, larkm requires the use of shoulders in newly added ARKs. Shoulders allowed within your NAAN are defined in the "default_shoulder" and "allowed_shoulders" configuration settings. When a new ARK is added, larkm will validate that the ARK string starts with either the default shoulder or one of the allowed shoulders. Note however that larkm does not validate the format of shoulders.
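The shoulder check amounts to a prefix test. A minimal Python sketch, assuming a config dict shaped like larkm.json (the function name has_allowed_shoulder is hypothetical):

```python
def has_allowed_shoulder(ark_string, config):
    """Check that the name part of an ARK starts with a configured shoulder."""
    # "ark:12345/s1abc" -> "s1abc"; the NAAN is not examined here.
    name = ark_string.split("/", 1)[1]
    shoulders = [config["default_shoulder"], *config["allowed_shoulders"]]
    # str.startswith accepts a tuple of candidate prefixes.
    return name.startswith(tuple(shoulders))
```

Because this is only a prefix test, it says nothing about whether the shoulder itself is well formed, which matches the note above that shoulder format is not validated.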

Metadata support

larkm supports the Electronic Resource Citation (ERC) metadata format expressed in ANVL syntax. Note that larkm accepts the raw values provided by the client and does not validate or format the values against any schema.

target is not an ERC property. It is used internally by larkm to simplify resolution to an HTTP[S] URL.

Searching metadata

larkm supports fulltext indexing of ERC metadata and other ARK properties via the Whoosh indexer. This feature is not intended as a general-purpose, end-user search interface but rather to be used for administrative purposes. Access to the /larkm/search endpoint is restricted to the IP addresses registered in the "trusted_ips" configuration setting.

A simple example search is:

http://127.0.0.1:8000/larkm/search?q=erc_what:water

If the search was successful, larkm returns a 200 HTTP status code. A successful result contains a JSON string with keys "num_results", "page", "page_size", and "arks".

{
    "num_results": 2,
    "page": 1,
    "page_size": 20,
    "arks": [
      {
        "date_created": "2022-06-23 03:00:45",
        "date_modified": "2022-06-23 03:00:45",
        "shoulder": "s1",
        "identifier": "cea8e7f3-1c84-4919-a694-65bc9997d9fe",
        "ark_string": "ark:99999/s1cea8e7f3-1c84-4919-a694-65bc9997d9fe",
        "target": "http://example.com/15",
        "erc_who": "Derex Godfry",
        "erc_what": "5 Ways to Immediately Start Selling Water",
        "erc_when": ":at",
        "erc_where": "ark:99999/s1cea8e7f3-1c84-4919-a694-65bc9997d9fe",
        "policy": "We commit to keeping this ARK actionable until 2030."
      },
      {
        "date_created": "2022-06-23 03:00:45",
        "date_modified": "2022-06-23 03:00:45",
        "shoulder": "s1",
        "identifier": "714b3160-e138-49ed-969a-a514f034274f",
        "ark_string": "ark:99999/s1714b3160-e138-49ed-969a-a514f034274f",
        "target": "http://example.com/16",
        "erc_who": "Toriana Kondo",
        "erc_what": "Water in Crisis: The Coming Shortages",
        "erc_when": ":at",
        "erc_where": "ark:99999/s1714b3160-e138-49ed-969a-a514f034274f",
        "policy": ":at"
      }
    ]
  }

If no results were found, larkm returns a 200 HTTP status code and the same JSON structure, but with a num_results value of 0 and an empty arks list:

{"num_results":0,"page":1,"page_size":"20","arks":[]}

If larkm cannot find the Whoosh index directory (or one is not configured), it returns a 204 (No content).
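A client script might build the search URL and handle these responses along the following lines. This Python sketch is illustrative: search_url and ark_strings are hypothetical helpers, only the q parameter from the example above is used, and the HTTP request itself is left to whichever client library you prefer.

```python
import json
from urllib.parse import urlencode


def search_url(base, query):
    """Build a URL-encoded request to the /larkm/search endpoint."""
    return f"{base.rstrip('/')}/larkm/search?{urlencode({'q': query})}"


def ark_strings(status, body):
    """Extract the ARK strings from a search response."""
    if status == 204:
        # No Whoosh index is available, so there is no body to parse.
        return []
    return [ark["ark_string"] for ark in json.loads(body)["arks"]]
```

Note that urlencode percent-encodes the colon in a fielded query such as erc_what:water.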

The request parameters for the /larkm/search endpoint are:

Searching uses the default Whoosh query language, which supports boolean operators "AND", "OR", and "NOT", phrase searches, and wildcards. Some example queries (not URL-encoded for easy reading) are:

Building the search index

The index is not updated in real time; instead, it is generated using the "index_arks.py" script provided in the "extras" directory, which indexes every row in the larkm sqlite3 database. This script would typically be scheduled using cron but can be run manually. A typical cron entry, here running the indexer every minute, looks like this:

* * * * * /usr/bin/python3 /path/to/larkm/extras/index_arks.py /path/to/larkm/larkm.json

If you run the indexer via cron, make sure the paths in sqlite_db_path and whoosh_index_dir_path configuration settings are absolute.

Using the Names to Things global resolver

If you have a registered NAAN that points to the server running larkm, you can use the Names to Things global ARK resolver's domain redirection feature by replacing the hostname of the server larkm is running on with https://n2t.net/. For example, if the local server larkm is running on is https://ids.myorg.ca, and your institution's NAAN is registered to use that hostname, you can use a local instance of larkm to manage ARKs like https://n2t.net/ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7 (using your NAAN instead of 12345) and they will resolve through your local larkm running on https://ids.myorg.ca to their target URLs.

An advantage of doing this is that if your local resolver needs to be changed from https://ids.myorg.ca/ to another host, assuming you update your NAAN record to use the new host, requests to https://n2t.net/ark:12345/s1fde97fb3-634b-4232-b63e-e5128647efe7 will continue to resolve to their targets.
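Since the local and global forms of an ARK differ only in their host, both can be derived from the same ARK string. A minimal Python sketch, assuming a resolver_hosts dict shaped like the one in larkm.json (the function name resolver_urls is hypothetical):

```python
def resolver_urls(ark_string, resolver_hosts):
    """Map each configured resolver name to the full URL for an ARK."""
    # rstrip("/") lets hosts be written with or without a trailing slash.
    return {
        name: f"{host.rstrip('/')}/{ark_string}"
        for name, host in resolver_hosts.items()
    }
```

With the hosts https://n2t.net/ and https://ids.myorg.ca, the same ARK string produces one URL per resolver, and only the local host needs to change if the server moves.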

API docs

Thanks to OpenAPI, you can see larkm's API docs by visiting http://127.0.0.1:8000/docs#.

Logging

larkm provides basic logging of requests to its resolver endpoint (i.e., /ark:foo/bar). The path to the log is set in the "log_file_path" configuration option. To disable logging, use false as the value of this option. The log is a tab-delimited file containing a datestamp, the client's IP address, the requested ARK string, the corresponding target URL (or "ARK not found" if the requested ARK was not found, or "info" if the request was for the ARK's metadata), and the HTTP referer. If the referer is not available, the value in the TSV entry is "null". Errors and warnings are also logged.
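Because the log is tab-delimited, it can be summarized with Python's csv module. The sketch below is an assumption-laden illustration: the field names in LOG_FIELDS are hypothetical labels for the five columns described above, and lines with fewer columns (for example, error or warning messages) are simply skipped.

```python
import csv
from collections import Counter

# Hypothetical labels for the five tab-delimited columns described above.
LOG_FIELDS = ["datestamp", "client_ip", "ark_string", "target", "referer"]


def count_targets(lines):
    """Count how often each target (or "ARK not found" / "info") appears in the log."""
    reader = csv.DictReader(
        lines, fieldnames=LOG_FIELDS, delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Rows with fewer than four columns get None for 'target' and are ignored.
    return Counter(row["target"] for row in reader if row["target"] is not None)
```

To use it on a real log, pass an open file object, e.g. count_targets(open("/tmp/larkm.log", encoding="utf-8")).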

Scripts

The "extras" directory contains four sample scripts:

  1. a script to test larkm's performance
  2. a script to mint ARKs from a CSV file
  3. a script to mint ARKs from the output of the larkm Integration Drupal module
  4. a script to build the Whoosh search index from entries in the database

Instructions are at the top of each file.

Development

License

MIT