gitonthescene / csv-reconcile

A reconciliation service for OpenRefine serving data from a given CSV file.
MIT License
data-science openrefine

#+OPTIONS: ^:nil

** Quick start

** Poetry

*** Prerequisites

You'll need to have both [[https://python-poetry.org/docs/][poetry]] and [[https://pypi.org/project/poethepoet/0.0.3/][poethepoet]] installed. For publishing to [[https://pypi.org/][PyPI]], [[https://pandoc.org/][pandoc]] is also required.

*** Running

This is packaged with [[https://python-poetry.org/docs/][poetry]], so you can use those commands if you have it installed.

: $ poe install
: $ poetry run csv-reconcile init sample/reps.tsv item itemLabel
: $ poetry run csv-reconcile serve

*** Building

Because this package uses a ~README.org~ file and ~pip~ requires a ~README.md~, there are extra build steps beyond what ~poetry~ supplies. These are managed using [[https://pypi.org/project/poethepoet/0.0.3/][poethepoet]]. Thus building is done as follows:

: $ poe build

If you want to build a platform-agnostic wheel, you'll have to comment out the
~build = "build.py"~ line from ~pyproject.toml~ until ~poetry~ supports [[https://github.com/python-poetry/poetry/issues/3594][selecting build platform]].

** Description

This reconciliation service uses [[https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient][Dice coefficient scoring]] to reconcile values against a given column in a [[https://en.wikipedia.org/wiki/Comma-separated_values][CSV]] file. The CSV file must contain a column containing distinct values to reconcile to. We'll call this the /id column/. We'll call the column being reconciled against the /name column/.

For performance reasons, the /name column/ is preprocessed into normalized values, which are stored in an [[https://www.sqlite.org/index.html][sqlite]] database. This database must be initialized at least once by running the ~init~ sub-command. Once the database is initialized, ~init~ need not be run again for subsequent runs.
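The preprocessing step can be pictured with a short sketch. This is only an illustration under stated assumptions, not the project's code: the table name ~reconcile~, its columns, and the ~normalize~ rule (lower-casing and collapsing whitespace) are all hypothetical.

#+begin_src python
import csv
import io
import sqlite3


def normalize(value):
    """Hypothetical normalization: lower-case and collapse whitespace."""
    return " ".join(value.lower().split())


def init_db(conn, csv_text, id_col, name_col, delimiter=","):
    """Store each row's id, name and normalized name in a (hypothetical) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reconcile "
        "(id TEXT PRIMARY KEY, name TEXT, normalized TEXT)"
    )
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    rows = [(r[id_col], r[name_col], normalize(r[name_col])) for r in reader]
    conn.executemany("INSERT OR REPLACE INTO reconcile VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
sample = "item\titemLabel\nQ1\tJohn  Smith\nQ2\tJane Doe\n"
count = init_db(conn, sample, "item", "itemLabel", delimiter="\t")
#+end_src

Because the normalized values are computed once and stored, later queries only need to normalize the incoming value before comparing.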

Note that the service supplies all its data with a dummy /type/ so there is no reason to reconcile against any particular /type/.

In addition to reconciling against the /name column/, the service also functions as a [[https://reconciliation-api.github.io/specs/latest/#data-extension-service][data extension service]], which offers any of the other columns of the CSV file.

Note that Dice coefficient scoring is agnostic to word ordering.
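As a rough illustration of the scoring idea, here is a minimal sketch of a Dice coefficient over character bigrams. It assumes bigrams are collected word by word, which is one way to make the score independent of word order; the function names are hypothetical and the shipped ~csv_reconcile_dice~ plugin may differ in detail.

#+begin_src python
def bigrams(text):
    """Return the set of character bigrams, taken within each word."""
    grams = set()
    for word in text.lower().split():
        grams.update(word[i:i + 2] for i in range(len(word) - 1))
    return grams


def dice(left, right):
    """Dice coefficient: twice the overlap divided by the total size."""
    a, b = bigrams(left), bigrams(right)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


print(dice("night", "nacht"))            # 0.25 (only "ht" is shared)
print(dice("John Smith", "Smith John"))  # 1.0 (same bigrams, different order)
#+end_src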

** Usage

Basic usage involves two steps:

*** Initialization

Basic usage of the ~init~ sub-command requires passing the name of the CSV file, the /id column/ and the /name column/.

: (venv) $ csv-reconcile --help
: Usage: csv-reconcile [OPTIONS] COMMAND [ARGS]...
:
: Options:
:   --help  Show this message and exit.
:
: Commands:
:   init
:   run
:   serve
: (venv) $ csv-reconcile init --help
: Usage: csv-reconcile init [OPTIONS] CSVFILE IDCOL NAMECOL
:
: Options:
:   --config TEXT  config file
:   --scorer TEXT  scoring plugin to use
:   --help         Show this message and exit.
: (venv) $ poetry run csv-reconcile serve --help
: Usage: csv-reconcile serve [OPTIONS]
:
: Options:
:   --help  Show this message and exit.
: (venv) $

The ~--config~ option is used to point to a configuration file. The file is a [[https://flask.palletsprojects.com/en/1.1.x/config/][Flask configuration]] and hence is Python code though most configuration is simply setting variables to constant values.
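To make the "config is Python code" point concrete, the sketch below mimics how a Flask config file is read: the file is executed and only names written entirely in upper case are kept. The ~load_config~ helper is a hypothetical stand-in written for this example, not part of Flask or of this tool, and the ~port~ variable is likewise only illustrative.

#+begin_src python
def load_config(source):
    """Execute config source and keep only the upper-case names, as Flask does."""
    namespace = {}
    exec(compile(source, "<config>", "exec"), namespace)
    return {key: value for key, value in namespace.items() if key.isupper()}


EXAMPLE = '''
port = 8000  # lower-case helper variable, not kept as configuration
MANIFEST = {
    "extend": {
        "propose_properties": {
            "service_url": "http://localhost:%d" % port,
            "service_path": "/properties",
        }
    }
}
'''

config = load_config(EXAMPLE)
#+end_src

Since the file is ordinary Python, values can be computed, but in most cases plain constant assignments are all that is needed.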

*** Running the service

The simplest way to run the service is to use Flask's built-in web server with the ~serve~ subcommand, which takes no arguments. However, as mentioned in the [[https://flask.palletsprojects.com/en/2.0.x/deploying/][Flask documentation]], this server is not suitable for production purposes.

For a more hardened service, you can use one of the other deployment options mentioned in that
documentation.  For example, gunicorn can be run as follows:

: (venv) $ gunicorn -w 4 'csv_reconcile:create_app()'
: [2021-11-16 17:40:20 +0900] [84625] [INFO] Starting gunicorn 20.1.0
: [2021-11-16 17:40:20 +0900] [84625] [INFO] Listening at: http://127.0.0.1:8000 (84625)
: [2021-11-16 17:40:20 +0900] [84625] [INFO] Using worker: sync
: [2021-11-16 17:40:20 +0900] [84626] [INFO] Booting worker with pid: 84626
: [2021-11-16 17:40:20 +0900] [84627] [INFO] Booting worker with pid: 84627
: [2021-11-16 17:40:20 +0900] [84628] [INFO] Booting worker with pid: 84628
: [2021-11-16 17:40:20 +0900] [84629] [INFO] Booting worker with pid: 84629
: ...

One thing to watch out for is that the default manifest points the extension service to port
5000, the default port for the Flask built-in web server.  If you want to use the extension
service when deploying to a different port, you'll want to be sure to override that part of the
manifest in your config file.  You'll need something like the following:

: MANIFEST = {
:     "extend": {
:         "propose_properties": {
:             "service_url": "http://localhost:8000",
:             "service_path": "/properties"
:         }
:     }
: }

Note also that the configuration is saved during the ~init~ step.  If you change the config,
you'll need to re-run that step.  You may also need to delete and re-add the service in
OpenRefine.
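Once the service is up, it can also be exercised without OpenRefine. The sketch below builds a reconciliation request in the form the Reconciliation Service API describes: a form-encoded ~queries~ parameter holding a JSON object of keyed queries. The ~/reconcile~ path and the helper name are assumptions made for this example; adjust the URL to wherever your service actually listens.

#+begin_src python
import json
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_reconcile_request(names, url="http://localhost:5000/reconcile"):
    """Build (but do not send) a POST request reconciling each name."""
    queries = {f"q{i}": {"query": name} for i, name in enumerate(names)}
    body = urlencode({"queries": json.dumps(queries)}).encode("utf-8")
    return Request(url, data=body, method="POST")


request = build_reconcile_request(["John Smith", "Jane Doe"])
# To send it while the service is running:
#   from urllib.request import urlopen
#   print(json.load(urlopen(request)))
#+end_src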

*** Deprecated

The ~run~ subcommand mimics the old behavior, which combined the initialization step with the running of the service. This may be removed in a future release.

** Common configuration

** Built-in preview service

There is a preview service built into the tool. (Thanks [[https://github.com/b2m][b2m]]!) You can turn it on by adding the following to your manifest:

#+begin_src python
 "preview": {
    "url": "http://localhost:5000/preview/{{id}}",
    "width": 400,
    "height": 300
 }
#+end_src

Note that if you reconcile against a service with the preview service enabled, a link to the service becomes part of the project. Thus if you bring the service down, your project will have hover-over pop-ups pointing to an unavailable service. One way around this is to copy ~recon.match.id~ to a new column, which can be re-reconciled by id if you bring the service back up, whether or not the preview service is enabled. (Perhaps OpenRefine could be smarter about enabling these pop-ups only when the service is active.)

** Scoring plugins

As mentioned above, the default scoring method is [[https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient][Dice coefficient scoring]], but this method can be overridden by implementing a ~csv_reconcile.scorers~ plugin.

*** Implementing

A plugin module may override any of the methods in the ~csv_reconcile.scorers~ module by simply implementing a method of the same name with the decorator ~@csv_reconcile.scorer.register~.

See ~csv_reconcile_dice~ for how Dice coefficient scoring is implemented.
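The override-by-name idea can be sketched in isolation. Everything below is a hypothetical stand-in written for illustration: the ~register~ decorator, the ~_hooks~ table and the ~score_match~ hook are not the package's actual code or hook names, they only show how a decorator can replace a default function with a plugin's function of the same name.

#+begin_src python
_hooks = {}  # hypothetical table of hook name -> current implementation


def register(func):
    """Install func as the implementation of the hook sharing its name."""
    _hooks[func.__name__] = func
    return func


@register
def score_match(left, right):
    """Hypothetical default hook: exact matches only."""
    return 100.0 if left == right else 0.0


def score(left, right):
    """Callers always go through the table, so overrides take effect."""
    return _hooks["score_match"](left, right)


# A "plugin" overrides the hook simply by registering the same name.
@register
def score_match(left, right):
    """Hypothetical plugin hook: case-insensitive matches."""
    return 100.0 if left.lower() == right.lower() else 0.0
#+end_src

After the second registration, ~score("Ab", "aB")~ uses the plugin's version rather than the default.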

The basic hooks are as follows:

*** Installing

Hooks are automatically discovered as long as they provide a ~csv_reconcile.scorers~ [[https://setuptools.readthedocs.io/en/latest/userguide/entry_point.html][setuptools entry point]]. Poetry supplies a [[https://python-poetry.org/docs/pyproject/#plugins][plugins]] configuration which wraps the setuptools functionality.

The default Dice coefficient scoring is supplied via the following snippet from the
~pyproject.toml~ file.

: [tool.poetry.plugins."csv_reconcile.scorers"]
: "dice" = "csv_reconcile_dice"

Here ~dice~ becomes the name of the scoring option and ~csv_reconcile_dice~ is the package
implementing the plugin.
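To see which scorers are visible in the current environment, the entry point group can be inspected with the standard library. This is a small sketch using ~importlib.metadata~; the helper name is hypothetical, and the tool itself may discover plugins by other means.

#+begin_src python
from importlib.metadata import entry_points


def available_scorers(group="csv_reconcile.scorers"):
    """Return the sorted names of entry points registered under group."""
    eps = entry_points()
    if hasattr(eps, "select"):   # Python 3.10 and later
        selected = eps.select(group=group)
    else:                        # Python 3.8 / 3.9 return a plain dict
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


print(available_scorers())
#+end_src

With only the default plugin installed, the list would be expected to contain ~dice~; an empty list means no package providing this entry point group is installed in the environment.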

*** Using

If there is only one scoring plugin available, that plugin is used. If more than one is available, you will be prompted to pass the ~--scorer~ option to select among the scoring options.
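The selection rule can be written down as a small function. This is a sketch of the rule as described, not the tool's code; the function name, the error messages and the scorer name ~geo~ are assumptions made for the example.

#+begin_src python
def choose_scorer(available, requested=None):
    """Pick a scorer name: an explicit request wins, a sole plugin is implied."""
    if requested is not None:
        if requested not in available:
            raise ValueError(f"unknown scorer {requested!r}; choose from {available}")
        return requested
    if len(available) == 1:
        return available[0]
    raise ValueError(f"pass --scorer to choose among {available}")


print(choose_scorer(["dice"]))                # dice
print(choose_scorer(["dice", "geo"], "geo"))  # geo
#+end_src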

*** Known plugins

See the [[https://github.com/gitonthescene/csv-reconcile/wiki][wiki]] for a list of known plugins.

** Testing

Though I long for the old days when a unit test was a unit test, these days things are a bit more complicated with various versions of ~Python~ and installation of plugins to manage. Now we have to wrestle with [[https://docs.python.org/3/tutorial/venv.html][virtual environments]]. ~poetry~ handles the virtual environment for developing, but testing involves covering more options.

*** Tests layout

The tests directory structure is the following:

: tests
:     main
:     plugins
:         geo

Tests for the main package are found under ~main~ and don't require installing any other
packages whereas tests under ~plugins~ require the installation of the given plugin.

*** Running tests

**** Basic tests

These tests are written with [[https://docs.pytest.org/en/6.2.x/contents.html][pytest]] and can be run through ~poetry~ as follows:

 : $ poetry run pytest

 To avoid the complications that come from installing plugins, there is a ~poe~ script for
 running only the tests under main which can be invoked as follows:

 : $ poe test

 For steady state developing this is probably the command you'll use most often.

**** Build matrices

The GitHub Actions for this project currently use a [[https://docs.github.com/en/actions/learn-github-actions/managing-complex-workflows#using-a-build-matrix][build matrix]] across a couple of architectures and several versions of ~Python~, but a similar effect can be achieved using [[https://nox.thea.codes/en/stable/tutorial.html][nox]].

 ~nox~ manages the creation of various virtual environments in what they call "sessions", from
 which various commands can be run.  This project's ~noxfile.py~ manages the installation of the
 ~csv-reconcile-geo~ plugin for the plugin tests as well as running across several versions of
 ~Python~.  See the ~nox~ documentation for detail.

 Some versions of this command you're likely to run are as follows:

 : $ nox      # Run all the tests building virtual environments from scratch
 : $ nox -r   # Reuse previously built virtual environments for speed
 : $ nox -s test_geo  # Run only the tests for the csv-reconcile-geo plugin
 : $ nox -s test_main -p 3.8   # Run only the main tests with Python3.8

 Eventually, the GitHub Actions may be changed to use [[https://github.com/marketplace/actions/setup-nox][setup-nox]].

** Future enhancements

It would be nice to add support for using [[https://reconciliation-api.github.io/specs/latest/#structure-of-a-reconciliation-query][properties]] as part of the scoring, so that more than one column of the CSV file could be taken into consideration.