lexibank / pylexibank

The python curation library for lexibank
Apache License 2.0

pylexibank


pylexibank is a python package providing functionality to curate and aggregate Lexibank datasets.

Compatibility

At the core of the curation functionality provided by pylexibank lies integration with the metadata catalogs Glottolog, Concepticon and CLTS. Not all releases of these catalogs are compatible with all versions of pylexibank.

pylexibank  Glottolog  Concepticon  CLTS
2.x         >=4.x      >=2.x        1.x
3.x         >=4.x      >=2.x        >=2.x
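
The compatibility table can also be checked programmatically. The following is a minimal sketch, not part of pylexibank: the names COMPATIBILITY, major and is_compatible are hypothetical, and it assumes that only the major version number matters and that catalog versions may carry a leading "v".

```python
import re

# Major-version ranges per pylexibank series, transcribed from the table above.
# Each entry is (lowest allowed major, highest allowed major or None for open-ended).
COMPATIBILITY = {
    2: {"glottolog": (4, None), "concepticon": (2, None), "clts": (1, 1)},
    3: {"glottolog": (4, None), "concepticon": (2, None), "clts": (2, None)},
}


def major(version):
    """Return the major version number of strings such as '3.2.1' or 'v4.4'."""
    match = re.match(r"v?(\d+)", version.strip())
    if match is None:
        raise ValueError("cannot parse version: {!r}".format(version))
    return int(match.group(1))


def is_compatible(pylexibank, **catalogs):
    """Check catalog versions (keyword arguments) against a pylexibank version."""
    series = major(pylexibank)
    if series not in COMPATIBILITY:
        raise ValueError("pylexibank {}.x is not covered by the table".format(series))
    for name, version in catalogs.items():
        lowest, highest = COMPATIBILITY[series][name]
        found = major(version)
        if found < lowest or (highest is not None and found > highest):
            return False
    return True
```

For example, is_compatible("3.0.0", clts="v1.4.1") evaluates to False under these assumptions, because pylexibank 3.x requires CLTS 2.x or later.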

Install

Since pylexibank has quite a few dependencies, installing it will result in installing many other python packages along with it. To avoid any side effects for your default python installation, we recommend installation in a virtual environment.
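
A virtual environment can be created with the venv module from the Python standard library, which is what running "python -m venv" does. The sketch below is one possible way to do this from Python; the function name create_env and the target directory are hypothetical.

```python
import venv
from pathlib import Path


def create_env(target, with_pip=True):
    """Create a virtual environment in `target` and return the path of its config file."""
    target = Path(target)
    # EnvBuilder is the standard library API behind `python -m venv`.
    venv.EnvBuilder(with_pip=with_pip).create(target)
    return target / "pyvenv.cfg"
```

Calling create_env("env") creates the environment in a directory named env, which then has to be activated in your shell before running pip.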

Now you may install pylexibank via pip or in development mode following the instructions in CONTRIBUTING.md.

Installing pylexibank will also install cldfbench, which in turn installs a CLI command of the same name, cldfbench. pylexibank functionality is run from the command line as subcommands of this command.

cldfbench is also used to manage reference catalogs, in particular Glottolog, Concepticon and CLTS. Thus, after installing pylexibank you should run

cldfbench catconfig

to make sure the catalog data is locally available and pylexibank knows about it.

Usage

pylexibank can be used in two ways:

The cmd_makecldf method

The main goal of pylexibank is creating high-quality CLDF Wordlists. This happens in the custom cmd_makecldf method of a Lexibank dataset. To make this task easier, pylexibank provides

Programmatic access to Lexibank datasets

While some level of support for reading and writing any CLDF dataset is already provided by the pycldf package, pylexibank (building on cldfbench) adds another layer of abstraction which supports

Installable and pylexibank enabled datasets

Turning a Lexibank dataset into a (pip installable) Python package is as simple as writing a setup script setup.py. But to make the dataset available for curation via pylexibank, the dataset must provide

Turning datasets into pylexibank enabled python packages has multiple advantages:

Conventions

  1. Dataset identifier should be lowercase and either:
    • the database name, if this name is established and well-known (e.g. "abvd", "asjp" etc),
    • <author><languagegroup> (e.g. "grollemundbantu" etc)
  2. Datasets that require preprocessing with external programs (e.g. antiword, libreoffice) should store the intermediate artifacts in the ./raw/ directory, and the cmd_install code should install from these rather than requiring an external dependency.
  3. Declaring a dataset's dependence on pylexibank:
    • specify minimum versions in setup.py, i.e. require pylexibank>=1.x.y
    • specify exact versions in the dataset's cldf-metadata.json using the prov:createdBy property (pylexibank will take care of this when the CLDF is created via lexibank makecldf).
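
The identifier convention in item 1 can be sketched as a pair of small helpers. These are illustrative only and not part of pylexibank: the names make_dataset_id and is_conventional_id are hypothetical, and restricting identifiers to ASCII letters and digits is an assumption that goes beyond the "lowercase" requirement stated above.

```python
import re


def make_dataset_id(author, languagegroup):
    """Build an <author><languagegroup> identifier in lowercase."""
    raw = "{}{}".format(author, languagegroup).lower()
    # Assumption: drop anything that is not an ASCII letter or digit.
    return re.sub(r"[^a-z0-9]", "", raw)


def is_conventional_id(dataset_id):
    """Return True if the identifier is lowercase ASCII alphanumeric."""
    return re.fullmatch(r"[a-z][a-z0-9]*", dataset_id) is not None
```

With these helpers, make_dataset_id("Grollemund", "Bantu") yields "grollemundbantu", and established database names such as "abvd" pass is_conventional_id unchanged.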

Datasets on GitHub

GitHub provides a very good platform for collaborative curation of textual data such as Lexibank datasets.

Dataset curators are encouraged to make use of features in addition to just version control, such as

Note that for datasets curated with pylexibank, summary statistics will be written to README.md as part of the makecldf command.

In addition to the support for collaboratively editing and versioning data, GitHub supports tying into additional services via webhooks. In particular, two of these services are relevant for Lexibank datasets:

Attribution

There are multiple levels of contributions to a Lexibank dataset: