pylexibank
is a python package providing functionality to curate and aggregate
Lexibank datasets.
At the core of the curation functionality provided by pylexibank lies integration
with the metadata catalogs Glottolog, Concepticon and CLTS.
Not all releases of these catalogs are compatible with all versions of pylexibank.

| pylexibank | Glottolog | Concepticon | CLTS |
|---|---|---|---|
| 2.x | >=4.x | >=2.x | 1.x |
| 3.x | >=4.x | >=2.x | >=2.x |

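The table can also be encoded as data to check a set of catalog releases before starting a curation run. The following is only a sketch: the helper names (`major`, `incompatible_catalogs`) are hypothetical and not part of pylexibank, and reading `1.x` as "exactly major version 1" is an assumption about what the table means.

```python
# Compatibility table from above, keyed by pylexibank major version.
# Each entry maps a catalog name to (operator, major version).
# Assumption: ">=" means "this major version or later",
# "==" means "exactly this major version" (our reading of "1.x").
COMPATIBILITY = {
    2: {"Glottolog": (">=", 4), "Concepticon": (">=", 2), "CLTS": ("==", 1)},
    3: {"Glottolog": (">=", 4), "Concepticon": (">=", 2), "CLTS": (">=", 2)},
}


def major(version):
    """Return the major version number of a string like 'v4.5' or '2.1.0'."""
    return int(version.lstrip("v").split(".")[0])


def incompatible_catalogs(pylexibank_version, catalog_versions):
    """Return the names of catalogs whose release does not match the table."""
    requirements = COMPATIBILITY[major(pylexibank_version)]
    problems = []
    for name, (op, required) in requirements.items():
        found = major(catalog_versions[name])
        ok = found >= required if op == ">=" else found == required
        if not ok:
            problems.append(name)
    return problems


catalogs = {"Glottolog": "v4.5", "Concepticon": "v2.5.0", "CLTS": "v1.4.1"}
print(incompatible_catalogs("3.4.0", catalogs))  # prints ['CLTS']
print(incompatible_catalogs("2.8.2", catalogs))  # prints []
```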
Since pylexibank has quite a few dependencies, installing it will also install
many other python packages. To avoid any side effects for your default
python installation, we recommend installing it in a
virtual environment.
Now you may install pylexibank
via pip or in development mode following the instructions
in CONTRIBUTING.md.
Installing pylexibank will also install cldfbench, which in turn installs a cli command cldfbench. This command is used
to run pylexibank functionality from the command line as subcommands.

cldfbench is also used to manage reference catalogs, in particular Glottolog,
Concepticon and CLTS. Thus, after installing pylexibank you should run

    cldfbench catconfig

to make sure the catalog data is locally available and pylexibank knows about it.
pylexibank can be used in two ways:

- The cldfbench command line interface provides access to functionality for the lexibank curation workflow.
- The pylexibank package can also be used like any other python package in your own
  python code to access lexibank data in a programmatic (and consistent) way.

The cmd_makecldf method

The main goal of pylexibank is creating high-quality CLDF Wordlists. This
happens in the custom cmd_makecldf method of a Lexibank dataset. To make this task
easier, pylexibank provides:

- args.glottolog.api points to an instance of CachingGlottologAPI (a subclass of pyglottolog.Glottolog)
- args.concepticon.api points to an instance of CachingConcepticonAPI (a subclass of pyconcepticon.Concepticon)
- Dataset.form_spec, an instance of pylexibank.FormSpec which can be customized per
  dataset. FormSpec is meant to capture the rules that have been used when compiling
  the source data. For cases where the source data violates these rules, wholesale
  replacement by listing a lexeme in etc/lexemes.csv is recommended.
- pylexibank.models
- etc_dir
- segments package with orthography profile(s):
  - If etc/orthography.tsv exists, a segments.Tokenizer instance, initialized with this profile, will be available as Dataset.tokenizer
    and automatically used by LexibankWriter.add_form.
  - If etc/orthography/ exists, all *.tsv files in it will be considered
    orthography profiles, and a dict mapping filename stem to tokenizer will be available. Tokenizer
    selection can be controlled in two ways:
    - by passing profile=FILENAME_STEM in Dataset.tokenizer() calls,
    - or by letting Dataset.tokenizer choose the tokenizer by item['Language_ID'].

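The selection rule for multiple orthography profiles can be illustrated with a small standalone sketch. It is not pylexibank's implementation: `load_profiles` and `tokenize` are hypothetical helpers, the tokenizers are simple stand-ins rather than segments.Tokenizer instances, and the profile file contents are placeholders. It only shows the described behaviour: one tokenizer per `*.tsv` filename stem, chosen by an explicit `profile` or else by `item['Language_ID']`.

```python
import pathlib
import tempfile


def load_profiles(orthography_dir):
    """Map filename stem -> stand-in tokenizer for every *.tsv file in the directory."""
    tokenizers = {}
    for path in sorted(pathlib.Path(orthography_dir).glob("*.tsv")):
        # Stand-in only: a real setup would build a tokenizer from the profile.
        # Here each "tokenizer" tags the form with its profile name and
        # splits it into single characters.
        tokenizers[path.stem] = (
            lambda form, stem=path.stem: "{}:{}".format(stem, " ".join(form))
        )
    return tokenizers


def tokenize(tokenizers, item, form, profile=None):
    """Pick a tokenizer by explicit profile name, else by item['Language_ID']."""
    key = profile if profile is not None else item["Language_ID"]
    return tokenizers[key](form)


with tempfile.TemporaryDirectory() as tmp:
    # Two placeholder profiles, named after hypothetical language IDs.
    for stem in ("german", "french"):
        pathlib.Path(tmp, stem + ".tsv").write_text("Grapheme\tIPA\n", encoding="utf8")
    tokenizers = load_profiles(tmp)

item = {"Language_ID": "german"}
print(tokenize(tokenizers, item, "haus"))                   # prints german:h a u s
print(tokenize(tokenizers, item, "eau", profile="french"))  # prints french:e a u
```

Naming the profile files after language IDs means no `profile` argument is needed in the common case.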
While some level of support for reading and writing any CLDF dataset is already provided by the pycldf
package, pylexibank (building on cldfbench) adds another layer of abstraction which supports,
among other things, managing datasets as Python packages (via pip).

pylexibank enabled datasets

Turning a Lexibank dataset into a (pip installable) Python package is as simple as writing a setup script setup.py.
But to make the dataset available for curation via pylexibank, the dataset must provide

- a subclass of pylexibank.Dataset, which specifies
  - Dataset.dir: A directory relative to which the curation directories are located.
  - Dataset.id: An identifier of the dataset.
- a lexibank.dataset entry point in setup.py.

E.g.

    entry_points={
        'lexibank.dataset': [
            'sohartmannchin=lexibank_sohartmannchin:Dataset',
        ]
    },

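To make the structure of such an entry point explicit, here is a minimal sketch that takes the string apart. `parse_entry_point` is a hypothetical helper written for illustration; it relies only on the standard `name=module:attribute` entry point syntax. That the name on the left matches the dataset identifier is an assumption based on the example above.

```python
def parse_entry_point(spec):
    """Split 'name=module:attribute' into its three parts."""
    name, _, target = spec.partition("=")
    module, _, attribute = target.partition(":")
    return name.strip(), module.strip(), attribute.strip()


# Same mapping as passed to setup() in the example above.
entry_points = {
    "lexibank.dataset": [
        "sohartmannchin=lexibank_sohartmannchin:Dataset",
    ]
}

for spec in entry_points["lexibank.dataset"]:
    name, module, attribute = parse_entry_point(spec)
    # name:      entry point name (assumed to match the dataset identifier)
    # module:    importable Python module shipped with the dataset
    # attribute: the dataset class inside that module
    print(name, module, attribute)  # prints sohartmannchin lexibank_sohartmannchin Dataset
```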
Turning datasets into pylexibank enabled python packages has multiple advantages.

cmd_install code should install from that rather than requiring an external dependency.

Dependence on pylexibank:

- in setup.py, i.e. require pylexibank>=1.x.y
- in cldf-metadata.json using the prov:createdBy property (pylexibank will take care of this when the CLDF is created via lexibank makecldf).

GitHub provides a very good platform for collaborative curation of textual data such as Lexibank datasets.
Dataset curators are encouraged to make use of features beyond just version control.

Note that for datasets curated with pylexibank, summary statistics will be written to README.md
as part of the makecldf command.

In addition to the support for collaboratively editing and versioning data, GitHub supports tying into additional services via webhooks. In particular, two of these services are relevant for Lexibank datasets.

Additional tags can be added to a release using git as follows:

    git checkout tags/vX.Y.Z
    git tag -a "clics-2.0"
    git push origin --tags

There are multiple levels of contributions to a Lexibank dataset:

- The pylexibank curation workflow involves adding code, mapping to reference catalogs and, to some extent, linguistic judgements. These contributions are listed in a dataset's CONTRIBUTORS.md
  and translate to the list of authors of released versions of the lexibank dataset.