spraakbanken / korp-backend

Backend for Korp, Språkbanken's corpus search tool
https://spraakbanken.gu.se/eng/korp
MIT License

Korp Backend

This is the backend for Korp, a corpus search tool developed by Språkbanken at the University of Gothenburg, Sweden.

The code is distributed under the MIT license.

The Korp backend is a Python 3 WSGI application, acting as a wrapper for Corpus Workbench.

See the CHANGELOG for what has changed in recent versions.

Requirements

To use the basic features of the Korp backend you need the following:

To use the additional features such as the Word Picture you also need:

For optional (but strongly recommended) caching you need:

Installing the required software

These instructions assume you are running a UNIX-like operating system (Linux, macOS, etc).

Corpus Workbench

Download the current stable version of Corpus Workbench. Install by following the Installing the CWB Core instructions, either by using the provided packages or building from source. Refer to the included INSTALL text file for further instructions.

Once CWB is installed, by default you will find it under /usr/local/cwb-X.X.X/bin (where X.X.X is the version number). Confirm that the installation was successful by running:

/usr/local/cwb-X.X.X/bin/cqp -v

CWB needs two directories for storing corpora: one for the data and one for the corpus registry. You may create these directories wherever you want, but from here on we will assume that you have created the following two:

/corpora/data
/corpora/registry

Setting up the Python environment and requirements

Optionally you may set up a virtual Python environment:

$ python3 -m venv venv
$ source venv/bin/activate

Install the required Python modules using pip with the included requirements.txt.

$ pip3 install -r requirements.txt

Configuring Korp

The supplied config.py contains the default configuration. To override the default configuration, make a copy of config.py and place it in a directory named instance in the repo root directory, and edit that copy.

The following variables need to be set for Korp to work:

If you are planning on using functionality dependent on a database, you also need to set the following variables:

For caching to work you need to specify both a cache directory (CACHE_DIR) and a Memcached server address or socket (MEMCACHED_SERVER).
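As a rough illustration, the cache-related part of an override file in the instance directory might look like the sketch below. Only the two variable names come from this README. Both values are placeholder assumptions, and the exact value format the backend expects should be checked against the comments in the supplied config.py.

```python
# instance/config.py (sketch): cache-related overrides only.
# Both values are placeholder assumptions; adjust them to your system.

# Directory used for the on-disk cache.
CACHE_DIR = "/tmp/korp_cache"

# Memcached server, written here as host:port (11211 is Memcached's
# default port). A socket path could be given instead.
MEMCACHED_SERVER = "localhost:11211"
```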

Running the backend

To start the backend, run run.py:

python3 run.py

The backend should then be reachable in your web browser on the port you configured in config.py, for example http://localhost:1234.

During development or while testing your configuration, use the dev flag to enable automatic reloading:

python3 run.py dev

For deployment, Gunicorn works well.

gunicorn --worker-class gevent --bind 0.0.0.0:1234 --workers 4 --max-requests 250 --limit-request-line 0 'run:create_app()'

Cache management

Most caching is done using Memcached, except for CWB query results, which are temporarily saved to disk to speed up KWIC pagination. Memcached removes old entries by itself, but you still have to tell it to invalidate parts of the cache when one or more corpora are updated or added. Accessing the /cache endpoint does this and also cleans up the disk cache. Consider setting up a cron job or similar to access the endpoint regularly, which makes cache maintenance fully automatic.
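A script for such a scheduled job could be as small as the following sketch. The helper names cache_url and refresh_cache are hypothetical, the use of a plain GET request is an assumption, and the base URL must be replaced with the address of your own backend.

```python
import urllib.request


def cache_url(base_url: str) -> str:
    """Return the URL of the /cache endpoint for a backend base URL."""
    return base_url.rstrip("/") + "/cache"


def refresh_cache(base_url: str, timeout: float = 300.0) -> int:
    """Request the /cache endpoint and return the HTTP status code.

    Assumes the endpoint answers a plain GET request.
    """
    with urllib.request.urlopen(cache_url(base_url), timeout=timeout) as response:
        return response.status
```

A cron job could then run a small script that calls refresh_cache("http://localhost:1234"), using the host and port from your own configuration.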

API documentation

The API documentation is available as an OpenAPI specification in docs/api.yaml, or online at https://ws.spraakbanken.gu.se/docs/korp.

Adding corpora

Korp works as a layer on top of Corpus Workbench for most corpus search functionality. See the CWB corpus encoding tutorial for information regarding encoding corpora. Note that Korp requires your corpora to be encoded in UTF-8. Values of structural attributes may not contain tab characters. Once CWB is aware of your corpora they will be accessible through the Korp API.

Adding additional info about the corpus

For Korp to show the number of sentences and the date when a corpus was last updated, you have to manually add this information. Create a file called ".info" in the directory of the CWB data files for the corpus, and add to it the following lines (editing the values to match your material). Be sure to end the file with a blank line:

Sentences: 12345
Updated: 2019-11-30
FirstDate: 2001-01-16 00:00:00
LastDate: 2001-01-30 23:59:59

Once this file is in place, Korp will be able to access this information.
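If you prefer to generate the file from a script, a minimal sketch could look like this. The function name write_info_file is hypothetical, and the sketch interprets "end the file with a blank line" as one empty line after the last entry.

```python
from pathlib import Path


def write_info_file(data_dir, sentences, updated, first_date, last_date):
    """Write a .info file into a corpus data directory and return its path."""
    lines = [
        f"Sentences: {sentences}",
        f"Updated: {updated}",
        f"FirstDate: {first_date}",
        f"LastDate: {last_date}",
    ]
    # Each entry ends with a newline, and one empty line follows the
    # last entry (an interpretation of the blank-line requirement).
    content = "\n".join(lines) + "\n\n"
    path = Path(data_dir) / ".info"
    path.write_text(content, encoding="utf-8")
    return path
```

For the example values above, the call would be write_info_file("/corpora/data/mycorpus", 12345, "2019-11-30", "2001-01-16 00:00:00", "2001-01-30 23:59:59"), where the mycorpus directory name is a placeholder.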

Corpus structure requirements

To use the basic concordance features of Korp there are no particular requirements regarding the markup of your corpora.

To use the Word Picture functionality your corpus must adhere to the following format:

To use the Trend Diagram functionality, your corpus needs to be annotated with date information using the following four structural attributes: text_datefrom, text_timefrom, text_dateto, text_timeto. The date format should be YYYYMMDD, and the time format hhmmss. A corpus dated 2006 would have the following values:
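The four values can be derived from a start and end timestamp, as in the sketch below. It assumes that "dated 2006" means the corpus covers the whole year, from the first second of 1 January to the last second of 31 December. The function name date_attributes is hypothetical.

```python
from datetime import datetime


def date_attributes(start: datetime, end: datetime) -> dict:
    """Format a time span as the four structural attribute values."""
    return {
        "text_datefrom": start.strftime("%Y%m%d"),  # YYYYMMDD
        "text_timefrom": start.strftime("%H%M%S"),  # hhmmss
        "text_dateto": end.strftime("%Y%m%d"),
        "text_timeto": end.strftime("%H%M%S"),
    }


# Assumption: the corpus spans all of 2006.
values = date_attributes(
    datetime(2006, 1, 1, 0, 0, 0),
    datetime(2006, 12, 31, 23, 59, 59),
)
# values == {"text_datefrom": "20060101", "text_timefrom": "000000",
#            "text_dateto": "20061231", "text_timeto": "235959"}
```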

Database tables

This section describes the database tables needed to use the Word Picture, Lemgram index and Trend Diagram features. If you don't need any of these features, you can skip this section.

Relations for the Word Picture

The Word Picture data consists of head-relation-dependent triplets and frequencies. For every corpus, you need the six database tables listed below. The table structures are as follows:

Table name: relations_CORPUSNAME  
Charset:    UTF-8  

Columns:  
    id             int                  A unique ID (within this table)  
    head           int                  Reference to an ID in the strings table (below). The head word in the relation  
    rel            enum(...)            The syntactic relation  
    dep            int                  Reference to an ID in the strings table (below). The dependent in the relation  
    freq           int                  Frequency of the triplet (head, rel, dep)  
    bfhead         bool                 True if head is a base form (or lemgram)  
    bfdep          bool                 True if dep is a base form (or lemgram)  
    wfhead         bool                 True if head is a word form  
    wfdep          bool                 True if dep is a word form  

Indexes:  
    (head, wfhead, dep, rel, freq, id)  
    (dep, wfdep, head, rel, freq, id)  
    (head, dep, bfhead, bfdep, rel, freq, id)  
    (dep, head, bfhead, bfdep, rel, freq, id)

Table name: relations_CORPUSNAME_strings  
Charset:    UTF-8  

Columns:  
    id             int                  A unique ID (within this table)  
    string         varchar(100)         The head or dependent string  
    stringextra    varchar(32)          Optional preposition for the dependent  
    pos            varchar(5)           Part-of-speech for the head or dependent  

Indexes:  
    (string, id, pos, stringextra)  
    (id, string, pos, stringextra)

Table name: relations_CORPUSNAME_rel  
Charset:    UTF-8  

Columns:  
    rel            enum(...)            The syntactic relation  
    freq           int                  Frequency of the relation  

Indexes:  
    (rel, freq)  

Table name: relations_CORPUSNAME_head_rel  
Charset:    UTF-8  

Columns:  
    head           int                  Reference to an ID in the strings table. The head word in the relation  
    rel            enum(...)            The syntactic relation  
    freq           int                  Frequency of the pair (head, rel)  

Indexes:  
    (head, rel, freq)

Table name: relations_CORPUSNAME_dep_rel  
Charset:    UTF-8  

Columns:  
    dep            int                  Reference to an ID in the strings table. The dependent in the relation  
    rel            enum(...)            The syntactic relation  
    freq           int                  Frequency of the pair (rel, dep)  

Indexes:  
    (dep, rel, freq)

Table name: relations_CORPUSNAME_sentences  
Charset:    UTF-8  

Columns:  
    id             int                  An ID from relations_CORPUSNAME
    sentence       varchar(64)          A sentence ID (see the section about corpus structure above)  
    start          int                  The position of the first word of the relation in the sentence  
    end            int                  The position of the last word of the relation in the sentence  

Indexes:  
    (id)

In the main relations_CORPUSNAME table, each relation should be represented three times: once with both dependent and head as base forms, once with dependent as base form and head as word form, and once with dependent as word form and head as base form. This allows searching for both base forms and word forms, giving different results for different searched word forms, while the results are always displayed as base forms. If the base form annotation is missing for a dependent, head or both, the word form can be used as both word form and base form by setting both bfhead/bfdep and wfhead/wfdep to True. In that case you won't need all three rows for that relation.

The sentences table contains sentence IDs for sentences containing the relations, with start and end values to point out exactly where in the sentences the relations occur (1 being the first word of the sentence).
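The row expansion for the main relations table can be sketched as follows. This is an illustration rather than the tooling used to build the tables: the names Row and relation_rows are hypothetical, strings are used directly where the real table stores IDs from the strings table, and the id, rel and freq columns are left out.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    head: str
    dep: str
    bfhead: bool
    bfdep: bool
    wfhead: bool
    wfdep: bool


def relation_rows(head_wf: str, head_bf: Optional[str],
                  dep_wf: str, dep_bf: Optional[str]) -> list:
    """Return the rows needed for one relation (two or one if base forms are missing)."""
    # A missing base form means the word form stands in for it,
    # so both the bf and wf flags are set for that side.
    head_both = head_bf is None
    dep_both = dep_bf is None
    head_base = head_wf if head_both else head_bf
    dep_base = dep_wf if dep_both else dep_bf

    candidates = [
        # Base-form head, base-form dependent.
        Row(head_base, dep_base, True, True, head_both, dep_both),
        # Word-form head, base-form dependent.
        Row(head_wf, dep_base, head_both, True, True, dep_both),
        # Base-form head, word-form dependent.
        Row(head_base, dep_wf, True, dep_both, head_both, True),
    ]
    # Rows that collapse into each other are only stored once;
    # dict.fromkeys keeps the first occurrence of each row, in order.
    return list(dict.fromkeys(candidates))
```

With both base forms present the function returns three rows. If one base form is missing, two of the candidate rows become identical and only two rows remain.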

Lemgram Index

The lemgram index is an index of every lemgram in every corpus, along with the number of occurrences. This is used by the frontend to grey out auto-completion suggestions which would not give any results in the selected corpora. The lemgram index consists of a single MySQL table, with the following layout:

Table name: lemgram_index  
Charset:    UTF-8  

Columns:  
    lemgram      varchar(64)         The lemgram  
    freq         int                 Number of occurrences  
    freq_prefix  int                 Number of occurrences as a prefix  
    freq_suffix  int                 Number of occurrences as a suffix  
    corpus       varchar(64)         The corpus name  

Indexes:  
    (lemgram, corpus, freq, freq_prefix, freq_suffix)
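To make the layout concrete, the sketch below creates the table in an in-memory SQLite database and sums the frequency of a lemgram over a set of selected corpora. SQLite is used only to keep the example self-contained (a real installation uses MySQL), the sample rows are invented, and the query is an illustration, not necessarily the one the backend runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE lemgram_index (
        lemgram     VARCHAR(64),
        freq        INT,
        freq_prefix INT,
        freq_suffix INT,
        corpus      VARCHAR(64)
    )
    """
)
conn.execute(
    "CREATE INDEX lemgram_corpus ON lemgram_index "
    "(lemgram, corpus, freq, freq_prefix, freq_suffix)"
)
# Invented sample data: lemgram_a occurs in CORPUS1 but not in CORPUS2.
conn.executemany(
    "INSERT INTO lemgram_index VALUES (?, ?, ?, ?, ?)",
    [("lemgram_a", 12, 3, 1, "CORPUS1"), ("lemgram_b", 7, 0, 0, "CORPUS2")],
)


def total_freq(lemgram, corpora):
    """Sum the occurrences of a lemgram over the given corpora."""
    placeholders = ", ".join("?" for _ in corpora)
    row = conn.execute(
        "SELECT COALESCE(SUM(freq), 0) FROM lemgram_index "
        f"WHERE lemgram = ? AND corpus IN ({placeholders})",
        [lemgram, *corpora],
    ).fetchone()
    return row[0]
```

A total of zero for the selected corpora is the case in which a frontend would grey out the suggestion.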

Time data

For the Trend Diagram, you need to add token-per-time-span data to your database. For tokens without date or time info, use the date 0000-00-00 00:00:00. Use the following table layout:

Table name: timedata  
Charset:    UTF-8  

Columns:  
    corpus    varchar(64)        The corpus name
    datefrom  datetime           Full from-date and time
    dateto    datetime           Full to-date and time
    tokens    int                Number of tokens between from-date and (including) to-date

Indexes:  
    (corpus, datefrom, dateto)

Table name: timedata_date  
Charset:    UTF-8  

Columns:  
    corpus    varchar(64)        The corpus name
    datefrom  date               From-date (only date part)
    dateto    date               To-date (only date part)
    tokens    int                Number of tokens between from-date and (including) to-date

Indexes:  
    (corpus, datefrom, dateto)
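One way to produce rows for both tables is to aggregate token counts per time span, as in this sketch. It assumes that timedata_date holds the same counts grouped by the date part only, which is how the column descriptions read. The function name timedata_rows is hypothetical, and tokens without date information are left out because the all-zero date cannot be represented as a Python datetime.

```python
from collections import Counter
from datetime import datetime


def timedata_rows(corpus, spans):
    """Aggregate (datefrom, dateto, tokens) spans into rows for both tables.

    datefrom and dateto are datetime objects; tokens is an int.
    """
    full = Counter()
    by_date = Counter()
    for datefrom, dateto, tokens in spans:
        full[(datefrom, dateto)] += tokens
        # Same span with the time of day dropped.
        by_date[(datefrom.date(), dateto.date())] += tokens
    timedata = [(corpus, f, t, n) for (f, t), n in sorted(full.items())]
    timedata_date = [(corpus, f, t, n) for (f, t), n in sorted(by_date.items())]
    return timedata, timedata_date
```

Two spans covering the morning and the afternoon of the same day yield two rows for timedata and a single row with the summed token count for timedata_date.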

Corpus Configuration for the Korp Frontend

The corpus configuration used by the Korp frontend is served by the backend. In config.py, the variable CORPUS_CONFIG_DIR should point to a directory having the following structure:

.
├── attributes
│   ├── positional
│   │   ├── lemma.yaml
│   │   ├── msd.yaml
│   │   ├── ...
│   │   └── pos.yaml
│   └── structural
│       ├── author.yaml
│       ├── title.yaml
│       ├── ...
│       └── year.yaml
├── corpora
│   ├── corpus1.yaml
│   ├── corpus2.yaml
│   ├── ...
│   └── yet-another-corpus.yaml
└── modes
    ├── default.yaml
    ├── another.yaml
    ├── ...
    └── other.yaml

For some inspiration, here are the config files used by the Korp instance at Språkbanken Text.

Note:
Most settings in these files that refer to labels or descriptions can optionally be localized using ISO 639-3 language codes. For example, a label can be written either like this:

label: author

... and like this:

label:
  eng: author
  swe: författare
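Code that consumes such settings has to handle both forms. The sketch below shows one way to do that once the YAML has been loaded into Python values. The function name localized and the fallback order (requested language, then "eng", then any available translation) are assumptions for illustration, not documented Korp behaviour.

```python
def localized(value, lang, fallback="eng"):
    """Return the text for lang from a plain or localized setting."""
    # Plain form: the same string is used for every language.
    if isinstance(value, str):
        return value
    # Localized form: a mapping from ISO 639-3 code to text.
    for code in (lang, fallback):
        if code in value:
            return value[code]
    # Neither language is present: return any available translation.
    return next(iter(value.values()))
```

For example, localized({"eng": "author", "swe": "författare"}, "swe") returns the Swedish label, while localized("author", "swe") returns the plain string unchanged.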

Mode Configuration

At least one mode file is required, and that file must be named default.yaml. This is the mode that will be loaded when no mode is explicitly requested.
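A quick sanity check of a configuration directory could look like the following sketch. The function name check_config_dir is hypothetical, and treating all four subdirectories as mandatory is an assumption; only modes/default.yaml is stated as required here.

```python
from pathlib import Path

# Subdirectories from the layout shown above (assumed mandatory here).
REQUIRED_DIRS = (
    "attributes/positional",
    "attributes/structural",
    "corpora",
    "modes",
)


def check_config_dir(root):
    """Return a list of problems found in a corpus configuration directory."""
    root = Path(root)
    problems = [
        f"missing directory: {name}"
        for name in REQUIRED_DIRS
        if not (root / name).is_dir()
    ]
    # The default mode file is the one explicitly required file.
    if not (root / "modes" / "default.yaml").is_file():
        problems.append("missing file: modes/default.yaml")
    return problems
```

An empty list means the directory has the expected layout. Pass the same path that CORPUS_CONFIG_DIR points to.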

Required:

Optional:

Corpus Configuration

Corpus configuration files are placed in the corpora folder, and the filename of each configuration file should correspond to a corpus ID in lowercase, followed by .yaml, e.g. mycorpus.yaml.
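The naming rule can be expressed as a small helper, sketched below. The function name corpus_config_path is hypothetical.

```python
from pathlib import Path


def corpus_config_path(config_dir, corpus_id):
    """Return the expected configuration file path for a corpus ID."""
    # The filename is the corpus ID in lowercase, followed by .yaml.
    return Path(config_dir) / "corpora" / f"{corpus_id.lower()}.yaml"
```

For a corpus with the ID MYCORPUS this yields a file named mycorpus.yaml inside the corpora folder.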

Required:

Optional:

Attribute Presets

See pos_attributes and struct_attributes above.