mr-martian / rebabel-format

Python library for interacting with reBabel data files
MIT License

Rethinking `tiers` Table #13

Open mr-martian opened 1 month ago


Currently, features are stored in the tiers table, which has the following schema:

CREATE TABLE tiers(
       id INTEGER PRIMARY KEY,
       tier TEXT,
       feature TEXT,
       unittype TEXT,
       valuetype TEXT,
       CHECK(valuetype = 'int' OR valuetype = 'bool' OR
             valuetype = 'str' OR valuetype = 'ref')
);

Having now made use of it, I find a few problems:

Hierarchy

In most cases, the value of tier has been an identifier for the source data format, such as UD or FlexText. For both of those, however, there have already been attempts at sub-namespaces, namely UD/FEATS and FlexText/en.

Additionally, the configuration files allow features to be referred to as either a single string ("UD:lemma") or as a structured object ({"tier": "UD", "feature": "lemma"}), which leads to a fair amount of complexity in the code.

A potential solution to both of these issues is to merge the tier and feature fields into a single string field (probably name) and in the cases where we care about the name as a structured value, split it on a separator (probably :).
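As a rough illustration of the merged-name idea, here is a minimal sketch. The helper names `feature_name` and `split_name` are hypothetical and not part of the current code, and `UD:FEATS:Case` is only an assumed example of what a nested name might look like with : as the separator.

```python
SEPARATOR = ":"  # assumed separator, as floated in the proposal


def feature_name(ref):
    """Normalize a feature reference to a single name string.

    Accepts either form the configuration files currently allow:
    "UD:lemma" or {"tier": "UD", "feature": "lemma"}.
    """
    if isinstance(ref, str):
        return ref
    return ref["tier"] + SEPARATOR + ref["feature"]


def split_name(name):
    """Split a merged name into its components when structure matters."""
    return name.split(SEPARATOR)


print(feature_name({"tier": "UD", "feature": "lemma"}))
print(split_name("UD:FEATS:Case"))
```

With something like this, the rest of the code would only ever see the string form, and the structured form would be handled once, at the point where the configuration is read.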

Unit Type Specification

I wonder whether defining features per unit type is strictly necessary. We could instead just specify that UD:lemma is a string feature and not restrict its application at the database level.
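A sketch of what that could look like, assuming a merged name column and a hypothetical table called `features` (neither exists in the current schema). It uses Python's standard sqlite3 module with an in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per feature name, with no unittype column.
conn.execute("""
    CREATE TABLE features(
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE,
        valuetype TEXT NOT NULL,
        CHECK(valuetype IN ('int', 'bool', 'str', 'ref'))
    )
""")


def declare(name, valuetype):
    """Return True if the feature was newly declared, False if the name is taken."""
    try:
        conn.execute(
            "INSERT INTO features(name, valuetype) VALUES (?, ?)",
            (name, valuetype),
        )
        return True
    except sqlite3.IntegrityError:
        # The UNIQUE constraint on name rejected a second declaration.
        return False


declare("UD:lemma", "str")
```

Here the type is attached to the name alone, so any unit could carry UD:lemma, and a second declaration of the same name is rejected by the UNIQUE constraint.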

Alternatively, we could at least constrain the table so that two features with the same name can't be different types.
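One possible way to express that constraint while keeping the existing columns is a trigger. This is only a sketch: the trigger name and the unit type names `word` and `sentence` are assumptions for illustration, and a matching trigger on UPDATE would be needed for full coverage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tiers(
        id INTEGER PRIMARY KEY,
        tier TEXT,
        feature TEXT,
        unittype TEXT,
        valuetype TEXT,
        CHECK(valuetype IN ('int', 'bool', 'str', 'ref'))
    );

    -- Hypothetical trigger: reject a row whose (tier, feature) already
    -- exists with a different valuetype, regardless of unittype.
    CREATE TRIGGER tiers_consistent_type
    BEFORE INSERT ON tiers
    WHEN EXISTS (
        SELECT 1 FROM tiers
        WHERE tier = NEW.tier
          AND feature = NEW.feature
          AND valuetype != NEW.valuetype
    )
    BEGIN
        SELECT RAISE(ABORT, 'feature already declared with a different valuetype');
    END;
""")


def add_feature(tier, feature, unittype, valuetype):
    """Insert a feature row; return False if the trigger rejects it."""
    try:
        conn.execute(
            "INSERT INTO tiers(tier, feature, unittype, valuetype) VALUES (?, ?, ?, ?)",
            (tier, feature, unittype, valuetype),
        )
        return True
    except sqlite3.DatabaseError:
        return False


add_feature("UD", "lemma", "word", "str")
```

The same feature can still be declared for several unit types, as long as every declaration agrees on the value type.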

Reference Features

It has repeatedly become apparent that reference features behave differently from the other types. There are several places in the code where reference features are a special case that has not been implemented. If we implement edge features (#12), it might then be reasonable to simply discard reference features entirely.