globalwordnet / gwadoc

documentation for things like relations and parts of speech used by wordnets
https://globalwordnet.github.io/gwadoc/
Creative Commons Attribution 4.0 International

How to store the data #1

Open goodmami opened 5 years ago

goodmami commented 5 years ago

We need to decide on a good way to store the gwadoc data, but it's not yet clear what the intended uses are, or who the intended users are, beyond generating the HTML documentation. The current (not checked-in) data is a Python file that fills dictionaries with data. If generating documentation is the only use, we may as well put it directly into reStructuredText. If we want a Python API, e.g., to request the localized name, definition, reverse, etc. from OMW, then it might make sense to make Python classes (Sphinx's autodoc could then possibly be used to generate the docs).

In either case we could store the data in a data file and transform it (perhaps with validation) into the target representation. I propose using TOML. Even though it is relatively new and not in the standard library, it was chosen for Rust's package manager and for the future of Python packaging (see PEP 518), so it is supported by major projects.

Here's what (part of) hypernym would look like:

[hypernym]

  [hypernym.name]
    en = "Hypernym"
    symbol = "⊃"
    ja = "上位語"

  [hypernym.def]
    en = "a word that is more general than a given word"
    pl = "Relacja łącząca znaczenie z drugim, ogólniejszym, niż to pierwsze, ale należącym do tej samej części mowy, co ono"
    ja = "当該synsetが相手synsetに包含される"
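
As a rough illustration of the "transform it (perhaps with validation)" step, here is a minimal sketch that checks data shaped like the table above and renders one relation as a reStructuredText section. The inventories VALID_FIELDS and VALID_KEYS and the functions validate and to_rst are hypothetical names made up for this example, and the data is written as a plain dictionary rather than parsed from a file.

```python
# Hypothetical inventories; the real lists of fields and languages are not
# given in this thread.
VALID_FIELDS = {"name", "def"}
VALID_KEYS = {"en", "ja", "pl", "symbol"}  # "symbol" sits next to the language codes, as in the example


def validate(relations):
    """Raise ValueError on any unknown field or language key."""
    for rel, fields in relations.items():
        for field, texts in fields.items():
            if field not in VALID_FIELDS:
                raise ValueError(f"{rel}: unknown field {field!r}")
            for key in texts:
                if key not in VALID_KEYS:
                    raise ValueError(f"{rel}.{field}: unknown key {key!r}")


def to_rst(rel, fields, lang="en"):
    """Render one relation as a labelled reStructuredText section."""
    title = fields["name"][lang]
    # The underline length is counted in characters, which is fine for
    # English titles; wide characters may need a longer underline.
    lines = [f".. _{rel}:", "", title, "-" * len(title), ""]
    definition = fields.get("def", {}).get(lang)
    if definition:
        lines += [definition, ""]
    return "\n".join(lines)


relations = {
    "hypernym": {
        "name": {"en": "Hypernym", "symbol": "⊃", "ja": "上位語"},
        "def": {"en": "a word that is more general than a given word"},
    }
}

validate(relations)
print(to_rst("hypernym", relations["hypernym"]))
```

A misspelled field such as "nmae" would fail in validate before any documentation is generated.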

There's some flexibility in TOML (though less than in YAML, which is a good thing). Something like this would be equivalent, e.g., if you want to group all attributes by language:

[hypernym]
name.en = "Hypernym"
def.en = "a word that is more general than a given word"
# etc...

And while I would like to place this file (gwadoc.toml or whatever) at the top level so it's more prominent for non-Python users/contributors, that would make it much more difficult to distribute with the project and for the Python code to find at runtime. So it might go under gwadoc/gwadoc.toml instead.
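
With the file inside the package directory, the code can locate it relative to its own module file. This is only a sketch: data_file_path and DATA_FILENAME are hypothetical names, and the packaging configuration would also need to list the file as package data for it to be distributed.

```python
from pathlib import Path

DATA_FILENAME = "gwadoc.toml"  # file name suggested above; not final


def data_file_path(module_file):
    """Return the path of the data file that sits next to the given module file."""
    return Path(module_file).parent / DATA_FILENAME


# Inside gwadoc/__init__.py this would be called as data_file_path(__file__).
print(data_file_path("gwadoc/__init__.py"))
```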

As an alternative, if we don't care much about non-Python users, we could make a Python class like Relation and do things like this:

rels['hypernym'] = Relation(
    name={
        "en": "Hypernym",
        "ja": "上位語",
    },
    definition={  # "def" is a reserved word in Python and cannot be used as a keyword argument
        "en": "a word that is more general than a given word",
    }
)

Then query it like this:

>>> hypernym = rels['hypernym']
>>> hypernym.name['en']
'Hypernym'
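
For the snippet above to run, a Relation class has to exist. Here is one minimal way it could look. This is a hypothetical sketch, not an existing gwadoc class, and it uses definition as the parameter name because def is a reserved word in Python.

```python
class Relation:
    """Hypothetical container for the localized fields of one relation."""

    def __init__(self, name=None, definition=None):
        # Copy the inputs so later changes to the caller's dicts do not leak in.
        self.name = dict(name or {})
        self.definition = dict(definition or {})


rels = {}
rels['hypernym'] = Relation(
    name={"en": "Hypernym", "ja": "上位語"},
    definition={"en": "a word that is more general than a given word"},
)

hypernym = rels['hypernym']
print(hypernym.name['en'])
print(hypernym.definition['en'])
```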

fcbond commented 5 years ago

I think we might leave it as a Python dictionary for the moment, and concentrate on using and extending it.

Converting to TOML looks like it may make it easier to edit down the road.

goodmami commented 5 years ago

For now I've settled on having data structures that behave like both dictionaries and classes, in that they allow for key-lookup (e.g. rels['hypernym']['name']['en']) as well as dot-access (rels.hypernym.name.en). The former is useful when the relation or property name is in a variable, where rels[relation] is preferable to getattr(rels, relation); the latter is much simpler and makes editing the file easier. I also made the data structures raise errors on invalid keys/attributes and defined inventories of valid relations, forms, projects, languages, etc., in order to reduce errors caused by simple typos.
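
To make the idea concrete, here is a minimal sketch of a structure that supports both access styles and rejects keys outside a fixed inventory. It is not the actual gwadoc code: the Record class, the make_relation helper, and the three small inventories are hypothetical, and the field is called definition because dot-access cannot use the reserved word def.

```python
class Record:
    """Mapping-like object with a fixed set of valid keys and dot-access."""

    def __init__(self, valid_keys):
        # Bypass our own __setattr__ so the internals become normal attributes.
        object.__setattr__(self, "_valid", frozenset(valid_keys))
        object.__setattr__(self, "_data", {})

    def __getitem__(self, key):
        if key not in self._valid:
            raise KeyError(f"invalid key: {key!r}")
        return self._data.get(key)  # None if valid but not yet set

    def __setitem__(self, key, value):
        if key not in self._valid:
            raise KeyError(f"invalid key: {key!r}")
        self._data[key] = value

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(f"invalid attribute: {name!r}") from None

    def __setattr__(self, name, value):
        try:
            self[name] = value
        except KeyError:
            raise AttributeError(f"invalid attribute: {name!r}") from None


# Hypothetical inventories of valid names.
RELATIONS = ("hypernym", "hyponym")
FIELDS = ("name", "definition")
LANGUAGES = ("en", "ja", "pl")


def make_relation():
    """Build an empty relation whose fields each accept the known languages."""
    rel = Record(FIELDS)
    for field in FIELDS:
        rel[field] = Record(LANGUAGES)
    return rel


rels = Record(RELATIONS)
for relation in RELATIONS:
    rels[relation] = make_relation()

# Dot-access keeps the data file easy to edit...
rels.hypernym.name.en = "Hypernym"

# ...while key-lookup is convenient when the name is held in a variable.
relation = "hypernym"
print(rels[relation]["name"]["en"])
```

With this sketch, a typo such as rels.hypernim raises AttributeError and rels["hypernim"] raises KeyError, instead of silently creating a new entry.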

I'll leave this issue open as a feature request for future versions.