lgessler / glam

(WIP) a webapp for language documentation
Eclipse Public License 2.0
Zero-length tokens #42

Open lgessler opened 6 months ago

lgessler commented 6 months ago

Motivation

Consider:

  1. I saw a fish in the water.
  2. I saw fish-∅ in the water.

In (2), many traditional item-and-arrangement (IA) accounts of English pluralization would posit a zero allomorph of the plural morpheme after fish, as shown. Null allomorphy is of course common cross-linguistically, and many documentary linguists use similar IA accounts.

This poses a problem, since we require that tokens be anchored in a textual substring, and null morphs have no textual representation.

Proposed Feature

Following a discussion of how to handle this, we decided it was probably best to allow zero-length tokens. Tokens currently must contain at least one character, so lifting this restriction would allow you to, e.g., identify a zero-length substring beginning just after fish in (2) and use it as a null token to host the annotations for the plural morpheme.

One issue with this is that if multiple null tokens sit at the same position, there will be no way to tell their order. You could band-aid fix this in several ways, but for now, this seems niche enough that it's not worth handling.
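
To make the proposal and its ordering problem concrete, here is a minimal sketch in Python. It assumes a token is just a pair of character offsets into the text; `Token`, `validate`, and `allow_zero_length` are hypothetical names for illustration and are not glam's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    # Hypothetical representation: a half-open span [begin, end) over the text.
    begin: int
    end: int


def validate(token: Token, text: str, allow_zero_length: bool = False) -> bool:
    """Return True if the token is anchored in a valid substring of `text`."""
    if not (0 <= token.begin <= token.end <= len(text)):
        return False
    # Current rule: at least one character. Proposed rule: begin == end is allowed.
    return allow_zero_length or token.end > token.begin


text = "I saw fish in the water."
after_fish = text.index("fish") + len("fish")  # offset just after "fish"

# A null token that could host the plural morpheme's annotations.
plural = Token(after_fish, after_fish)

# A second null token at the same place has an identical span,
# so the offsets alone cannot say which of the two comes first.
other = Token(after_fish, after_fish)
```

Under the current rule `validate(plural, text)` is false, and with `allow_zero_length=True` it passes. Since `plural` and `other` compare equal, any ordering between them would have to come from extra information that this sketch does not store.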

lgessler commented 3 months ago

Having spent some time mulling this over, I think it may be better to do the following:

  1. Identify ∅ (U+2205) as a "special" character
  2. Require that all zero-length tokens actually map to a single-character substring of ∅
  3. Optionally offer API support facilitating the creation of zero-length tokens by automatically expanding the linked text with ∅

This avoids the issue of having exceptional handling of tokens, which seems over-complicated. The downside is that some clients might not want this visualized in the text, but limiting this behavior to a single character would allow clients to filter it out for display.
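
A rough sketch of how the ∅-based approach could work, again in Python and again under assumptions: tokens are half-open offset spans, and `insert_null_token` and `display_text` are hypothetical helpers, not existing glam functions. The helper expands the text with ∅, shifts the offsets of later tokens, and returns an ordinary one-character token over the inserted ∅.

```python
from dataclasses import dataclass

NULL_CHAR = "\u2205"  # ∅, the "special" character


@dataclass(frozen=True)
class Token:
    # Hypothetical representation: a half-open span [begin, end) over the text.
    begin: int
    end: int


def insert_null_token(text: str, tokens: list[Token], offset: int):
    """Insert ∅ at `offset`; return (new_text, shifted_tokens, null_token)."""
    if not 0 <= offset <= len(text):
        raise ValueError("offset is outside the text")
    new_text = text[:offset] + NULL_CHAR + text[offset:]
    shifted = []
    for t in tokens:
        # Tokens starting at or after the insertion point move right by one.
        begin = t.begin + 1 if t.begin >= offset else t.begin
        # Tokens ending after the insertion point also extend right by one.
        end = t.end + 1 if t.end > offset else t.end
        shifted.append(Token(begin, end))
    # The null token is an ordinary one-character token over the inserted ∅.
    return new_text, shifted, Token(offset, offset + 1)


def display_text(text: str) -> str:
    """Strip ∅ for clients that do not want it shown."""
    return text.replace(NULL_CHAR, "")


text = "I saw fish in the water."
tokens = [Token(0, 1), Token(2, 5), Token(6, 10), Token(11, 13)]  # I, saw, fish, in
new_text, new_tokens, null_token = insert_null_token(text, tokens, 10)
```

In this sketch a token that strictly contains the insertion point grows to include the ∅; whether that is desirable is a design choice the proposal leaves open. Note also that `display_text` only filters the string, so a client that hides ∅ would still need to map token offsets back to the filtered text.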