unicode-org / inflection

code, data and documentation related to handling inflection problems

The best lexicon type/format to use #1

Open nciric opened 4 months ago

nciric commented 4 months ago

Lexicons are a critical part of the inflection project. They need to be used at runtime, and will also be used by our tools for potential ML training.

We need to decide on the format we collect the data in. This decision needs to be based on multiple criteria:

  1. Is the format open, or under a friendly license?
  2. Can other lexicons be converted into that format, so we have consistent data?
  3. Is the format efficient, to reduce size & allow quick lookup?
  4. Are there good existing tools for operating on the lexicon?
  5. Can the lexicon data be easily pruned to what the user needs, to reduce deployment size?

An example tool and format, used at some universities (see the language resources below):

  1. Unitex/GramLab, from a university in France, https://unitexgramlab.org/ (LGPL)
  2. Unitex lexicons (22 languages, with varied coverage), https://unitexgramlab.org/language-resources (LGPLLR)

They use the DELA class of dictionaries (I couldn't find a better link to describe the DELA format).
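For concreteness, a DELAF-style entry (as I understand the format from the Unitex documentation; the exact grammatical codes vary by language) pairs an inflected form with its lemma, a part-of-speech code plus optional semantic traits, and inflectional codes. A minimal sketch, with a hypothetical parser and an illustrative entry rather than one taken from a real dictionary:

    # Hypothetical parser for DELAF-style lines of the form
    #   inflected_form,lemma.POS+Traits:inflectional_codes
    # e.g. "houses,house.N:p" (the codes here are illustrative only).
    def parse_delaf_line(line: str) -> dict:
        surface, rest = line.split(",", 1)            # "houses" | "house.N:p"
        lemma, codes = rest.split(".", 1)             # "house"  | "N:p"
        gram, _, inflections = codes.partition(":")   # "N"      | "p"
        pos, *traits = gram.split("+")
        return {
            "surface": surface,
            "lemma": lemma or surface,  # an empty lemma field means lemma == surface
            "pos": pos,
            "traits": traits,
            "inflections": inflections.split(":") if inflections else [],
        }

    print(parse_delaf_line("houses,house.N:p"))
    # {'surface': 'houses', 'lemma': 'house', 'pos': 'N', 'traits': [], 'inflections': ['p']}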

What other options can we use? Are there other criteria for selecting a lexicon/dictionary?

nciric commented 4 months ago

Another approach is to use the UniMorph package.
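For comparison, UniMorph data files are, to the best of my knowledge, plain tab-separated triples of lemma, inflected form, and a semicolon-delimited feature bundle (columns shown here with spaces, but tab-separated in the actual files; the English rows are illustrative):

    house    houses    N;PL
    house    house     N;SG

The feature labels come from UniMorph's cross-lingual schema, which is part of what makes conversion from other lexicons feasible.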

BrunoCartoni commented 4 months ago

Before answering nciric@'s great questions, we also need to decide what our end goal is:

(1) Do we want to build/store a lexicon? (e.g. store "house, n: house,sing/houses,plur")

(2) Do we want to be able to generate inflected forms based on a "lemma" and grammatical info? (e.g. input: house + plural, output: houses)

(3) Do we want to be able to analyse an inflected form and provide its lemma and grammatical info? (e.g. input: houses, output: house, plur)

(4) ... other?
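To make the distinction concrete, here is a rough sketch of options (1)-(3) as function signatures; the names and types are purely illustrative, not a proposed API:

    # (1) lexicon storage/lookup: return the stored forms and features for a lemma
    def lookup(lemma: str) -> dict:
        ...

    # (2) generation: "house" + {"PLURAL"} -> "houses"
    def generate(lemma: str, features: set[str]) -> str:
        ...

    # (3) analysis: "houses" -> [("house", {"N", "PLURAL"})]
    def analyze(surface: str) -> list[tuple[str, set[str]]]:
        ...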

If we can clarify these points first, then we can choose the lexicon format.

JelenaMitrovic commented 4 months ago

Hello everyone, Jelena Mitrovic here. I'm very excited to have been invited by @nciric to join this effort.

@BrunoCartoni I have worked with UNITEX and DELA dictionaries for Serbian during my PhD (a while back). If we choose to go this route, the answers to your questions would be:

(1) the lexica DELAF (for simple words) and DELAC (for compounds) already have all the forms available alongside the lemma.

(2) the dictionary format contains this already. The problem is that dictionaries are limited, domain-specific, and project-specific, so people build their own and do not share them. We would thus have to find people who are willing to supplement the existing resources, or share the ones they have.

(3) this would be ideal, and again, possible with DELA dictionaries.

Regarding the UniMorph package, I do not have experience with it, but it seems to go well with Universal Dependencies. Inflection is dependent on syntax, so it might make sense to include at least the simplest UDs for each language.

The issues we are dealing with here are quite complex, and I hope to have a better understanding of the overall requirements for Unicode after our meeting.

nciric commented 4 months ago

Thanks Bruno and Jelena, see my answers below:

  1. We will need to build/store some words in the lexicon - e.g. exceptions to the rules.
  2. I would prefer to generate inflection forms where possible, to reduce the size of the lexicon (and lookup time). For example, for Serbian we would need 14 forms, including the lemma; generating them would reduce the number stored to only 1 (see the sketch below).
  3. This is a secondary goal at the moment, but it's something we may get for free using Finite State Transducers, as they often work both ways. I am sure there are a number of exceptions that would have to be stored in the lexicon. We need to see what trade-off we need to make, if any, when deciding on this point.
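To illustrate points 2 and 3, here is a toy rules-plus-exceptions generator. English plural is used only as a stand-in; a real implementation would be language-specific and likely FST-based, and none of these names are a proposed API:

    # Generate by rule where possible; fall back to a small exception lexicon.
    EXCEPTIONS = {("child", "PLURAL"): "children"}

    def pluralize_by_rule(lemma: str) -> str:
        # Crude default English rule, for illustration only.
        if lemma.endswith(("s", "x", "z", "ch", "sh")):
            return lemma + "es"
        return lemma + "s"

    def inflect(lemma: str, feature: str) -> str:
        return EXCEPTIONS.get((lemma, feature)) or pluralize_by_rule(lemma)

    assert inflect("house", "PLURAL") == "houses"
    assert inflect("child", "PLURAL") == "children"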

My use case for the library is:

  1. Take a message format message with placeholders. Placeholders have annotated case, e.g. VOC, SINGULAR.
  2. Take in the parameters from the user, e.g. Beograd (Belgrade), Rim (Rome)
  3. Look up grammatical info from the lexicon, e.g. Beograd -> masculine, inanimate, Rim -> masculine, inanimate
  4. Pass the parameters to our new API, inflect("Beograd", grammatical_info), same for Rim, or directly to message format (which would then automatically do the necessary call)
  5. Get the formatted message
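Steps 1-5 above, sketched end to end (the lexicon entries, the inflect() signature, and the placeholder syntax are all hypothetical here, just to show the flow):

    # Hypothetical lexicon lookup table (step 3).
    LEXICON = {
        "Beograd": {"gender": "masculine", "animacy": "inanimate"},
        "Rim":     {"gender": "masculine", "animacy": "inanimate"},
    }

    # Stand-in for the real generator (step 4); it only covers this one example.
    def inflect(word: str, grammatical_info: dict, case: str, number: str) -> str:
        if word == "Beograd" and case == "VOC" and number == "SINGULAR":
            return "Beograde"   # Serbian vocative singular
        return word

    # Steps 1-2: a message with an annotated placeholder and a user-supplied parameter.
    template = "{city:VOC:SINGULAR}, dobro došli!"
    city = "Beograd"

    # Steps 3-5: look up grammatical info, inflect, and format.
    info = LEXICON[city]
    message = template.replace("{city:VOC:SINGULAR}", inflect(city, info, "VOC", "SINGULAR"))
    print(message)  # Beograde, dobro došli!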

We can think of other scenarios, like building an index for search and asking for lemmatization, where your point 3) holds.

grhoten commented 4 months ago

It would be helpful to see a summary of what formats are available out there. Using a format that is compatible with the Unicode license is preferable.

Mihai did point out DMLex (https://docs.oasis-open.org/lexidma/dmlex/v1.0/csd02/dmlex-v1.0-csd02.html), which seems promising.

I like the idea of interoperability and leveraging existing repositories. On the other hand, I don't like restrictions on adding grammemes and other morphological information. For example, it's important to know the phonetic information or whether a word starts or ends with a vowel or consonant. That's important for grammatical agreement in several languages, including English, French and Korean. So if adding such information is difficult, I'd like to steer clear of such restrictions.

If a lexicon format can help sort a list of English adjectives correctly, that would be a strong format to consider, but it's not a requirement. Adjective order (https://en.wikipedia.org/wiki/Adjective#Order) in English is a helpful problem to solve.

Conceptually, I'd like the data structured so that a lemma is associated with all its surface forms, and each surface form is annotated with grammemes to differentiate it from the other surface forms under the lemma. As an example, Wiktionary has katt (https://en.wiktionary.org/wiki/katt#Swedish) in Swedish, and it has a well-annotated declension table and pronunciation.
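As a sketch of that shape, here is how an entry like katt might look; the field names and grammeme labels are illustrative only, and only the basic nominative forms are shown:

    # Lemma with its surface forms, each annotated with grammemes (illustrative schema).
    KATT = {
        "lemma": "katt",
        "pos": "noun",
        "gender": "common",            # Swedish en-word
        "pronunciation": "/katː/",     # approximate IPA
        "forms": [
            {"surface": "katt",     "grammemes": {"SINGULAR", "INDEFINITE"}},
            {"surface": "katten",   "grammemes": {"SINGULAR", "DEFINITE"}},
            {"surface": "katter",   "grammemes": {"PLURAL",   "INDEFINITE"}},
            {"surface": "katterna", "grammemes": {"PLURAL",   "DEFINITE"}},
        ],
    }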

I'm sure there will be discussions on what should be implicit and explicit in such structured lexicons, and I'd prefer to have that as a separate topic. I'd also like to defer the API and code discussion to a separate topic.

BrunoCartoni commented 4 months ago

Thanks Nebojša for sketching out the main use case, and thank you all for the interesting conversation!

Based on Nebojša's use case, here is a first draft of the requirements:

  1. A lexicon that will:
     a. store the main grammatical features of each entry
     b. (eventually) store all the forms of an entry with their grammatical features, especially for exceptions that are hard to handle with (2)

  2. A morphological generator that can generate the correct form in a placeholder, based on the grammatical features stored in the lexicon entry, and the specifications in the message.

So in Nebojša's examples: (1) will contain the grammatical information for each parameter (e.g. Beograd -> masculine, inanimate; Rim -> masculine, inanimate), and

(2) will generate the correct form in the message according to the specification in the template (e.g. VOC, SINGULAR),

and all the morphological or phonetic information mentioned by George Rhoten will be stored in (1).

Please let me know if we all agree on these first principles.

As for the internal structure of the lexicon, we can leverage the "lexical masks" we developed (introduced in https://aclanthology.org/2020.lrec-1.372.pdf), which are already used by Wikidata.

Bruno


nciric commented 4 months ago

Mihai did point out DMLex, which seems promising.

DMLex also sounds promising, thanks for linking.

I'm sure there will be discussions on what should be implicit and explicit in such structured lexicons, and I'd prefer to have that as a separate topic. I'd also like to defer the API and code discussion to a separate topic.

I opened #3 for discussing APIs (and use cases).