neelsmith / tabulae

A build system for Latin morphological parsers

Add a LemmatizedCorpus class modelled on script in Livy repository #147

Closed: neelsmith closed this issue 5 years ago

neelsmith commented 5 years ago

But better than the script. Given an analyzable, univocal OHCO2 Corpus:

  1. use a MidOrthography to tokenize it, and filter for lexical tokens
  2. generate a unique word list
  3. parse the word list
  4. map tokens to lemmata
  5. map tokens to analyses
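
The five steps above can be sketched roughly as follows. This is only an illustration of the data flow, written in Python rather than the library's own language. Everything in it is hypothetical: `tokenize` stands in for a MidOrthography tokenizer, `TOY_LEXICON` and `parse` stand in for a parser built with tabulae, and the passage identifiers are simplified placeholders rather than real CTS URNs. None of these names are part of the tabulae or OHCO2 APIs.

```python
import re
from typing import Callable, NamedTuple


class Token(NamedTuple):
    urn: str       # hypothetical identifier: passage reference + token index
    text: str
    category: str  # "lexical" or "punctuation"


class Analysis(NamedTuple):
    lemma: str
    form: str


def tokenize(urn: str, text: str) -> list[Token]:
    """Stand-in for an orthography-driven tokenizer: letter runs vs. punctuation."""
    tokens = []
    for i, m in enumerate(re.finditer(r"[^\W\d_]+|[^\w\s]", text), start=1):
        piece = m.group()
        category = "lexical" if piece.isalpha() else "punctuation"
        tokens.append(Token(f"{urn}.{i}", piece, category))
    return tokens


# Toy lexicon standing in for a real morphological parser (assumption).
TOY_LEXICON = {
    "urbem": [Analysis("urbs", "noun: acc. sg.")],
    "Romam": [Analysis("Roma", "noun: acc. sg.")],
    "reges": [
        Analysis("rex", "noun: nom. pl."),
        Analysis("rex", "noun: acc. pl."),
        Analysis("rego", "verb: 2nd sg. fut. act. ind."),
    ],
    "habuere": [Analysis("habeo", "verb: 3rd pl. perf. act. ind.")],
    "condidere": [Analysis("condo", "verb: 3rd pl. perf. act. ind.")],
}


def parse(word: str) -> list[Analysis]:
    """Return every analysis the toy lexicon knows for a surface form."""
    return TOY_LEXICON.get(word, [])


def lemmatize(corpus: list[tuple[str, str]], parser: Callable[[str], list[Analysis]]):
    # 1. tokenize, keeping only lexical tokens
    lexical = [
        t for urn, text in corpus for t in tokenize(urn, text)
        if t.category == "lexical"
    ]
    # 2. unique word list
    word_list = sorted({t.text for t in lexical})
    # 3. parse each distinct word once
    parsed = {w: parser(w) for w in word_list}
    # 4. map tokens to lemmata (a token may have more than one candidate lemma)
    token_lemmata = {t.urn: sorted({a.lemma for a in parsed[t.text]}) for t in lexical}
    # 5. map tokens to full analyses
    token_analyses = {t.urn: parsed[t.text] for t in lexical}
    return word_list, token_lemmata, token_analyses


CORPUS = [("1.1", "urbem Romam reges habuere."), ("1.2", "reges urbem condidere.")]
word_list, token_lemmata, token_analyses = lemmatize(CORPUS, parse)
print(token_lemmata["1.1.3"])  # ['rego', 'rex']
```

Parsing the unique word list once and then joining the results back to tokens means each distinct form is analyzed a single time, however often it occurs in the corpus.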

neelsmith commented 5 years ago

Or compare the current work in the ocre-texts repository, and its use of the FormulaUnit class and object.

neelsmith commented 5 years ago

Not doing this in this library: see https://github.com/neelsmith/latin-corpus