rhdunn / cainteoir-engine

The Cainteoir Text-to-Speech core engine
http://reecedunn.co.uk/cainteoir/
GNU General Public License v3.0

Support part of speech tagging for words #45

Open rhdunn opened 11 years ago

rhdunn commented 11 years ago

Part of speech tagging is the process of associating a word with the part of speech it is categorised as (noun, verb, adverb, etc.).

The tagging algorithm should be:

  1. If the word is in a partofspeech.dict dictionary, it has the part of speech from that dictionary.
  2. If the word matches a suffix in suffixes.dict, it has the part of speech from that dictionary.
  3. If the word has not been matched, it is tagged as a noun.

This algorithm handles false positives from suffixes.dict: because the partofspeech.dict lookup runs first, adding the affected word to partofspeech.dict overrides the suffix match.
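As a rough sketch of the three steps above (not the engine's implementation), the lookup order could look like this in Python. The two dictionaries are hypothetical in-memory stand-ins for partofspeech.dict and suffixes.dict, whose file formats are not specified here, and trying the longest suffix first is an assumption.

```python
# Hypothetical stand-ins for partofspeech.dict and suffixes.dict.
PART_OF_SPEECH = {"king": "noun"}          # overrides the "-ing" suffix rule
SUFFIXES = {"ing": "verb", "ly": "adverb", "ness": "noun"}


def tag_word(word, words=PART_OF_SPEECH, suffixes=SUFFIXES):
    """Return a single part of speech tag for `word`."""
    word = word.lower()

    # Step 1: an exact dictionary entry always wins.
    if word in words:
        return words[word]

    # Step 2: fall back to suffix matching (longest suffix first; assumption).
    for suffix in sorted(suffixes, key=len, reverse=True):
        if len(word) > len(suffix) and word.endswith(suffix):
            return suffixes[suffix]

    # Step 3: anything unmatched is tagged as a noun.
    return "noun"
```

Here "walking" is tagged as a verb by the suffix rule, while "king" would be a false positive for that rule and is corrected by its dictionary entry.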

The parts of speech used should be described in a SKOS vocabulary (data/partsofspeech.rdf) and form a consistent taxonomy.

It should be possible to check the tagging against a manually tagged corpus -- preferably a freely available/usable one.
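One way such a check might work is to compare the tagger's output with the corpus tag for each word and report the accuracy together with the mismatches. The sketch below assumes the corpus has already been loaded as (word, tag) pairs; the function name, the corpus shape, and the toy tagger are all hypothetical.

```python
def tagging_accuracy(tagger, tagged_corpus):
    """Compare `tagger(word)` against each expected tag in the corpus.

    Returns (accuracy, mismatches), where each mismatch is a
    (word, expected, actual) tuple.
    """
    total = 0
    correct = 0
    mismatches = []
    for word, expected in tagged_corpus:
        total += 1
        actual = tagger(word)
        if actual == expected:
            correct += 1
        else:
            mismatches.append((word, expected, actual))
    accuracy = correct / total if total else 0.0
    return accuracy, mismatches


# Toy example: a tagger that calls everything a noun.
def noun_tagger(word):
    return "noun"


corpus = [("dog", "noun"), ("run", "verb"), ("cat", "noun"), ("quickly", "adverb")]
score, misses = tagging_accuracy(noun_tagger, corpus)
```

The mismatch list is the more useful output in practice, since each entry is a candidate for a new partofspeech.dict or suffixes.dict entry.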

It should be possible to associate a word with more than one tag (e.g. read and lead). This should feed into a disambiguation step that looks at the structure of the sentence.
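To illustrate how multiple tags could feed a disambiguation step, here is a deliberately small sketch. The tag names, the candidate table, and the two context rules (a determiner is followed by a noun, "to" is followed by a verb) are illustrative assumptions, not the rules this issue proposes.

```python
# Hypothetical candidate tags; a word may carry more than one.
CANDIDATE_TAGS = {
    "lead": ("noun", "verb"),
    "the": ("determiner",),
    "to": ("particle",),
}


def candidate_tags(word):
    """All tags a word may have; unknown words default to noun."""
    return CANDIDATE_TAGS.get(word.lower(), ("noun",))


def disambiguate(words):
    """Pick one tag per word using the previously chosen tag as context."""
    result = []
    previous_tag = None
    for word in words:
        tags = candidate_tags(word)
        if len(tags) == 1:
            tag = tags[0]
        elif previous_tag == "determiner" and "noun" in tags:
            tag = "noun"   # e.g. "the lead"
        elif previous_tag == "particle" and "verb" in tags:
            tag = "verb"   # e.g. "to lead"
        else:
            tag = tags[0]  # no rule applies: keep the first candidate
        result.append((word, tag))
        previous_tag = tag
    return result
```

With this table, "the lead" resolves "lead" to a noun and "to lead" resolves it to a verb.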

The part of speech tag will then be used to differentiate word pronunciations.
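A minimal sketch of that lookup, assuming pronunciations are keyed by (word, tag) with a per-word default for when no tag is available. The table layout and function name are hypothetical; the IPA strings are the usual pronunciations of "lead" as a noun (the metal) and as a verb.

```python
# Hypothetical pronunciation tables (IPA).
TAGGED_PRONUNCIATIONS = {
    ("lead", "noun"): "lɛd",
    ("lead", "verb"): "liːd",
}
DEFAULT_PRONUNCIATIONS = {
    "lead": "liːd",
}


def pronounce(word, tag=None):
    """Prefer the tag-specific pronunciation, then the word's default."""
    key = word.lower()
    return TAGGED_PRONUNCIATIONS.get((key, tag), DEFAULT_PRONUNCIATIONS.get(key))
```

Passing `tag=None` models running without a tagger, in which case only the default entry is used.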

The part of speech tagger should be an independent step, along with the sentence/grammar analysis step. As such, they should be optional.

The tagger should support light analysis to disambiguate words, as eSpeak does, but behaviour such as eSpeak's "verb follows" handling should be expressed via part of speech rules (e.g. marking the word as an adverb).
