mbakeranalecta / sam

Semantic Authoring Markdown

Stemming support #140

Closed mbakeranalecta closed 6 years ago

mbakeranalecta commented 7 years ago

Stemming (reducing a word to a "stem" that is the same across all of its possible inflections) is useful for matching things like annotated terms to definitions, links, or index markers. Stemming is probably something that should be supported in the application layer of a system, but for structured content the main application-layer tool is XSLT, which has neither stemming support nor any easy way to access a stemming library.

For this reason, it might make sense to include some kind of stemming capability in the SAM toolkit itself, either as part of the parser or as a tool you could wrap around the parser before handing the content off to the rest of the tool chain.

mbakeranalecta commented 7 years ago

Since stemming is language-specific, and since there are various stemming methods, any form of stemming support needs to be user-selectable in some way.

It probably needs to be sensitive to human language as well; that is, it needs to pay attention to the in-scope language tag, if there is one.
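As a rough illustration of what "paying attention to the language tag" could look like, here is a minimal Python sketch. Everything in it is hypothetical: `STEMMERS`, `stemmer_for`, `stem_en`, and `stem_identity` are names made up for this example and are not part of SAM, and the English stemmer is a toy suffix stripper rather than a real algorithm such as Porter or Snowball.

```python
# Hypothetical sketch: none of these names come from SAM.

def stem_en(word):
    """Toy English stemmer: lowercase and strip a few common suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_identity(word):
    """Fallback for languages with no registered stemmer: lowercase only."""
    return word.lower()


# User-editable registry keyed by primary language subtag.
STEMMERS = {"en": stem_en}


def stemmer_for(lang_tag):
    """Pick a stemmer from the primary subtag of a tag such as 'en-CA'."""
    if not lang_tag:
        return stem_identity
    primary = lang_tag.replace("_", "-").split("-")[0].lower()
    return STEMMERS.get(primary, stem_identity)
```

With this arrangement, a user who wants a different method for a language replaces that language's entry in the registry, and content with no language tag falls back to plain case folding.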

mbakeranalecta commented 6 years ago

There are different ways to do stemming, and it does not seem appropriate for SAM to pick one at the language level. So I'm thinking the thing to do might be to put the comparison in a method that a program can override, allowing it to introduce stemming or any other means of lookup it likes.

mbakeranalecta commented 6 years ago

Made annotation lookup mode extensible. There are two built-in modes, 'case_sensitive' and 'case_insensitive'. A user can add other modes by writing a lookup function and adding it to the dictionary samparser.annotation_lookup_modes. This will allow the user to add stemming support if they want, or to do anything else, including overriding the built-in modes. This needs to be documented.
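To make the idea concrete, here is a hedged sketch of registering a stemming mode. It does not import samparser: the `annotation_lookup_modes` dictionary below is a local stand-in for `samparser.annotation_lookup_modes`, and the two-argument comparison signature, the `lookup` helper, `toy_stem`, and the 'stemmed' mode name are all assumptions made for illustration. The signature SAM actually expects from a lookup function is not shown in this thread, so check the documentation before adapting this.

```python
# Local stand-in for samparser.annotation_lookup_modes. The (a, b) -> bool
# signature is an assumption for this sketch, not SAM's confirmed API.
annotation_lookup_modes = {
    "case_sensitive": lambda a, b: a == b,
    "case_insensitive": lambda a, b: a.lower() == b.lower(),
}


def toy_stem(word):
    """Toy suffix stripper; a real stemmer library could be swapped in."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stemmed_match(annotated_text, candidate_text):
    """Compare two phrases word by word after stemming each word."""
    a = [toy_stem(w) for w in annotated_text.split()]
    b = [toy_stem(w) for w in candidate_text.split()]
    return a == b


# Register the new mode alongside the built-in ones.
annotation_lookup_modes["stemmed"] = stemmed_match


def lookup(text, known, mode):
    """Hypothetical helper: return the value of the first matching key."""
    compare = annotation_lookup_modes[mode]
    for key, value in known.items():
        if compare(text, key):
            return value
    return None


known = {"index marker": "term"}
print(lookup("Index markers", known, "stemmed"))
print(lookup("Index markers", known, "case_insensitive"))
```

Because the modes live in an ordinary dictionary, assigning to an existing key such as 'case_insensitive' would override that built-in mode in the same way.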

mbakeranalecta commented 6 years ago

Documented in 706d275f6e2339dc21e135c78032824d69783b1c.