Annif: Research Analyzers needed for pre-processing

scherztc commented 1 year ago

hortongn commented 1 year ago

Annif Analyzer Shootout: Comparing text lemmatization methods for automated subject indexing https://journal.code4lib.org/articles/16719

hortongn commented 1 year ago

The Code4Lib article Sean found on Annif analyzers: https://journal.code4lib.org/articles/16719

haitzlm commented 1 year ago

- SnowballStemmer: This analyzer uses the Snowball stemming algorithm to reduce words to their stems. For example, it converts "running" to "run" and "jogging" to "jog".
- StopwordsFilter: This analyzer removes common stopwords from the text, such as "the", "and", "a", and "an". This can help to reduce noise in the input data and improve the performance of the machine learning algorithm.
- LowerCaseFilter: This analyzer converts all text to lowercase, so that case variants of the same word (such as "Library" and "library") are treated as the same token by the machine learning algorithm.
- NgramVectorizer: This analyzer extracts n-grams from the text, where an n-gram is a sequence of n consecutive words. For example, if n=2, the analyzer would extract pairs of adjacent words from the text.
- WhitespaceTokenizer: This analyzer tokenizes the text based on whitespace, such as spaces and tabs. This can be useful for languages that use whitespace to delimit words, such as English.
- HTMLCleaner: This analyzer removes HTML tags from the text, which can be useful when working with web data.
- LanguageIdentifier: This analyzer attempts to identify the language of the text based on its content. This can be useful when working with multilingual datasets.
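
To make the steps above concrete, here is a rough sketch of how HTML cleaning, lowercasing, whitespace tokenization, stopword removal, and n-gram extraction could be chained using only the Python standard library. This is not Annif's API: the names `STOPWORDS`, `strip_html`, and `preprocess` are made up for illustration, and the stopword list is just the four words mentioned above.

```python
from html.parser import HTMLParser

# Hypothetical stopword list for illustration; real lists are much longer.
STOPWORDS = {"the", "and", "a", "an"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return the text content of an HTML fragment, without tags."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join(parser.parts)


def preprocess(text, n=2):
    """Return (tokens, ngrams) for the given text.

    Steps: strip HTML tags, lowercase, split on whitespace,
    drop stopwords, then build word n-grams from what is left.
    Punctuation is not handled and stays attached to tokens.
    """
    tokens = strip_html(text).lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    ngrams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return tokens, ngrams


if __name__ == "__main__":
    print(preprocess("<p>The Cat and a Dog</p>"))
```

Stemming and language identification are left out here because they would need a third-party library (for example a Snowball implementation) rather than the standard library.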

Stanza: Stanza is an NLP library developed by the Stanford NLP group. It provides a range of analyzers for tasks such as tokenization, sentence splitting, part-of-speech tagging, named entity recognition, and dependency parsing. Stanza is available for several languages, including English, Chinese, Arabic, and many others.
UDPipe: UDPipe is an NLP pipeline developed at the Institute of Formal and Applied Linguistics (ÚFAL) at Charles University. It performs tokenization, sentence splitting, part-of-speech tagging, lemmatization, and dependency parsing. UDPipe is trainable, so models for additional languages can be trained from Universal Dependencies treebanks.
