Codecov report: base coverage 92.82%, head coverage 93.00% (project coverage increases by +0.17%). Coverage data is based on head (77e8a3c) compared to base (6ddfabc). Patch coverage: 100.00% of modified lines in this pull request are covered.
A text analyzer consists of:
- text filters,
- a tokenizer,
- token filters.
For example, a common text analyzer for document content would consist of (1) a strip-HTML filter, (2) a standard English tokenizer, and (3) a lowercase filter, a stemmer, and a stopword remover.
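To make the chain above concrete, here is a minimal, self-contained sketch of those three stages applied to a string. It uses only the standard library; the naive regex-based HTML stripping and whitespace splitting are stand-ins for the real filters and tokenizer, not the interfaces introduced in this PR.

```cpp
#include <algorithm>
#include <cctype>
#include <iostream>
#include <regex>
#include <sstream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    std::string input = "<p>The Quick Brown Foxes</p>";

    // (1) Text filter: strip HTML markup (naive regex, for illustration only).
    std::string text = std::regex_replace(input, std::regex("<[^>]*>"), " ");

    // (2) Tokenizer: split on whitespace (a standard English tokenizer is smarter).
    std::vector<std::string> tokens;
    std::istringstream stream(text);
    for (std::string token; stream >> token;) {
        tokens.push_back(token);
    }

    // (3) Token filters: lowercase, then drop stopwords (stemming omitted here).
    std::unordered_set<std::string> stopwords{"the", "a", "of"};
    std::vector<std::string> analyzed;
    for (auto& token : tokens) {
        std::transform(token.begin(), token.end(), token.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (stopwords.count(token) == 0) {
            analyzed.push_back(token);
        }
    }

    for (auto const& token : analyzed) {
        std::cout << token << '\n';  // prints: quick, brown, foxes
    }
}
```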
A text filter takes a string input and returns a transformed string. The only text filter implemented at the moment is the one that strips HTML markup.
A tokenizer takes a string and returns a token stream; see TokenStream for the details. We currently implement English and whitespace tokenizers.

A token filter takes a single token and returns a token stream. The filter can return a stream containing a single token (a 1-to-1 transformation), no tokens at all (e.g., stopword removal), or multiple tokens. None of our currently implemented filters returns multiple tokens, but in the future we could consider filters that do some term expansion, such as synonyms.
A text analyzer is used for parsing both queries and documents, after the content has already been extracted (either from a colon-delimited query string or from a document input format, such as TREC).
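For the query side, a small self-contained sketch of that extraction step follows: splitting an optional id prefix off a colon-delimited query line before the content is handed to the analyzer. The exact query format and the `parse_query_line` helper are assumptions for illustration, not the code in this PR.

```cpp
#include <iostream>
#include <optional>
#include <string>
#include <string_view>

struct ParsedQuery {
    std::optional<std::string> id;
    std::string content;
};

// Split "id:content" into its parts; a line without a colon is all content.
ParsedQuery parse_query_line(std::string_view line) {
    auto colon = line.find(':');
    if (colon == std::string_view::npos) {
        return {std::nullopt, std::string(line)};
    }
    return {std::string(line.substr(0, colon)), std::string(line.substr(colon + 1))};
}

int main() {
    auto query = parse_query_line("q42:hello world");
    std::cout << query.id.value_or("<no id>") << '\n';  // q42
    std::cout << query.content << '\n';                 // hello world
    // In the real pipeline, the text analyzer would now run on query.content.
}
```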
Fixes #494
There are still some outstanding items to do before merging: