pisa-engine / pisa

PISA: Performant Indexes and Search for Academia
https://pisa-engine.github.io/pisa/book
Apache License 2.0

Implement TextAnalyzer #503

Closed elshize closed 1 year ago

elshize commented 1 year ago

A text analyzer consists of:

  1. zero or more text filters (e.g., strip HTML),
  2. a tokenizer,
  3. zero or more token filters (e.g., stemming).

For example, a common text analyzer for document content would consist of (1) a strip-HTML text filter, (2) the standard English tokenizer, and (3) three token filters: lowercase, stemmer, and stopword remover.
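The composition described above can be sketched as follows. This is only an illustration under stated assumptions: the type aliases, the `SketchAnalyzer` class, and `make_example_analyzer` are hypothetical names and are not the interface introduced by this PR. For brevity the tokenizer returns a vector rather than a lazy stream, and the example analyzer registers no text filter and no stemmer.

```cpp
#include <algorithm>
#include <cctype>
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical aliases for illustration; not the PISA API.
using TextFilter = std::function<std::string(std::string)>;
using Tokenizer = std::function<std::vector<std::string>(std::string const&)>;
using TokenFilter = std::function<std::vector<std::string>(std::string)>;

class SketchAnalyzer {
  public:
    explicit SketchAnalyzer(Tokenizer tokenizer) : m_tokenizer(std::move(tokenizer)) {}

    void add_text_filter(TextFilter filter) { m_text_filters.push_back(std::move(filter)); }
    void add_token_filter(TokenFilter filter) { m_token_filters.push_back(std::move(filter)); }

    // Runs the three stages in order: text filters, tokenizer, token filters.
    std::vector<std::string> analyze(std::string text) const {
        for (auto const& filter : m_text_filters) {
            text = filter(std::move(text));
        }
        std::vector<std::string> tokens = m_tokenizer(text);
        for (auto const& filter : m_token_filters) {
            std::vector<std::string> next;
            for (auto& token : tokens) {
                // A token filter may emit zero, one, or several tokens.
                for (auto& out : filter(std::move(token))) {
                    next.push_back(std::move(out));
                }
            }
            tokens = std::move(next);
        }
        return tokens;
    }

  private:
    std::vector<TextFilter> m_text_filters;
    Tokenizer m_tokenizer;
    std::vector<TokenFilter> m_token_filters;
};

// Builds a small analyzer: whitespace tokenizer, lowercase, drop the word "the".
SketchAnalyzer make_example_analyzer() {
    SketchAnalyzer analyzer([](std::string const& text) {
        std::istringstream in(text);
        std::vector<std::string> tokens;
        for (std::string token; in >> token;) {
            tokens.push_back(token);
        }
        return tokens;
    });
    analyzer.add_token_filter([](std::string token) {
        std::transform(token.begin(), token.end(), token.begin(), [](unsigned char c) {
            return static_cast<char>(std::tolower(c));
        });
        return std::vector<std::string>{std::move(token)};
    });
    analyzer.add_token_filter([](std::string token) {
        if (token == "the") {
            return std::vector<std::string>{};
        }
        return std::vector<std::string>{std::move(token)};
    });
    return analyzer;
}
```

The order of the token filters matters in this sketch: lowercasing runs before stopword removal, so "The" is removed as well.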

A text filter takes a string input and returns the transformed string. The only implemented text filter at the moment is the one stripping HTML markup.
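As a rough illustration of the string-to-string contract, here is a deliberately naive markup stripper. The function name `strip_markup_naive` is hypothetical, and this is not how the PR's HTML filter works: the sketch ignores comments, scripts, entities, and `>` characters inside attribute values.

```cpp
#include <string>
#include <string_view>

// Naive illustration only: drops everything between '<' and '>' and writes a
// single space in place of each tag so that adjacent words do not merge.
std::string strip_markup_naive(std::string_view input) {
    std::string output;
    output.reserve(input.size());
    bool inside_tag = false;
    for (char c : input) {
        if (c == '<') {
            inside_tag = true;
        } else if (c == '>' && inside_tag) {
            inside_tag = false;
            output.push_back(' ');
        } else if (!inside_tag) {
            output.push_back(c);
        }
    }
    return output;
}
```

Because the result is again a plain string, any number of such filters can be chained before tokenization.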

A tokenizer takes a string and returns a token stream. See TokenStream for the details. We currently implement English and whitespace tokenizers.
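A token stream can be thought of as a lazy sequence that yields one token per request. The sketch below shows a whitespace tokenizer in that style; the class name `WhitespaceTokenStreamSketch` and its `next()` method are assumptions made for illustration and do not reproduce the `TokenStream` interface from this PR.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical lazy stream: each call to next() yields the next token,
// or std::nullopt once the input is exhausted.
class WhitespaceTokenStreamSketch {
  public:
    explicit WhitespaceTokenStreamSketch(std::string input) : m_input(std::move(input)) {}

    std::optional<std::string> next() {
        auto is_space = [](char c) {
            return std::isspace(static_cast<unsigned char>(c)) != 0;
        };
        // Skip leading whitespace.
        while (m_pos < m_input.size() && is_space(m_input[m_pos])) {
            ++m_pos;
        }
        if (m_pos == m_input.size()) {
            return std::nullopt;
        }
        // Consume characters up to the next whitespace.
        std::size_t begin = m_pos;
        while (m_pos < m_input.size() && !is_space(m_input[m_pos])) {
            ++m_pos;
        }
        return m_input.substr(begin, m_pos - begin);
    }

  private:
    std::string m_input;
    std::size_t m_pos = 0;
};
```

Producing tokens on demand avoids materializing all tokens of a long document at once, which is one plausible reason to prefer a stream over a vector.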

A token filter takes a single token and returns a token stream. The returned stream may contain exactly one token (a 1-to-1 transformation, such as lowercasing), no tokens (e.g., stopword removal), or multiple tokens. None of our currently implemented filters returns multiple tokens, but in the future we can consider implementing filters that do some term expansion, like synonyms.
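The three possible output sizes can be illustrated with the sketch below. All names are hypothetical, a `std::vector` stands in for the token stream, and the synonym filter is purely illustrative, since no such filter is implemented in this PR.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using Tokens = std::vector<std::string>;

// 1-to-1: always returns exactly one token.
Tokens lowercase_filter(std::string token) {
    std::transform(token.begin(), token.end(), token.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return {std::move(token)};
}

// 0-or-1: returns nothing when the token is a stopword.
Tokens stopword_filter(std::string token, std::unordered_set<std::string> const& stopwords) {
    if (stopwords.count(token) > 0) {
        return {};
    }
    return {std::move(token)};
}

// 1-to-many (hypothetical): returns the token followed by its synonyms.
Tokens synonym_filter(std::string token,
                      std::unordered_map<std::string, Tokens> const& synonyms) {
    Tokens out;
    auto it = synonyms.find(token);
    out.push_back(std::move(token));
    if (it != synonyms.end()) {
        out.insert(out.end(), it->second.begin(), it->second.end());
    }
    return out;
}
```

Giving every filter the same "one token in, sequence out" shape is what lets filters be chained without special cases for removal or expansion.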

A text analyzer is used for parsing both queries and documents. It runs after the content part has already been extracted, either from a colon-delimited query string or from a document input format such as TREC.
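To make the extraction step concrete, the following sketch separates the content of a query line from an optional identifier. It assumes that the first colon separates the identifier from the content; the `RawQuery` struct and the `split_query_line` function are hypothetical and are not taken from the PISA code base.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical container for a query line split into its two parts.
struct RawQuery {
    std::optional<std::string> id;
    std::string content;
};

// Assumption: everything before the first ':' is the ID, the rest is content.
// A line without a colon is treated as content only.
RawQuery split_query_line(std::string_view line) {
    auto colon = line.find(':');
    if (colon == std::string_view::npos) {
        return {std::nullopt, std::string(line)};
    }
    return {std::string(line.substr(0, colon)), std::string(line.substr(colon + 1))};
}
```

Only the `content` field would then be handed to the analyzer, so the identifier is never tokenized or filtered.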


Fixes #494


There are still some outstanding items to do before merging:

codecov[bot] commented 1 year ago

Codecov Report

Base: 92.82% // Head: 93.00% // Increases project coverage by +0.17% :tada:

Coverage data is based on head (77e8a3c) compared to base (6ddfabc). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #503      +/-   ##
==========================================
+ Coverage   92.82%   93.00%   +0.17%
==========================================
  Files          91       90       -1
  Lines        4294     4476     +182
==========================================
+ Hits         3986     4163     +177
- Misses        308      313       +5
```

| [Impacted Files](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine) | Coverage Δ | |
|---|---|---|
| [include/pisa/forward\_index\_builder.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL2ZvcndhcmRfaW5kZXhfYnVpbGRlci5ocHA=) | `100.00% <ø> (ø)` | |
| [include/pisa/cursor/max\_scored\_cursor.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL2N1cnNvci9tYXhfc2NvcmVkX2N1cnNvci5ocHA=) | `96.29% <100.00%> (-3.71%)` | :arrow_down: |
| [include/pisa/scorer/dph.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL3Njb3Jlci9kcGguaHBw) | `100.00% <100.00%> (+100.00%)` | :arrow_up: |
| [include/pisa/scorer/pl2.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL3Njb3Jlci9wbDIuaHBw) | `100.00% <100.00%> (+100.00%)` | :arrow_up: |
| [include/pisa/scorer/qld.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL3Njb3Jlci9xbGQuaHBw) | `100.00% <100.00%> (ø)` | |
| [include/pisa/text\_analyzer.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL3RleHRfYW5hbHl6ZXIuaHBw) | `100.00% <100.00%> (ø)` | |
| [include/pisa/cursor/block\_max\_scored\_cursor.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL2N1cnNvci9ibG9ja19tYXhfc2NvcmVkX2N1cnNvci5ocHA=) | `77.41% <0.00%> (-9.25%)` | :arrow_down: |
| [include/pisa/wand\_data\_range.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL3dhbmRfZGF0YV9yYW5nZS5ocHA=) | `83.33% <0.00%> (-6.53%)` | :arrow_down: |
| [include/pisa/filesystem.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL2ZpbGVzeXN0ZW0uaHBw) | `80.00% <0.00%> (-5.72%)` | :arrow_down: |
| [include/pisa/query/term\_processor.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL3F1ZXJ5L3Rlcm1fcHJvY2Vzc29yLmhwcA==) | `95.65% <0.00%> (-4.35%)` | :arrow_down: |
| ... and [78 more](https://codecov.io/gh/pisa-engine/pisa/pull/503?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine) | | |
