pisa-engine / pisa

PISA: Performant Indexes and Search for Academia
https://pisa-engine.github.io/pisa/book
Apache License 2.0

Whitespace tokenizer #496

Closed elshize closed 1 year ago

elshize commented 2 years ago

Implements a whitespace tokenizer alongside the existing term tokenizer. The existing tokenizer is renamed to EnglishTokenizer, since it contains English-specific rules such as possessives. Both tokenizers are now organized into a class hierarchy with a common virtual interface, so that they can be used interchangeably in the future and so that more tokenizers can be added later.

Related to #494
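As a rough illustration of the hierarchy described above, here is a minimal sketch of two tokenizers behind a common virtual interface. Every name in it (`Tokenizer`, `tokenize`, `WhitespaceTokenizer`, `count_tokens`, `demo`) is an assumption made for this example and does not reflect the actual contents of `include/pisa/tokenizer.hpp`. The `EnglishTokenizer` here is a simplified stand-in that only strips a trailing `'s`, not the real rule set.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical common interface (illustrative; not the PISA API).
class Tokenizer {
  public:
    virtual ~Tokenizer() = default;
    [[nodiscard]] virtual std::vector<std::string> tokenize(std::string_view text) const = 0;
};

// Splits on runs of whitespace and applies no language-specific rules.
class WhitespaceTokenizer : public Tokenizer {
  public:
    [[nodiscard]] std::vector<std::string> tokenize(std::string_view text) const override {
        std::vector<std::string> tokens;
        std::size_t pos = 0;
        while (pos < text.size()) {
            // Skip leading whitespace, then consume one token.
            while (pos < text.size() && is_space(text[pos])) { ++pos; }
            std::size_t const start = pos;
            while (pos < text.size() && !is_space(text[pos])) { ++pos; }
            if (pos > start) { tokens.emplace_back(text.substr(start, pos - start)); }
        }
        return tokens;
    }

  private:
    static bool is_space(char c) { return std::isspace(static_cast<unsigned char>(c)) != 0; }
};

// Simplified stand-in for English-specific rules: whitespace split,
// then drop a trailing possessive "'s" from each token.
class EnglishTokenizer : public Tokenizer {
  public:
    [[nodiscard]] std::vector<std::string> tokenize(std::string_view text) const override {
        auto tokens = WhitespaceTokenizer{}.tokenize(text);
        for (auto& token : tokens) {
            if (token.size() > 2 && token.compare(token.size() - 2, 2, "'s") == 0) {
                token.erase(token.size() - 2);
            }
        }
        return tokens;
    }
};

// Callers depend only on the interface, so tokenizers are interchangeable.
inline std::size_t count_tokens(Tokenizer const& tokenizer, std::string_view text) {
    return tokenizer.tokenize(text).size();
}

// Small usage demonstration: the same call site works with either tokenizer.
inline void demo() {
    std::unique_ptr<Tokenizer> tokenizer = std::make_unique<WhitespaceTokenizer>();
    assert(count_tokens(*tokenizer, "the dog's bone") == 3);
    tokenizer = std::make_unique<EnglishTokenizer>();
    assert(tokenizer->tokenize("the dog's bone")[1] == "dog");
}
```

With this shape, adding another tokenizer only requires a new subclass. Code that accepts a `Tokenizer const&` or a `std::unique_ptr<Tokenizer>` does not need to change.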

codecov[bot] commented 2 years ago

Codecov Report

Base: 92.87% // Head: 92.84% // Decreases project coverage by -0.03% :warning:

Coverage data is based on head (5e529fa) compared to base (78dbc6b). Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #496      +/-   ##
==========================================
- Coverage   92.87%   92.84%   -0.04%
==========================================
  Files          92       92
  Lines        4351     4332      -19
==========================================
- Hits         4041     4022      -19
  Misses        310      310
```

| [Impacted Files](https://codecov.io/gh/pisa-engine/pisa/pull/496?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine) | Coverage Δ | |
|---|---|---|
| [include/pisa/query/query\_stemmer.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/496/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL3F1ZXJ5L3F1ZXJ5X3N0ZW1tZXIuaHBw) | `100.00% <100.00%> (ø)` | |
| [include/pisa/tokenizer.hpp](https://codecov.io/gh/pisa-engine/pisa/pull/496/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=pisa-engine#diff-aW5jbHVkZS9waXNhL3Rva2VuaXplci5ocHA=) | `100.00% <100.00%> (ø)` | |
