mattico / elasticlunr-rs

A partial port of elasticlunr to Rust. Intended to be used for generating compatible search indices.
Apache License 2.0

Improve language API for more flexible tokenization #33

Closed mattico closed 2 years ago

mattico commented 3 years ago

Simplest Language trait:

pub trait Language {
    /// The name in English of the language
    const NAME: &'static str;
    /// The ISO 639-1 language code of the language
    const CODE: &'static str;
    /// Produces suitably simplified search tokens for inserting into the search index
    fn tokenize(&mut self, text: &str) -> Vec<String>;
    /// Returns a list of pipeline component names to be serialized with the index
    ///
    /// elasticlunr.js will use these component names to look up pipeline functions
    /// that were registered by `Pipeline.registerFunction`.
    fn pipeline(&self) -> Vec<String>;
}

To modify a built-in language, users would implement the trait on their own type using the building blocks (tokenizer, stemmer, stop word filter, etc.), which would still be exposed.
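
A rough sketch of what that could look like on the user side, assuming the trait above. `CustomEnglish`, its `extra_stop_words` field, and the inline tokenizer are made up for illustration and stand in for the exposed building blocks. The pipeline names assume the functions that elasticlunr.js registers by default.

```rust
/// The proposed trait, repeated here so the sketch compiles on its own.
pub trait Language {
    const NAME: &'static str;
    const CODE: &'static str;
    fn tokenize(&mut self, text: &str) -> Vec<String>;
    fn pipeline(&self) -> Vec<String>;
}

/// Hypothetical user type: English plus a user-supplied stop word list.
pub struct CustomEnglish {
    pub extra_stop_words: Vec<String>,
}

impl Language for CustomEnglish {
    const NAME: &'static str = "English";
    const CODE: &'static str = "en";

    fn tokenize(&mut self, text: &str) -> Vec<String> {
        // Stand-in for the exposed tokenizer: split on whitespace and hyphens,
        // strip surrounding punctuation, lowercase, then drop stop words.
        text.split(|c: char| c.is_whitespace() || c == '-')
            .map(|t| t.trim_matches(|c: char| !c.is_alphanumeric()).to_lowercase())
            .filter(|t| !t.is_empty() && !self.extra_stop_words.contains(t))
            .collect()
    }

    fn pipeline(&self) -> Vec<String> {
        // Assumed names; they must match what the JS side has registered.
        vec!["trimmer".into(), "stopWordFilter".into(), "stemmer".into()]
    }
}
```

With this design the Rust side does all token processing inside `tokenize`, and `pipeline` only reports names for the serialized index, so the two can drift apart if the implementer is not careful.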

Alternatively, expose the pipeline functions more directly:

pub trait Language {
    /// The name in English of the language
    const NAME: &'static str;
    /// The ISO 639-1 language code of the language
    const CODE: &'static str;
    /// Produces suitably simplified search tokens for inserting into the search index
    fn tokenize(&mut self, text: &str) -> Vec<String>;
    /// Returns a list of pipeline component functions to be executed sequentially on the tokens
    /// to simplify them for insertion into the search index
    fn pipeline(&self) -> Vec<&dyn PipelineFunction>;
}

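pub trait PipelineFunction {
    /// The name of the pipeline function, used to serialize the pipeline
    ///
    /// elasticlunr.js will use these names to look up pipeline functions
    /// that were registered by `Pipeline.registerFunction`.
    ///
    /// A method rather than an associated const, so the trait stays
    /// object-safe and can be used as `&dyn PipelineFunction`.
    fn name(&self) -> &'static str;
    /// Process a single token
    ///
    /// Takes `&self` so it can be called through the shared references
    /// returned by `Language::pipeline`.
    fn run(&self, token: String) -> String;
}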
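
A minimal sketch of how such a pipeline might be run and serialized. It assumes an object-safe form of the trait (a `name()` method instead of an associated const, and `&self` on `run`), since a trait with an associated const cannot be used as `dyn PipelineFunction`. `Lowercase`, `Trimmer`, `run_pipeline`, and `pipeline_names` are hypothetical names used only for illustration.

```rust
/// Object-safe variant of the trait, assumed for this sketch.
pub trait PipelineFunction {
    fn name(&self) -> &'static str;
    fn run(&self, token: String) -> String;
}

/// Hypothetical component: lowercases each token.
pub struct Lowercase;

impl PipelineFunction for Lowercase {
    fn name(&self) -> &'static str {
        "lowercase"
    }
    fn run(&self, token: String) -> String {
        token.to_lowercase()
    }
}

/// Hypothetical component: strips leading and trailing non-alphanumerics.
pub struct Trimmer;

impl PipelineFunction for Trimmer {
    fn name(&self) -> &'static str {
        "trimmer"
    }
    fn run(&self, token: String) -> String {
        token.trim_matches(|c: char| !c.is_alphanumeric()).to_string()
    }
}

/// Runs every component over every token, in pipeline order.
pub fn run_pipeline(pipeline: &[&dyn PipelineFunction], tokens: Vec<String>) -> Vec<String> {
    tokens
        .into_iter()
        .map(|token| pipeline.iter().fold(token, |t, f| f.run(t)))
        .collect()
}

/// Collects the component names that would be serialized with the index.
pub fn pipeline_names(pipeline: &[&dyn PipelineFunction]) -> Vec<String> {
    pipeline.iter().map(|f| f.name().to_string()).collect()
}
```

Here the same list drives both token processing and the serialized names, so they cannot get out of sync. Note that `run` returning `String` gives a component no way to drop a token, which a stop word filter would need.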
mattico commented 2 years ago

Fixed in #45