quickwit-oss / tantivy

Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust
MIT License

Token docs do not match Tokenizer behavior #939

Closed cormac-obrien closed 3 years ago

cormac-obrien commented 3 years ago

Describe the bug

The docs for Token specify that the offsets range should be half-open:

The text that generated the token should be obtained by &text[token.offset_from..token.offset_to].

However, the SimpleTokenizer generates closed offset intervals:

original text: "The Old Man and the Sea"
tokens:
[
    Token { offset_from: 0, offset_to: 3, position: 0, text: "The", position_length: 1 },
    Token { offset_from: 4, offset_to: 7, position: 1, text: "Old", position_length: 1 },
    Token { offset_from: 8, offset_to: 11, position: 2, text: "Man", position_length: 1 },
    Token { offset_from: 12, offset_to: 15, position: 3, text: "and", position_length: 1 },
    Token { offset_from: 16, offset_to: 19, position: 4, text: "the", position_length: 1 },
    Token { offset_from: 20, offset_to: 23, position: 5, text: "Sea", position_length: 1 }
]

Which version of tantivy are you using? master (6d4b982)

To Reproduce

The above SimpleTokenizer output is taken from cargo run --bin pre_tokenized_text.

fulmicoton commented 3 years ago

I don't see any problem in the example above. Can you be more specific about what you would have expected?

cormac-obrien commented 3 years ago

Ah, I realized I was thinking of the positions in the text after the spaces had been removed! There's no actual issue, closing.
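For readers landing here later: the tokenizer output above does match the documented half-open convention. A minimal standalone check (plain Rust, not using tantivy itself, with the offsets copied from the output above) shows that slicing the original text with `offset_from..offset_to` recovers each token exactly:

```rust
fn main() {
    let text = "The Old Man and the Sea";
    // (offset_from, offset_to) pairs from the SimpleTokenizer output above.
    let offsets = [(0, 3), (4, 7), (8, 11), (12, 15), (16, 19), (20, 23)];
    let expected = ["The", "Old", "Man", "and", "the", "Sea"];
    for (&(from, to), &word) in offsets.iter().zip(expected.iter()) {
        // Rust range indexing is half-open: `from` is included, `to` is excluded,
        // so e.g. text[0..3] is "The" (3 bytes), exactly as the Token docs state.
        assert_eq!(&text[from..to], word);
    }
}
```

The confusion in the original report came from reading `offset_to: 3` as the index of the last character of "The" (a closed interval), when it is actually the index one past it.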