Open jobergum opened 1 year ago
I was trying to split text into sentences using the OpenNLP sentence detector (available as a tokenizer through Lucene Linguistics) and then pass the resulting sentences on for embedding, but the embedder receives the original text value instead.
```
indexing {
    "en" | set_language;
    input doc_body | tokenize | to_array | embed e5smallv2 | attribute | index;
}
```
The main idea was that with Lucene analyzers it is possible to split text in various ways, limit the length, etc.
Also, it is not possible to do a trick like `doc_field_with_log_text` -> `synthetic_field_with_sentences` -> `synthetic_field_embedded`.
It's hard to understand your comment: is it a bug report, a feature request, or something else? I think this is a suggestion to add a `to_array` that does the splitting?
The main idea is that I've tried to use Lucene Linguistics to chunk the text, but I've learned that this is not supported. If it were supported, Lucene Linguistics would make it possible to chunk text into, for example, sentences that are no longer than 50 words.
`to_array` is here only because Vespa wouldn't complain. Hope this helps.
Yes, thanks!
It would be great to represent more advanced text splitting for embedding models in the Vespa indexing language than the current regular expression support:

```
split <regex>
```
The above demonstrates today's support. It would be great to have a target max length for each split and to offer different types of text splitters. The user can also use `array<string>` and perform the splitting outside of Vespa. Still, I think we should offer better out-of-the-box support than simple regular expression splitting without any length limitation.
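To illustrate the "split outside of Vespa" path, here is a minimal Python sketch that chunks text into sentence-like pieces with a target max length in words (the 50-word limit mentioned above). It uses a naive regex sentence splitter for illustration; a real pipeline would use something like the OpenNLP sentence detector, and the function name and parameters are hypothetical, not part of any Vespa API. The resulting list could then be fed into an `array<string>` field.

```python
import re


def chunk_sentences(text: str, max_words: int = 50) -> list[str]:
    """Split text into sentence-like chunks, packing consecutive
    sentences together so that each chunk holds at most max_words words.
    A single sentence longer than max_words is hard-cut at the limit."""
    # Naive sentence boundary: whitespace following ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        # Flush the current chunk if adding this sentence would overflow it.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
        # Hard-cut an over-long sentence into max_words-sized pieces.
        while len(current) > max_words:
            chunks.append(" ".join(current[:max_words]))
            current = current[max_words:]
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each chunk can then be embedded individually, which is exactly what the indexing-language feature discussed here would do server-side.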