vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Add more ways to split text to Vespa's indexing language for embedding models #27228

Open jobergum opened 1 year ago

jobergum commented 1 year ago

It would be great if the Vespa indexing language could express more advanced text splitting for embedding models than the current regular expression support, split <regex>:

document doc {
  field title type string {..}
  field text type string {..}
}
field embedding type tensor<float>(p{}, x[768]) {
  indexing {
    input text | split "," |
      for_each {
        "passage: " . (input title || "") . " " . ( _ || "")
      } | embed e5 | attribute
  }
}

The above demonstrates today's support. It would be great to have a target max length for each split and to support different types of text splitters. Users can also use an array<string> field and perform the splitting outside of Vespa. Still, I think we should offer better out-of-the-box support than simple regular expression splitting without any length limitations.
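For reference, the array<string> route can be implemented with a custom document processor that runs before indexing. Below is a minimal sketch, assuming hypothetical field names text and chunks and a naive regex sentence splitter; the class name, the 512-character budget, and the required services.xml wiring are all assumptions, not an existing Vespa component.

package com.example;

import com.yahoo.docproc.DocumentProcessor;
import com.yahoo.docproc.Processing;
import com.yahoo.document.Document;
import com.yahoo.document.DocumentOperation;
import com.yahoo.document.DocumentPut;
import com.yahoo.document.datatypes.Array;
import com.yahoo.document.datatypes.StringFieldValue;

// Sketch: split the "text" field into sentence-like pieces and write them to
// an array<string> field "chunks", which the schema can then feed through
// for_each { ... } | embed | attribute.
public class ChunkingProcessor extends DocumentProcessor {

    private static final int MAX_CHUNK_CHARS = 512; // assumed length budget

    @Override
    public Progress process(Processing processing) {
        for (DocumentOperation op : processing.getDocumentOperations()) {
            if (!(op instanceof DocumentPut put)) continue;
            Document doc = put.getDocument();
            StringFieldValue text = (StringFieldValue) doc.getFieldValue("text");
            if (text == null) continue;

            Array<StringFieldValue> chunks =
                    new Array<>(doc.getField("chunks").getDataType());
            for (String piece : text.getString().split("(?<=[.!?])\\s+")) {
                if (piece.length() > MAX_CHUNK_CHARS)   // hard cap per chunk
                    piece = piece.substring(0, MAX_CHUNK_CHARS);
                chunks.add(new StringFieldValue(piece));
            }
            doc.setFieldValue("chunks", chunks);
        }
        return Progress.DONE;
    }
}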

dainiusjocas commented 7 months ago

I was trying to split text into sentences using the OpenNLP sentence detector (available as a tokenizer through Lucene Linguistics) and then pass the resulting sentences to the embedder, but the embedder takes the original text value instead:

indexing {
    "en" | set_language;
    input doc_body | tokenize | to_array | embed e5smallv2 | attribute | index;
}

The main idea was that with Lucene analyzers it is possible to split text in various ways, limit chunk length, etc.

Also, it is not possible to do a trick like doc_field_with_long_text -> synthetic_field_with_sentences -> synthetic_field_embedded.
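For context, the OpenNLP sentence detection referred to above looks roughly like this when called through the plain OpenNLP API (a sketch, assuming a pre-trained sentence model downloaded to en-sent.bin; the Lucene Linguistics wiring that exposes it as a tokenizer is not shown):

import java.io.FileInputStream;
import java.io.InputStream;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceSplitDemo {
    public static void main(String[] args) throws Exception {
        // Load a pre-trained OpenNLP sentence model from disk (assumed path).
        try (InputStream in = new FileInputStream("en-sent.bin")) {
            SentenceModel model = new SentenceModel(in);
            SentenceDetectorME detector = new SentenceDetectorME(model);
            String text = "Vespa indexes documents. It can also embed them.";
            for (String sentence : detector.sentDetect(text)) {
                System.out.println(sentence);   // one detected sentence per line
            }
        }
    }
}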

jobergum commented 7 months ago

It's hard to tell from your comment whether this is a bug report, a feature request, or something else.

bratseth commented 7 months ago

I think this is a suggestion to add a to_array which does the splitting?

dainiusjocas commented 7 months ago

The main idea is that I tried to use Linguistics to chunk the text, but learned that this is not supported. If it were supported, then with Lucene Linguistics it would be possible to chunk text into, e.g., sentences that are no longer than 50 words (sketched below).

to_array is there only so that Vespa wouldn't complain. Hope this helps.
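To make the 50-word constraint concrete, here is a sketch of the packing step such a chunker would need, independent of which analyzer produces the sentences (plain Java, no Vespa or Lucene dependencies; the greedy strategy is an assumption, not how Lucene Linguistics would necessarily do it):

import java.util.ArrayList;
import java.util.List;

public class SentencePacker {

    // Greedily pack sentences into chunks of at most maxWords words.
    // A single sentence longer than the limit becomes its own oversized chunk.
    static List<String> pack(List<String> sentences, int maxWords) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int words = 0;
        for (String sentence : sentences) {
            int n = sentence.split("\\s+").length;
            if (words > 0 && words + n > maxWords) {  // flush the current chunk
                chunks.add(current.toString());
                current.setLength(0);
                words = 0;
            }
            if (words > 0) current.append(' ');
            current.append(sentence);
            words += n;
        }
        if (words > 0) chunks.add(current.toString());
        return chunks;
    }
}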

bratseth commented 7 months ago

Yes, thanks!