vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0
5.79k stars 604 forks

Ability to reference an arbitrary number of parent docs of the same type from a child #21833

Open mdelser opened 2 years ago

mdelser commented 2 years ago

Is your feature request related to a problem? Please describe. At our company we have a very complex reference and analytics database that enriches unstructured data (news stories, filings, transcripts, etc.). We're currently indexing the news at sentence level into Vespa, and every sentence can have from 0 to 1000 entities detected from our knowledge graph; 90% of sentences have 10 or fewer.

For each sentence we have a collection of analytics. The majority of these analytics are directly tied to an entity in our knowledge graph and independent from the sentence while others are directly related to the sentence itself.

Currently we have a struct array in our main schema where each struct has 10 fields (subject to grow). Many are independent from the sentence and also require an update on a daily basis.

To solve this problem we want to leverage the parent/child configuration that Vespa supports, so that we can simply update the parents without having to update all the children. However, we have realized that because these analytics currently exist as an array of size N, we cannot use the parent/child relationship, as it does not support importing parent attributes into an array.

Describe the solution you'd like Be able to reference parent docs from a struct array or any other solution that fulfills the same use case.

Describe alternatives you've considered Flattening out the struct array and having 10 references of the same parent_type:
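A minimal sketch of what such a flattened layout might look like in Vespa's schema language. Every name in it (`entity`, `sentence`, `entity_ref_0`, `sentiment`, and so on) is a hypothetical placeholder, not the actual schema, and only two of the ten slots are shown. It assumes the parent type is deployed as a global document, which Vespa requires for parent/child references.

```
# entity.sd (hypothetical parent type; must be deployed as a global document)
schema entity {
    document entity {
        # Example analytic that depends only on the entity (assumed name)
        field sentiment type float {
            indexing: attribute
        }
    }
}

# sentence.sd (hypothetical child type)
schema sentence {
    document sentence {
        field text type string {
            indexing: summary | index
        }
        # One fixed reference slot per detected entity; slots 2..9 omitted
        field entity_ref_0 type reference<entity> {
            indexing: attribute
        }
        field entity_ref_1 type reference<entity> {
            indexing: attribute
        }
    }
    # Each parent attribute has to be imported once per slot
    import field entity_ref_0.sentiment as entity_0_sentiment {}
    import field entity_ref_1.sentiment as entity_1_sentiment {}
}
```

In a layout like this, every extra slot or extra parent attribute means another field declaration and another import line, and a sentence can never reference more entities than there are slots.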

This solution creates a significant amount of additional maintenance overhead and forces us to restrict the number of analytics we can have on a child to a fixed number. Maintenance overhead aside, restricting the number of analytics is counterproductive for our value proposition, as we want to be sure we identify all analytics on a particular sentence for our users.

Another solution we considered was to duplicate the sentences and have only one reference on each copy, but the majority of our sentences have 5 or more entities, so that would greatly reduce the number of sentences we can store.

Additional context The number of parent documents that we want to index is on the order of millions, while the number of child documents (sentences) is on the order of billions.

bratseth commented 2 years ago

Could you also say something about how you wish to use these fields? Grouping, ranking, search, or include in the summary.

mdelser commented 2 years ago

Hi Jon, mostly for grouping (getting counts and potentially clustering in the future), ranking (linear and ML) and searching, and in some cases summary.

johans1 commented 2 years ago

Unfortunately, we will not prioritize this now.