vespa-engine / vespa

AI + Data, online. https://vespa.ai

Is there any way to define wild card field in schema? #17302

Closed: 107dipan closed this issue 3 years ago

107dipan commented 3 years ago

Is there any way to define wildcard fields in our schema, like in Elasticsearch? https://discuss.elastic.co/t/field-mapping-using-wildcards-newbie-question/76300

jobergum commented 3 years ago

No, Vespa is strongly typed and all fields need to be defined explicitly, so there is no wildcard field mapping in Vespa.

107dipan commented 3 years ago

Thanks. Just wondering if there is a workaround for that, like using a map or something...

Also, I was trying to understand the Vespa architecture, and one thing I am struggling with is how buckets are stored on a content node. Since all the content nodes have a document DB for storing the indexes, how is a bucket stored in the document DB? Or is a bucket a logical way of picturing how documents are split up and stored, rather than an actual physical entity?

vekterli commented 3 years ago

Your assumption is correct; buckets are a logical abstraction on top of the underlying document store(s). They do not map down to physical files.

The relationship is many-to-many: a single bucket may span many document DBs (i.e. document types), and a single document DB may contain data for many buckets. The actual document <-> bucket mapping is maintained in a separate metadata store. This also makes operations such as splitting very fast, as they do not need to move any data around on disk.
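
As a minimal conceptual sketch of that separation (not Vespa's actual data structures; the class and method names here are made up for illustration): the per-document-type store holds the payloads, while side metadata records which logical bucket each document currently belongs to, so a split only rewrites metadata.

```python
# Conceptual sketch only -- not Vespa's real implementation.
# Buckets exist purely as metadata over the document store; "splitting"
# a bucket rewrites that metadata and never copies document payloads.
from collections import defaultdict

class DocumentDb:
    """One store per document type; it has no physical per-bucket files."""

    def __init__(self):
        self.docs = {}                   # doc_id -> document payload
        self.bucket_of = {}              # doc_id -> bucket_id (metadata)
        self.buckets = defaultdict(set)  # bucket_id -> set of doc_ids

    def put(self, doc_id, payload, bucket_id):
        self.docs[doc_id] = payload
        self.bucket_of[doc_id] = bucket_id
        self.buckets[bucket_id].add(doc_id)

    def split_bucket(self, bucket_id, left_id, right_id, goes_left):
        """Reassign bucket metadata only; the stored documents are untouched."""
        for doc_id in self.buckets.pop(bucket_id, set()):
            target = left_id if goes_left(doc_id) else right_id
            self.bucket_of[doc_id] = target
            self.buckets[target].add(doc_id)
```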

107dipan commented 3 years ago

@vekterli When a bucket is split into two and redistributed, and one of the new buckets is moved to a different content node, the data will be moved, right?

vekterli commented 3 years ago

@107dipan that is correct. Vespa does not build on top of any sort of distributed FS, so data is kept entirely local per node and must therefore be copied upon redistribution. Note, though, that redistribution upon split only happens in practice when using an n= or g= document ID scheme, as these enable document co-locality.

Vespa will by default distribute buckets based on coarser levels of granularity known as super buckets. Every bucket is contained within a logical super bucket. The number of super buckets is several orders of magnitude greater than the number of nodes, statistically ensuring a reasonably even distribution across nodes as long as the documents themselves are evenly distributed across buckets. The default ID scheme satisfies this, as it is effectively just a hash of the ID. Since we already have an even distribution, there is no need to move bucket replicas to other nodes to maintain even resource usage. In this case, a split is primarily done to keep the amount of document metadata within configurable bounds per bucket.

With ID schemes that support co-location of data, document-to-bucket assignment may be very skewed (imagine 99% of all docs belonging to the same location) and we therefore may redistribute after splits for buckets containing many such documents.
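
To make the difference between the schemes concrete, here is a simplified sketch of document-to-bucket mapping. The hash function, bit widths, and ID parsing are illustrative only and do not match Vespa's exact bucket ID layout:

```python
# Simplified illustration of document-to-bucket mapping; not Vespa's exact
# bit layout or hash function.
import hashlib

def location(doc_id: str) -> int:
    """Return a 64-bit 'location' for a document ID.

    Default 'id' scheme: effectively a hash of the whole ID, spreading
    documents evenly across buckets. With n=<number> (and similarly
    g=<group>), all documents sharing that number/group share a location,
    which gives co-locality -- and can make buckets very skewed.
    """
    if ":n=" in doc_id:
        return int(doc_id.split(":n=")[1].split(":")[0])
    digest = hashlib.md5(doc_id.encode()).digest()
    return int.from_bytes(digest[:8], "little")

def bucket_id(doc_id: str, used_bits: int) -> int:
    """A bucket is identified by the lowest `used_bits` bits of the location.

    Splitting a bucket means using one more bit, so each document lands in
    one of two child buckets depending on the next bit of its location.
    """
    return location(doc_id) & ((1 << used_bits) - 1)
```

With the default hashed scheme the two children of a split end up roughly equal in size, so replicas can stay where they are; with n=/g= schemes many documents share the same location, which is the skew that can trigger redistribution after a split. (In Vespa itself, bits beyond the location are taken from the document's GID, so even heavily co-located buckets can still be split further; the sketch leaves that detail out.)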

The nitty-gritty details of this can be found in the bucket design documentation. A lot of the information is very low-level, so feel free to ask any questions you may have.

vekterli commented 3 years ago

The mapping from buckets to distributors is decided through the ideal state algorithm, so any client in the system (such as a container node) that can run this algorithm can figure out the correct distributor to forward to.

This algorithm takes the following 3 inputs:

  1. the cluster configuration, containing the set of configured nodes, redundancy settings, etc. (provided by the config servers and inferred from the application package contents)
  2. the current state of the cluster's nodes (as determined by the cluster controller and propagated to all distributors and content nodes)
  3. a given bucket identifier

and outputs the desired node(s) as an ordered sequence ranked by descending priority. The distributor that is to have the responsibility for a given bucket is always the first node in this sequence. The mapping is completely deterministic given the same input.
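
As a conceptual illustration of that determinism (this is not Vespa's actual ideal state algorithm, and ClusterConfig/ideal_nodes are made-up names), each candidate node can be given a pseudo-random score seeded by the bucket ID and the node's distribution key, and the nodes that are up are then ranked by descending score:

```python
# Conceptual sketch of deterministic, rank-by-score node selection.
# The real ideal state algorithm differs in detail; the point is that the
# same three inputs always produce the same ordered node sequence.
import hashlib
from dataclasses import dataclass

@dataclass
class ClusterConfig:
    node_keys: list   # distribution keys of the configured nodes (input 1)
    redundancy: int   # also part of the cluster configuration

def _score(bucket_id: int, node_key: int) -> int:
    # Deterministic pseudo-random score for a (bucket, node) pair.
    data = f"{bucket_id}:{node_key}".encode()
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")

def ideal_nodes(config: ClusterConfig, up_nodes: set, bucket_id: int) -> list:
    """Rank candidate nodes by descending priority.

    The arguments mirror the three inputs above: cluster configuration,
    current cluster state (reduced here to the set of 'up' nodes), and
    the bucket identifier. The responsible distributor is element 0.
    """
    candidates = [key for key in config.node_keys if key in up_nodes]
    return sorted(candidates, key=lambda key: _score(bucket_id, key), reverse=True)
```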

Containers subscribe to the same cluster configuration as the distributors/content nodes and can map any document ID down to its target bucket identifier (this is a deterministic operation based on the contents of the ID). This gives us 2 out of the 3 required inputs for the algorithm.

The remaining part (the current cluster state) is not initially known by the client, so we have to bootstrap it. This is done by blindly and randomly sending the request to one of the distributors in the cluster. This is very likely to be the wrong distributor for the bucket in question (if it's not, that's perfectly fine), and if so the distributor will automatically bounce the request back with a response that contains the most up-to-date cluster state it knows of. The client will cache this state and use it to compute the desired distributor for any requests going forward.

Assuming the cluster is stable, this lets us forward requests from clients directly to distributors without requiring any indirection such as directory lookups. If the cluster state does change, the client will eventually notice by hitting the wrong distributor for a request, at which point it will receive a newer cluster state and update its own internal cached state.
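
A minimal sketch of that client-side flow, reusing the hypothetical ideal_nodes helper from the sketch above; the WrongDistribution reply type, the transport callable, and the up_nodes attribute are all assumptions made for illustration:

```python
# Illustrative client-side sketch; the reply type and transport are made up,
# but the flow matches the description above: start with no cluster state,
# send to a random distributor, cache the state returned in a "wrong
# distributor" bounce, and route with the cached state afterwards.
import random

class WrongDistribution(Exception):
    def __init__(self, newer_cluster_state):
        self.newer_cluster_state = newer_cluster_state

class Client:
    def __init__(self, config, transport):
        self.config = config        # cluster configuration (input 1)
        self.transport = transport  # hypothetical send(target, request) callable
        self.cluster_state = None   # input 2, learned lazily from bounces

    def send(self, bucket_id, request):
        while True:
            if self.cluster_state is None:
                # Bootstrap: no state yet, pick a distributor at random.
                target = random.choice(self.config.node_keys)
            else:
                # Normal path: deterministically compute the responsible
                # distributor from config, cached state, and the bucket id.
                target = ideal_nodes(self.config,
                                     self.cluster_state.up_nodes,
                                     bucket_id)[0]
            try:
                return self.transport(target, request)
            except WrongDistribution as bounce:
                # Wrong (or stale) target: cache the newer state and retry.
                self.cluster_state = bounce.newer_cluster_state
```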