vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

Custom embedder with custom config definition #23675

Closed danitico closed 2 years ago

danitico commented 2 years ago

Describe the bug: I am trying to set up a custom embedder that is very similar to BertBaseEmbedder. When I try to get the path of the tokenizer file, I get the path to the directory that contains it rather than the absolute path to the file itself.

To Reproduce: The same thing happens with the BertBaseEmbedder class, so one can reproduce it by setting up an example that uses paths to refer to the vocabulary and the model.

Expected behavior: It should return the correct absolute path to the file.

Screenshots:

The error I describe: [Screenshot 2022-08-16 at 10 19 12]

services.xml: [Screenshot 2022-08-16 at 10 19 24]

sentence-embedder.def: [Screenshot 2022-08-16 at 10 30 37]

Environment (please complete the following information):

Vespa version 8.31.22


bratseth commented 2 years ago

The embedder configuration syntax allows models to be referenced by path, url, or id, so you can write e.g. <model path='...' id='...'/>. This lets the same configuration be deployed both on self-hosted systems, using path (or url), and on Vespa Cloud, using id.

However, this syntax is only supported for certain well-known config definitions. For others, you can use the regular config system to set a path: <model>...</model>. Doing that here solves this issue.
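
For illustration, here is a minimal sketch of what the regular form might look like in services.xml. The component id, class, bundle, config namespace, field names and file paths are all hypothetical (they are not taken from the screenshots above), and it assumes the custom sentence-embedder.def declares both fields with the plain `path` type.

```xml
<!-- services.xml fragment: every id, class, bundle, namespace and file path here is hypothetical -->
<container id="default" version="1.0">
    <component id="sentenceEmbedder" class="com.example.SentenceEmbedder" bundle="my-bundle">
        <config name="com.example.sentence-embedder">
            <!-- Regular config syntax: the path is the element text.
                 Assumes the .def file declares these fields with type "path". -->
            <tokenizerVocab>models/vocab.txt</tokenizerVocab>
            <transformerModel>models/model.onnx</transformerModel>

            <!-- Attribute syntax, which at the time of this issue was only
                 handled for certain well-known config definitions:
                 <tokenizerVocab path="models/vocab.txt"/>
            -->
        </config>
    </component>
</container>
```

With the element-text form, the expectation is that the embedder receives a path to the file itself rather than to its containing directory.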

That said, I guess this is a common expectation, so I've added support for using this syntax in all embedders in https://github.com/vespa-engine/vespa/pull/23710