embed.text: implement inference_mode="local" based on Embed4All

bmschmidt commented 7 months ago

Third question is whether this should include tests.

In general, though, this looks like a good, valuable addition. Thanks, @cebtenzzre!

cebtenzzre commented 7 months ago

> What is the behavior here if I pass in a single string of length 10,000 tokens?

With default parameters (long_text_mode=truncate), the input will be truncated to 2048 tokens, including the prefix.
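As a rough illustration of the truncation described above, here is a minimal sketch in plain Python. It is not the Embed4All implementation: `truncate_with_prefix` and the token ids are hypothetical, and the only detail taken from this thread is that the 2048-token limit includes the prefix.

```python
MAX_TOKENS = 2048  # limit quoted above; it counts the prefix tokens too


def truncate_with_prefix(prefix_tokens, text_tokens, max_tokens=MAX_TOKENS):
    """Return prefix + text, cut so the total never exceeds max_tokens."""
    if len(prefix_tokens) >= max_tokens:
        raise ValueError("prefix alone fills the token budget")
    # Whatever the prefix does not use is left over for the text itself.
    budget = max_tokens - len(prefix_tokens)
    return list(prefix_tokens) + list(text_tokens[:budget])


# Hypothetical token ids: a 3-token prefix and a 10,000-token document.
prefix = [101, 102, 103]
document = list(range(10_000))
kept = truncate_with_prefix(prefix, document)
```

With a 3-token prefix, only the first 2045 document tokens survive, so a longer prefix leaves less room for the text.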

> What guarantees of consistency do we want to have between the cloud service and nomic embed local? If I put that long string into both places, will I get the same value back?

The Embed4All backend has been updated to be fully compatible with the chunking and truncation behavior of the Nomic embedding API, especially when Embed4All is used with atlas=True. In the testing I've done, the result has always been the same within a very small margin of error (mostly due to floating-point inaccuracies, with a few very minor tokenizer differences for certain non-Latin characters that are arguably bugs on the HF side).
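One way to check "the same within a very small margin of error" is to compare the two vectors by cosine distance. The sketch below is an assumption about how such a check could look, not the test used in this PR; `embeddings_match` and the `1e-3` tolerance are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def embeddings_match(a, b, tol=1e-3):
    """True if the cosine distance between a and b is at most tol.

    The tolerance is an assumed value; pick one that fits your data.
    """
    return 1.0 - cosine_similarity(a, b) <= tol
```

In practice `a` and `b` would be the remote and local embeddings of the same input; a failure on text with non-Latin characters would point at the tokenizer differences mentioned above.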

> What about at 4096 tokens? (Our context window is 8192, but IIRC we chop up into smaller chunks than that in atlas_cloud. cc @apage43)

The input will still be truncated to 2048 tokens. The longest input you can use with long_text_mode=mean is 8192 tokens (excluding the prefix); any more than that and an error will be raised.
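To make the long_text_mode=mean limit concrete, here is a hedged sketch of chunk-and-average embedding. Only the 8192-token cap (excluding the prefix) comes from this thread. The 2048-token chunk size, the unweighted mean, the `ValueError` type, and the names `embed_long_text_mean` and `toy_embed` are assumptions for illustration, and prefix handling is left out.

```python
MAX_MEAN_TOKENS = 8192  # limit quoted above, excluding the prefix
CHUNK_TOKENS = 2048     # assumed chunk size, mirroring the truncation limit


def embed_long_text_mean(text_tokens, embed_chunk, chunk_tokens=CHUNK_TOKENS):
    """Embed each chunk separately and return the unweighted mean vector."""
    if not text_tokens:
        raise ValueError("nothing to embed")
    if len(text_tokens) > MAX_MEAN_TOKENS:
        # The thread only says "an error will be raised"; the type is assumed.
        raise ValueError("input exceeds the 8192-token limit")
    chunks = [
        text_tokens[i:i + chunk_tokens]
        for i in range(0, len(text_tokens), chunk_tokens)
    ]
    vectors = [embed_chunk(chunk) for chunk in chunks]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def toy_embed(chunk):
    """Stand-in for a real model: a 2-d vector derived from the chunk."""
    return [float(len(chunk)), float(chunk[0])]


tokens = list(range(4096))  # hypothetical token ids
mean_vector = embed_long_text_mean(tokens, toy_embed)
```

A 4096-token input yields two chunks here, whereas the default truncate mode would keep only the first 2048 tokens.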

cebtenzzre commented 7 months ago

I added some tests for embed.text, and fixed a few bugs. Outstanding issues:

- Increasing n_ctx doesn't actually allow any more sequences to be embedded in one forward pass of Embed4All.
- Attempting to initialize Embed4All on the GPU more than once crashes. This could previously be reproduced by calling embed.text with a non-CPU device and changing the Embed4All kwargs:
  ```
  from nomic import embed
  embed.text(['x'], inference_mode='local', device='gpu')
  embed.text(['x'], inference_mode='local', device='gpu', n_ctx=4096)
  ```
  For now, we raise NotImplementedError in this case.
- Embedding an empty string with Embed4All crashes. For now, we raise NotImplementedError in this case. Related to ggerganov/llama.cpp#6498, which also introduces a breaking change for the Nomic Embed GGUFs and has not yet been included in Embed4All.
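The two NotImplementedError guards above could look roughly like the following. This is a standalone sketch, not the code in this PR: `check_request`, the module-level `_active_gpu_kwargs`, and the `'cpu'` device string are hypothetical.

```python
_active_gpu_kwargs = None  # kwargs of the GPU model created so far, if any


def check_request(texts, device, **model_kwargs):
    """Raise NotImplementedError for the two known-crashing cases."""
    global _active_gpu_kwargs
    if any(text == "" for text in texts):
        raise NotImplementedError("embedding an empty string is not supported")
    if device != "cpu":
        if _active_gpu_kwargs is not None and _active_gpu_kwargs != model_kwargs:
            # A second GPU initialization with different kwargs would crash.
            raise NotImplementedError(
                "cannot re-initialize the model on the GPU with different kwargs"
            )
        _active_gpu_kwargs = dict(model_kwargs)


def raises_not_implemented(*args, **kwargs):
    """Small helper for trying the guard without crashing the caller."""
    try:
        check_request(*args, **kwargs)
    except NotImplementedError:
        return True
    return False
```

Repeating a GPU call with identical kwargs passes the guard in this sketch, since no re-initialization would be needed; only a change such as adding n_ctx=4096 is rejected.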

nomic-ai / nomic

embed.text: implement inference_mode="local" based on Embed4All #287