marella / chatdocs

Chat with your documents offline using AI.
MIT License

chunksize and max_seq_length of embedding not matching #39

Closed — 94bb494nd41f closed this issue 1 year ago

94bb494nd41f commented 1 year ago

AFAIK the default length measure of RecursiveCharacterTextSplitter is len (characters), while the Instructor embeddings use a token-based measure.

The program still works; however, the chunks inserted into the database are smaller than one would expect.
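A minimal sketch of the mismatch (pure Python, no langchain dependency; whitespace splitting is only a rough stand-in for the model's real tokenizer): a chunk that measures 500 by len() contains far fewer tokens, so token-budgeted chunks end up much smaller than the model could actually handle.

```python
# Hypothetical text used only for illustration.
text = "word " * 200  # 1000 characters, 200 whitespace tokens

chunk_size = 500  # measured with len(), as the character splitter does
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

for chunk in chunks:
    n_chars = len(chunk)           # what the splitter counted
    n_tokens = len(chunk.split())  # rough proxy for model tokens
    print(n_chars, n_tokens)       # each 500-char chunk is only ~100 "tokens"
```

So a max_seq_length of 512 tokens is nowhere near exhausted by a 500-character chunk, which is why the stored chunks look smaller than expected.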

marella commented 1 year ago

As long as the model's context size is larger than the number of tokens in a chunk, it should be fine. Since tokenization depends on the model, RecursiveCharacterTextSplitter and other preprocessors cannot know about tokens.
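If one did want token-aware chunks, the idea is to hand the splitter a length function that counts tokens instead of characters (langchain's splitters accept a length_function argument for this). A toy sketch, with a whitespace tokenizer standing in for the model's real tokenizer and a hypothetical split_by_measure helper in place of the real splitter:

```python
def token_len(text: str) -> int:
    # Stand-in for a real tokenizer's token count (e.g. a HF tokenizer).
    return len(text.split())

def split_by_measure(text: str, chunk_size: int, length_function) -> list[str]:
    """Greedily pack words into chunks whose measured length stays <= chunk_size."""
    chunks, current = [], []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and length_function(candidate) > chunk_size:
            chunks.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks

text = "the quick brown fox jumps over the lazy dog " * 10
chunks = split_by_measure(text, chunk_size=20, length_function=token_len)
print([token_len(c) for c in chunks])  # every chunk is <= 20 tokens
```

But as noted above, this is optional: character-based chunks merely undershoot the token budget, which is safe, just conservative.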