microsoft / aici

AICI: Prompts as (Wasm) Programs
MIT License

token length limit #98

Closed mmoskal closed 2 months ago

mmoskal commented 2 months ago

StackRecognizer currently has a 130-byte limit on token length, and there is also an assert!(word.len() < 0xff); in toktree.rs, which stems from the toktree format.

We can probably just ignore tokens longer than 255 bytes (so far I've seen them only in the starcoder tokenizer, which has one slightly longer token made up of spaces).
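A minimal sketch of what "ignoring" over-long tokens could look like when the tree is built: skip them rather than hit the assert. This is not the actual toktree.rs code. `MAX_TOKEN_LEN` and `partition_tokens` are hypothetical names, and the threshold is taken from the assert quoted above, so a token of exactly 255 bytes is skipped too.

```rust
/// Hypothetical limit mirroring the `assert!(word.len() < 0xff)` check.
const MAX_TOKEN_LEN: usize = 0xff;

/// Splits a vocabulary into tokens that fit the length limit and the ids of
/// the tokens that get skipped. Illustrative only, not the toktree.rs API.
fn partition_tokens(vocab: &[Vec<u8>]) -> (Vec<(u32, &[u8])>, Vec<u32>) {
    let mut kept = Vec::new();
    let mut skipped = Vec::new();
    for (id, word) in vocab.iter().enumerate() {
        if word.len() < MAX_TOKEN_LEN {
            // Short enough to be encoded: keep it under its original id.
            kept.push((id as u32, word.as_slice()));
        } else {
            // Too long for the format: record the id so callers can treat
            // the token as one that is never allowed.
            skipped.push(id as u32);
        }
    }
    (kept, skipped)
}
```

Keeping the original ids matters here: dropping a token from the tree must not shift the ids of the tokens that follow it.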

CC @saikat107