Before this patch, `llm chat` crashed once a conversation became sufficiently long (>4000 tokens by default). With this patch we just cleave off the beginning of the context so the conversation can continue. This is kind of hacky (once the context length is reached, the LLM can only produce `n_ctx - truncate_ctx` tokens per response, and of course forgets anything it can't deduce from the last `truncate_ctx` tokens), so any suggestions are welcome.
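For concreteness, here is a minimal sketch of the truncation idea, not the actual patch: once the accumulated token context grows past `n_ctx`, only the most recent `truncate_ctx` tokens are kept. The helper name and defaults below are illustrative assumptions.

```python
def truncate_context(tokens: list[int],
                     n_ctx: int = 4096,
                     truncate_ctx: int = 2048) -> list[int]:
    """Hypothetical helper: drop the oldest tokens once the context window is full.

    If the context already exceeds n_ctx, keep only the last truncate_ctx
    tokens so the model still has room to generate a response; otherwise
    return the context unchanged.
    """
    if len(tokens) > n_ctx:
        return tokens[-truncate_ctx:]
    return tokens
```

The trade-off is exactly the one described above: after truncation kicks in, each response is limited to roughly `n_ctx - truncate_ctx` new tokens, and anything outside the retained window is forgotten.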