Instruction-following degrades after several minutes in a session

When testing my assistant, I've noticed that it follows the instructions in my system prompt well for the first 1-2 minutes of a call. Around minutes 3-5, it starts making mistakes that the prompt specifically addresses. Has anyone else seen instruction-following get worse over the course of a session?

I'm wondering if this has to do with how context / state is managed inside the API. I assume the system prompt is always retained, while the oldest audio is dropped from the conversation first (FIFO)?

I see this in the API docs:

If a conversation goes on for a sufficiently long time, the input tokens the conversation represents may exceed the model’s input context limit (e.g. 128k tokens for GPT-4o). At this point, the Realtime API automatically truncates the conversation based on a heuristic-based algorithm that preserves the most important parts of the context (system instructions, most recent messages, and so on.) This allows the conversation to continue uninterrupted.

I just want to confirm that the "heuristic-based algorithm" always keeps the system instructions. Any extra detail you can provide would be helpful, too!
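
To make the question concrete, here is a rough Python sketch of the behavior I'm assuming: system instructions are pinned, and the oldest non-system items are dropped first once a token limit is exceeded. This is only my mental model, not the Realtime API's actual algorithm; `Item`, `truncate_fifo`, and the token counts are all hypothetical names and numbers I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    role: str    # "system", "user", or "assistant"
    tokens: int  # made-up token count for this item
    text: str = ""


def truncate_fifo(items, limit):
    """Keep all system items, then as many of the newest other items as fit."""
    pinned = [i for i in items if i.role == "system"]
    rest = [i for i in items if i.role != "system"]
    budget = limit - sum(i.tokens for i in pinned)

    kept = []
    # Walk from newest to oldest so the most recent turns survive.
    for item in reversed(rest):
        if item.tokens > budget:
            break  # everything older than this item is dropped too
        kept.append(item)
        budget -= item.tokens

    kept.reverse()  # restore chronological order
    return pinned + kept


history = [
    Item("system", 50, "instructions"),
    Item("user", 40, "turn 1"),
    Item("assistant", 40, "turn 2"),
    Item("user", 40, "turn 3"),
]
print([i.text for i in truncate_fifo(history, 140)])
# prints ['instructions', 'turn 2', 'turn 3']
```

If truncation works roughly like this, the system prompt should never be dropped, and the degradation I'm seeing would have to come from something else.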

openai / openai-realtime-api-beta, issue #40