We view Large Language Models (LLMs) as stochastic language layers in a network, where the learnable parameters are the natural language prompts at each layer. We stack two such layers, feeding the output of one layer into the next, and call the resulting architecture a Deep Language Network (DLN).
Because phi-2 has a context length of only 2k tokens, generating new prompts with a batch size larger than 10 spends most of its time repeatedly invoking model.generate, discarding data points on each retry until the request fits within the model's maximum context size. This significantly slows down both the DLN optimization and the model server.

This PR checks max_length before calling the endpoint, avoiding the unnecessary calls and slow retries.
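A minimal sketch of such a pre-call check, assuming a hypothetical filter_prompts helper and a whitespace-based token count as a stand-in for the model's real tokenizer (names and the 2048-token limit are illustrative, not the PR's actual implementation):

```python
MAX_CONTEXT_LENGTH = 2048  # phi-2 context window, in tokens (assumed)


def count_tokens(text):
    # Stand-in tokenizer: real code would use the model's tokenizer
    # (e.g. len(tokenizer(text)["input_ids"])) for an exact count.
    return len(text.split())


def filter_prompts(prompts, max_new_tokens=256):
    """Split prompts into those that fit the context window and those
    that would otherwise trigger a slow generate-and-retry loop
    at the model server."""
    fits, too_long = [], []
    for prompt in prompts:
        # Reserve room for the tokens the model will generate.
        if count_tokens(prompt) + max_new_tokens <= MAX_CONTEXT_LENGTH:
            fits.append(prompt)
        else:
            too_long.append(prompt)
    return fits, too_long
```

Only the prompts in `fits` are sent to the endpoint; oversized ones are dropped (or truncated) client-side instead of being rejected after a wasted model.generate call.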