How to run
Run OPT-125M:
Configure
Configure model:
Available models: opt-125m, opt-6.7b, opt-30b, opt-175b.
Configure tensor parallelism
The <TensorParallelismWorldSize> can be an integer in [1, #GPUs]. The default is 1.
Configure checkpoint
The <CheckpointPath> can be a file path or a directory path. If it's a directory path, all files under the directory will be loaded.
Configure queue
The <QueueSize> can be an integer in [0, MAXINT]. If it's 0, the request queue size is infinite. If it's a positive integer, incoming requests are dropped when the request queue is full (the response has HTTP status code 406).
Configure batching
The <MaxBatchSize> can be an integer in [1, MAXINT]. The engine will form batches whose size is less than or equal to this value.
Note that the batch size is not always equal to <MaxBatchSize>, as some consecutive requests may not be batched.
Configure caching
This will cache <CacheSize> unique requests, and for each unique request it caches <CacheListSize> different results. A random result will be returned if the cache is hit.
The <CacheSize> can be an integer in [0, MAXINT]. If it's 0, caching is disabled. The <CacheListSize> can be an integer in [1, MAXINT].
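
To make the caching behavior above concrete, here is a minimal Python sketch. It is not the engine's actual implementation: the class name `ListCache`, the least-recently-used eviction policy, and the rule that a request only counts as a hit once its result list is full are all assumptions made for illustration.

```python
import random
from collections import OrderedDict


class ListCache:
    """Hypothetical sketch: up to cache_size requests, each with up to list_size results."""

    def __init__(self, cache_size, list_size):
        if cache_size < 0 or list_size < 1:
            raise ValueError("cache_size must be >= 0 and list_size must be >= 1")
        self.cache_size = cache_size
        self.list_size = list_size
        self._entries = OrderedDict()  # request -> list of distinct results

    def get(self, request):
        """Return a random cached result, or None on a miss."""
        results = self._entries.get(request)
        # Assumption: a request is a hit only once its list is full,
        # so the first list_size distinct results are always generated fresh.
        if results is None or len(results) < self.list_size:
            return None
        self._entries.move_to_end(request)  # mark as most recently used
        return random.choice(results)

    def add(self, request, result):
        """Store a freshly generated result for a request."""
        if self.cache_size == 0:
            return  # a cache size of 0 disables caching
        if request not in self._entries:
            if len(self._entries) >= self.cache_size:
                # Assumption: evict the least recently used request.
                self._entries.popitem(last=False)
            self._entries[request] = []
        results = self._entries[request]
        if len(results) < self.list_size and result not in results:
            results.append(result)
        self._entries.move_to_end(request)


cache = ListCache(cache_size=2, list_size=2)
cache.add("hello", "result A")
print(cache.get("hello"))  # None: only one of two results is cached so far
cache.add("hello", "result B")
print(cache.get("hello"))  # either "result A" or "result B", chosen at random
```

Under these assumptions, a larger <CacheListSize> gives more varied responses for repeated requests, at the cost of more generation calls before a request starts hitting the cache.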
Other configurations