hpcaitech / EnergonAI


[opt] add opt-6.7b and add fastapi server #162

Closed ver217 closed 1 year ago

ver217 commented 1 year ago

How to run

Run OPT-125M:

python opt_fastapi.py opt-125m

Configure

Configure model:

python opt_fastapi.py <model>

Available models: opt-125m, opt-6.7b, opt-30b, opt-175b.

Configure tensor parallelism

python opt_fastapi.py <model> --tp <TensorParallelismWorldSize>

The <TensorParallelismWorldSize> can be an integer in [1, #GPUs]. The default is 1.

Configure checkpoint

python opt_fastapi.py <model> --checkpoint <CheckpointPath>

The <CheckpointPath> can be a file path or a directory path. If it's a directory path, all files under the directory will be loaded.
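The file-or-directory rule can be sketched as follows. This is only an illustration, not the code in opt_fastapi.py: the helper name `resolve_checkpoint_files` is hypothetical, and it assumes that "all files under the directory" means the directory's direct children, taken in sorted order.

```python
from pathlib import Path
from typing import List


def resolve_checkpoint_files(checkpoint_path: str) -> List[Path]:
    """Return the files to load for a --checkpoint value (hypothetical helper)."""
    path = Path(checkpoint_path)
    if path.is_file():
        # A single file is loaded as-is.
        return [path]
    if path.is_dir():
        # Assumption: only direct children are loaded, in sorted order.
        return sorted(p for p in path.iterdir() if p.is_file())
    raise FileNotFoundError(f"checkpoint path does not exist: {checkpoint_path}")
```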

Configure queue

python opt_fastapi.py <model> --queue_size <QueueSize>

The <QueueSize> can be an integer in [0, MAXINT]. If it's 0, the request queue is unbounded. If it's a positive integer, incoming requests are dropped while the queue is full, and the response has HTTP status code 406.
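A minimal sketch of this queue policy, using only the Python standard library. The class `RequestQueue` and its `submit` method are hypothetical names for illustration; the server's actual queue implementation may differ.

```python
import queue

NOT_ACCEPTABLE = 406  # status code reported for dropped requests
OK = 200


class RequestQueue:
    """Bounded request queue that rejects new requests when full (illustrative)."""

    def __init__(self, queue_size: int = 0) -> None:
        # queue.Queue treats maxsize <= 0 as unbounded, which matches
        # the rule that a queue size of 0 means "infinite".
        self._queue = queue.Queue(maxsize=queue_size)

    def submit(self, prompt: str) -> int:
        """Enqueue a request and return the HTTP status the client would see."""
        try:
            self._queue.put_nowait(prompt)
        except queue.Full:
            return NOT_ACCEPTABLE
        return OK
```

With `RequestQueue(2)`, the first two calls to `submit` are accepted and the third is rejected until a worker removes an item from the queue.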

Configure batching

python opt_fastapi.py <model> --max_batch_size <MaxBatchSize>

The <MaxBatchSize> can be an integer in [1, MAXINT]. The engine forms batches whose size is less than or equal to this value.

Note that the batch size is not always equal to <MaxBatchSize>, as some consecutive requests may not be batched.
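The description does not say why some consecutive requests cannot be batched. The sketch below assumes, purely for illustration, that each request carries a compatibility key (for example, its generation parameters) and that only consecutive requests with the same key can share a batch. The function `make_batches` and the `(key, prompt)` request shape are hypothetical.

```python
from typing import Hashable, List, Sequence, Tuple

Request = Tuple[Hashable, str]  # (compatibility key, prompt); assumed shape


def make_batches(requests: Sequence[Request], max_batch_size: int) -> List[List[Request]]:
    """Group consecutive compatible requests into batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    batches: List[List[Request]] = []
    for req in requests:
        key = req[0]
        last = batches[-1] if batches else None
        # Extend the current batch only if it has room and the keys match.
        if last is not None and len(last) < max_batch_size and last[0][0] == key:
            last.append(req)
        else:
            batches.append([req])
    return batches
```

Under this assumption, a stream with keys `a, a, a, b, a` and a maximum batch size of 2 yields batches of sizes 2, 1, 1 and 1, so most batches are smaller than the maximum.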

Configure caching

python opt_fastapi.py <model> --cache_size <CacheSize> --cache_list_size <CacheListSize>

This caches <CacheSize> unique requests. For each unique request, it caches <CacheListSize> different results. On a cache hit, one of those results is returned at random.

The <CacheSize> can be an integer in [0, MAXINT]. If it's 0, caching is disabled. The <CacheListSize> can be an integer in [1, MAXINT].
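One way to picture this two-level cache is the sketch below. It is not the server's implementation: the class name `ListCache` is hypothetical, and two behaviors are assumptions, namely that the oldest request is evicted when the cache is full and that a lookup counts as a hit only once the result list for that request is full.

```python
import random
from collections import OrderedDict
from typing import Hashable, List, Optional


class ListCache:
    """Caches several results per unique request (illustrative sketch)."""

    def __init__(self, cache_size: int, cache_list_size: int) -> None:
        if cache_size < 0 or cache_list_size < 1:
            raise ValueError("cache_size must be >= 0 and cache_list_size >= 1")
        self.cache_size = cache_size
        self.cache_list_size = cache_list_size
        self._entries: "OrderedDict[Hashable, List[str]]" = OrderedDict()

    def get(self, key: Hashable) -> Optional[str]:
        """Return a random cached result, or None on a miss."""
        results = self._entries.get(key)
        # Assumption: a hit requires a full result list, so early requests
        # still reach the model and fill the list with different outputs.
        if results is None or len(results) < self.cache_list_size:
            return None
        return random.choice(results)

    def add(self, key: Hashable, result: str) -> None:
        """Store a freshly generated result for a request."""
        if self.cache_size == 0:
            return  # a cache size of 0 disables caching
        if key not in self._entries:
            if len(self._entries) >= self.cache_size:
                # Assumption: evict the oldest request first.
                self._entries.popitem(last=False)
            self._entries[key] = []
        results = self._entries[key]
        if len(results) < self.cache_list_size:
            results.append(result)
```

A larger result list trades memory for more varied responses to repeated prompts.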

Other configurations

python opt_fastapi.py -h