hpcaitech / EnergonAI

Large-scale model inference.
Apache License 2.0

Added parallel code for chatglm-6B #225

Open Caesar1993 opened 1 year ago

Caesar1993 commented 1 year ago

This PR adds parallel inference code for ChatGLM-6B. Because the model has relatively few parameters, parallel inference is not faster than loading it on a single card, but the code can serve as a reference for parallel inference with larger GLM models.

  1. Split the fused QKV projection in the Hugging Face ChatGLM checkpoint into per-head slices, take out each head's Q, K, and V, and concatenate them back into a whole QKV block for each tensor-parallel rank (see the first sketch after this list).
  2. Move ChatGLM's layer definitions into `__init__` and rebuild the forward function on top of Colossal-AI's basic parallel layers (see the second sketch below).
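
A minimal sketch of the weight reshuffling in step 1, assuming ChatGLM-6B's `query_key_value.weight` uses the per-head interleaved layout `[q0, k0, v0, q1, k1, v1, ...]`; the function name, signature, and shapes are illustrative, not the PR's actual code:

```python
import torch

def shard_fused_qkv(weight: torch.Tensor, num_heads: int, head_dim: int,
                    tp_size: int) -> list:
    # weight: (3 * num_heads * head_dim, hidden), stored by the HF checkpoint
    # in per-head interleaved order [q0, k0, v0, q1, k1, v1, ...] (assumption)
    hidden = weight.shape[-1]
    w = weight.view(num_heads, 3, head_dim, hidden)  # one (q, k, v) triple per head
    q, k, v = w.unbind(dim=1)                        # each: (num_heads, head_dim, hidden)
    heads_per_rank = num_heads // tp_size
    shards = []
    for r in range(tp_size):
        sl = slice(r * heads_per_rank, (r + 1) * heads_per_rank)
        # take this rank's per-head Q, K, V slices and concatenate them
        # back into one whole QKV block for that rank
        shards.append(torch.cat([
            q[sl].reshape(-1, hidden),
            k[sl].reshape(-1, hidden),
            v[sl].reshape(-1, hidden),
        ], dim=0))
    return shards
```

With this layout, each rank's local QKV output can later be split into Q, K, and V with a plain `chunk(3, dim=-1)`.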
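
And a sketch of the restructuring in step 2: layers declared in `__init__`, with the forward pass rebuilt on 1D tensor-parallel layers. It assumes Colossal-AI's `Linear1D_Col` / `Linear1D_Row` (import paths differ across versions) and omits ChatGLM specifics such as rotary embeddings and attention masks; the class and argument names are hypothetical:

```python
import torch
import torch.nn as nn
from colossalai.nn import Linear1D_Col, Linear1D_Row  # 1D tensor-parallel layers

class ParallelGLMAttention(nn.Module):
    def __init__(self, hidden_size: int, num_heads: int):
        super().__init__()
        self.head_dim = hidden_size // num_heads
        # column-parallel fused QKV: each rank holds only its slice of heads
        self.query_key_value = Linear1D_Col(hidden_size, 3 * hidden_size)
        # row-parallel output projection: all-reduces the partial results
        self.dense = Linear1D_Row(hidden_size, hidden_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        b, s, _ = hidden_states.shape
        qkv = self.query_key_value(hidden_states)     # local heads only
        q, k, v = qkv.chunk(3, dim=-1)                # works given the rank-wise QKV layout above
        # reshape to (batch, local_heads, seq, head_dim)
        q, k, v = (t.view(b, s, -1, self.head_dim).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        context = scores.softmax(dim=-1) @ v
        context = context.transpose(1, 2).reshape(b, s, -1)
        return self.dense(context)                    # all-reduce across ranks
```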