open-compass / MixtralKit

A toolkit for inference and evaluation of 'mixtral-8x7b-32kseqlen' from Mistral AI
Apache License 2.0

support alternative parallelism #2

Open 152334H opened 9 months ago

152334H commented 9 months ago

--num-gpus is implemented by sharding each expert layer across GPUs, i.e. expert parallelism

this is probably not advisable for local experimentation, especially at batch size 1, where EP only adds communication overhead with no speed benefit over naive model/pipeline parallelism.
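To make the overhead argument concrete, here is a rough back-of-the-envelope sketch, not a measurement of MixtralKit itself. It assumes a simplified cost model in which expert parallelism needs one dispatch and one combine exchange per MoE layer, while naive pipeline parallelism needs one activation hand-off per stage boundary. The names `MoEConfig`, `ep_comm_steps_per_token`, `pp_comm_steps_per_token`, and `max_busy_gpus_under_ep` are hypothetical helpers for illustration; the defaults (32 layers, 8 experts, top-2 routing) follow the published Mixtral 8x7B configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEConfig:
    # Defaults mirror the published Mixtral 8x7B shape.
    num_layers: int = 32
    num_experts: int = 8
    top_k: int = 2


def ep_comm_steps_per_token(cfg: MoEConfig, num_gpus: int) -> int:
    """Assumed model: each MoE layer costs one dispatch and one combine exchange."""
    if num_gpus <= 1:
        return 0  # everything is local on a single GPU
    return 2 * cfg.num_layers


def pp_comm_steps_per_token(num_gpus: int) -> int:
    """Assumed model: one activation hand-off per pipeline stage boundary."""
    return max(num_gpus - 1, 0)


def max_busy_gpus_under_ep(cfg: MoEConfig, num_gpus: int, batch_tokens: int) -> int:
    """Upper bound on GPUs doing expert work in one layer for one decode step."""
    return min(num_gpus, cfg.num_experts, cfg.top_k * batch_tokens)


if __name__ == "__main__":
    cfg = MoEConfig()
    for gpus in (2, 4, 8):
        print(
            f"gpus={gpus}: "
            f"EP exchanges/token={ep_comm_steps_per_token(cfg, gpus)}, "
            f"PP hand-offs/token={pp_comm_steps_per_token(gpus)}, "
            f"busy GPUs at batch 1 <= {max_busy_gpus_under_ep(cfg, gpus, 1)}"
        )
```

Under these assumptions, the number of EP exchanges grows with model depth, while pipeline hand-offs grow only with the GPU count. At batch size 1, at most `top_k` GPUs have expert work in any given layer, so the extra exchanges do not buy additional parallel compute.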

tonysy commented 9 months ago

Good suggestion. I am working on other parallelism methods. Contributions are also welcome.