Open dantalyon opened 1 year ago
Um, I am not sure. Maybe your process is getting stuck somewhere?
I think the issue is that the input is not shared with the second GPU. I had a similar issue with microsoft/bloom-deepspeed-inference-int8: if I repeat the input X times (X = number of GPUs), inference proceeds and produces the output.
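A minimal sketch of that workaround, just to illustrate the idea: every rank launched by the `deepspeed` launcher gets a copy of the prompt, so no rank blocks waiting for input. The function name `replicate_prompt` and the `WORLD_SIZE` fallback are my own choices, not something from the inference scripts.

```python
import os

def replicate_prompt(prompt, num_gpus=None):
    """Duplicate the prompt once per GPU rank.

    The deepspeed launcher sets WORLD_SIZE in each process's
    environment; we fall back to 1 for single-process runs.
    """
    if num_gpus is None:
        num_gpus = int(os.environ.get("WORLD_SIZE", "1"))
    return [prompt] * num_gpus

# e.g. with two V100s (WORLD_SIZE=2) every rank sees both copies:
batch = replicate_prompt("Hello, BLOOM!", num_gpus=2)
```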
Instead, I tried bigscience/bloom based on bloom-accelerate-inference.py, and it works well with interactive input.
I am trying to create a simple chatbot using the bloom-7b1 model (I may use bigger models later) based on bloom-ds-zero-inference.py. Here is my code:
I have not yet applied any post-processing to the output. This works fine if I run it with
but when I run it with
I am using two Tesla V100 GPUs, with deepspeed==0.9.2, torch==1.14.0a0+410ce96, and Python 3.8.10.