urchade / GLiNER

Generalist and Lightweight Model for Named Entity Recognition (Extract any entity types from texts) @ NAACL 2024
https://arxiv.org/abs/2311.08526
Apache License 2.0

Advice for inference speedup #88

Open yishusong opened 2 months ago

yishusong commented 2 months ago

Hi team,

I'm running inference on a g5.24xlarge GPU instance. The data is currently structured in a Pandas dataframe. I use Pandas apply method to apply the predict_entities function. When the df gets fairly large (~1.5M rows), it takes days to run the inference.

I'm wondering if there is a way to increase GPU utilization? I suppose Pandas df is not the most efficient data structure... or maybe there is a parameter I missed that can boost GPU utilization?

Any advice is much appreciated!

Marwen-Bhj commented 2 months ago

Hello @yishusong, the Pandas apply method is slow. I suppose you want to run the model on a specific column and store the output in another, so you could:

  1. transform that column into a list and use model.batch_predict_entities(your_list, labels)

  2. create a dictionary from that output and join it back with the dataframe

You would probably run out of memory (OOM), so make sure to run this in batches (split your data) and call torch.cuda.empty_cache() between them; see the sketch below.
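For example (a rough sketch; the "text" column name, the example labels, and the chunk size are placeholders for your own values):

import pandas as pd
import torch
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
model.to("cuda")

# Stand-in for your real dataframe with the raw documents in a "text" column
df = pd.DataFrame({"text": ["sample text 1", "sample text 2"]})

labels = ["person", "organization"]   # example entity types
texts = df["text"].tolist()           # 1. transform the column into a list

chunk_size = 64                       # tune so one chunk fits in GPU memory
all_entities = []
for start in range(0, len(texts), chunk_size):
    chunk = texts[start:start + chunk_size]
    all_entities.extend(model.batch_predict_entities(chunk, labels))
    torch.cuda.empty_cache()          # release cached memory between chunks

# 2. join the predictions back (simplified: by row position instead of a dictionary)
df["entities"] = all_entities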

About increasing GPU utilization, I am not sure how to raise it, or even how to check that the GPU is actually used during inference; I hope someone can help with that.

urchade commented 2 months ago

You can create batches like this:

# Sample text data
all_text = ["sample text 1", "sample text 2", …, "sample text n"]

# Entity types to extract (example values)
labels = ["person", "organization"]

# Define the batch size
batch_size = 10

# Function to create batches
def create_batches(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]

# Example usage of the generator function
all_predictions = []
for batch in create_batches(all_text, batch_size):
    predictions = model.batch_predict_entities(batch, labels)
    all_predictions.extend(predictions)

yishusong commented 2 months ago

Thank you very much for the replies! I'll try it out shortly.

Re: @Marwen-Bhj's comment about GPU... I haven't looked into the source code yet, but is it possible to use the model through Hugging Face? I was thinking of something like device_map='auto' to use all GPUs, or setting the data type to float16 to make the model smaller. Does the code base offer configurations like this?

If not, maybe a memory optimized instance will perform better?

urchade commented 2 months ago

You can try the automatic mixed precision (AMP) module in PyTorch for inference. For me it helps speed up training, but I have not tried it for inference:

import torch
from torch.cuda.amp import autocast

# Run the forward pass in float16 where it is safe to do so
with autocast(dtype=torch.float16):
    predictions = model.batch_predict_entities(batch, labels)

Marwen-Bhj commented 2 months ago

@urchade I tried AMP, it did not increase the inference speed.

Heads-up @yishusong: surprisingly, running inference on a CPU cluster is at least 3x faster than on a GPU.
CPU cluster: Intel(R) Xeon(R) Platinum 8175M CPU @ 2.50GHz
GPU instance: NVIDIA V100

yishusong commented 2 months ago

Thanks! On CPU, joblib can be used to parallelize across cores, so there should be even more speedup; a rough sketch is below.
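Something like this, perhaps (untested sketch; texts, labels, and model are the same objects as in the earlier snippets, and the gain depends on how much PyTorch already parallelizes across CPU cores on its own):

from joblib import Parallel, delayed

# Split the documents into chunks that each worker processes independently
chunk_size = 64
chunks = [texts[i:i + chunk_size] for i in range(0, len(texts), chunk_size)]

def predict_chunk(chunk):
    return model.batch_predict_entities(chunk, labels)

# Threading backend: the model is shared and PyTorch releases the GIL in its heavy ops
results = Parallel(n_jobs=-1, backend="threading")(
    delayed(predict_chunk)(c) for c in chunks
)
all_entities = [entity for chunk_result in results for entity in chunk_result]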

urchade commented 2 months ago

Ok, that's weird but ok 😅

Did you try model.to('cuda') instead of model.cuda()?

Marwen-Bhj commented 2 months ago

@urchade that fixed it! Thank you :)

yishusong commented 2 months ago

Thanks a lot! This indeed sped up inference a lot.

However, model.to('cuda') seems to utilize only 1 GPU. From what I found online, nn.DataParallel(model) won't extend to GLiNER's batch prediction...
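A workaround I may try (untested sketch; it assumes two replicas fit in memory and reuses batch_predict_entities from the snippets above): load one copy of the model per GPU and split the texts between them in separate threads.

from concurrent.futures import ThreadPoolExecutor
from gliner import GLiNER

labels = ["person", "organization"]          # example entity types
texts = ["sample text 1", "sample text 2"]   # your full list of documents
devices = ["cuda:0", "cuda:1"]

# Load one model replica per GPU
models = []
for device in devices:
    replica = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
    replica.to(device)
    models.append(replica)

# Contiguous split so the results can simply be concatenated in order
half = len(texts) // 2
shards = [texts[:half], texts[half:]]

def run_shard(replica, shard):
    return replica.batch_predict_entities(shard, labels)

# One thread per device; PyTorch releases the GIL during the heavy ops
with ThreadPoolExecutor(max_workers=len(devices)) as pool:
    shard_results = list(pool.map(run_shard, models, shards))

all_entities = shard_results[0] + shard_results[1]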

lifepillar commented 2 months ago

I'm also interested in how to boost performance using multiple GPUs.

bartmachielsen commented 1 month ago

Hi, would it also be possible to speed up using AWS Inferentia / Optimum Neuron? (see article)

yishusong commented 1 month ago

I don't think Inferentia works, because it only supports a very limited list of HF models. It also might not be compatible with CUDA, so there could be other dependency issues.

vijayendra-g commented 2 weeks ago

@yishusong @Marwen-Bhj How were you able to achieve inference within seconds on CPU? It takes close to 18-20 minutes with gliner_medium-v2.1.

from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_medium-v2.1")
model.to('cpu')

text = """
Cristiano Ronaldo dos Santos Aveiro (Portuguese pronunciation: [kɾiʃˈtjɐnu ʁɔˈnaldu]; born 5 February 1985) is a Portuguese professional footballer who plays as a forward for and captains both Saudi Pro League club Al Nassr and the Portugal national team. Widely regarded as one of the greatest players of all time, Ronaldo has won five Ballon d'Or awards,[note 3] a record three UEFA Men's Player of the Year Awards, and four European Golden Shoes, the most by a European player. He has won 33 trophies in his career, including seven league titles, five UEFA Champions Leagues, the UEFA European Championship and the UEFA Nations League. Ronaldo holds the records for most appearances (183), goals (140) and assists (42) in the Champions League, goals in the European Championship (14), international goals (128) and international appearances (205). He is one of the few players to have made over 1,200 professional career appearances, the most by an outfield player, and has scored over 850 official senior career goals for club and country, making him the top goalscorer of all time.
"""

labels = ["person", "award", "date", "competitions", "teams"]

entities = model.predict_entities(text, labels)

for entity in entities:
    print(entity["text"], "=>", entity["label"])