utterworks / fast-bert

Super easy library for BERT based NLP models
Apache License 2.0

how to manage output values of prediction (predictor.predict_batch) #290

Closed Elzic6 closed 3 years ago

Elzic6 commented 3 years ago

I use predictor.predict_batch in a multi-label text classification process, which I apply to the validation set (test_set):

valid_predictions = predictor.predict_batch(test_texts)

valid_results = pd.DataFrame({'verbatim' : test_texts, 'prediction' : valid_predictions})

The label predictions are grouped in a single field, and the function presents the results in ranked order (from the highest prediction value to the lowest, per label). So the first review will have as output: [(label3, 0.99), (label6, 0.95), …].

image

Problem: from one row (review) to another, the order in which the labels appear differs, because it depends on the prediction values.

Even when using the "columns" argument of pd.DataFrame, I get the predictions dispatched into separate columns, but the labels are mixed...

Screenshot 2021-05-15 at 14 26 08

Is there a way to have predict_batch split the predictions per label?

Thanks for any suggestions! Cheers

Pawel-Kranzberg commented 3 years ago

@Elzic6 - The current structure of predict_batch() is by design. As a workaround please try this approach:

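import pandas as pd

# Sample data: one list of (label, probability) tuples per text
test_texts = ['aaa', 'bbb']
valid_predictions = [[('b', 0.9), ('a', 0.07), ('c', 0.03)], [('a', 0.5), ('c', 0.3), ('b', 0.2)]]
# Build one row per text and one column per label
results_by_labels = pd.DataFrame.from_dict({text: dict(preds) for text, preds in zip(test_texts, valid_predictions)}, orient='index')
# Sort the columns so the labels keep the same order in every row
results_by_labels = results_by_labels[sorted(results_by_labels)]

Elzic6 commented 3 years ago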

works fine :) Thanks a lot for your support for solving this issue and others!
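
For reference, the workaround above can also be wrapped in a small helper. This is only a sketch, not part of fast-bert: the function name `predictions_to_frame` and the sample data are hypothetical, and it assumes predict_batch returns, for each text, a list of (label, probability) tuples as shown in this thread. Unlike the dict-keyed version, it builds the rows from a list, so two identical texts remain two separate rows.

```python
import pandas as pd


def predictions_to_frame(texts, predictions):
    """Return one row per text and one probability column per label.

    Hypothetical helper. Assumes `predictions` holds, for each text,
    a list of (label, probability) tuples in arbitrary order.
    """
    if len(texts) != len(predictions):
        raise ValueError("texts and predictions must have the same length")
    # dict(pairs) turns [(label, prob), ...] into {label: prob, ...}
    rows = [dict(pairs) for pairs in predictions]
    frame = pd.DataFrame(rows)
    # Sort the label columns so their order no longer depends on the scores
    frame = frame[sorted(frame.columns)]
    # Keep the text as a regular column so duplicate texts stay separate rows
    frame.insert(0, "verbatim", list(texts))
    return frame


# Made-up sample data in the same shape as the example in this thread
sample_texts = ["aaa", "bbb"]
sample_predictions = [
    [("b", 0.9), ("a", 0.07), ("c", 0.03)],
    [("a", 0.5), ("c", 0.3), ("b", 0.2)],
]
results = predictions_to_frame(sample_texts, sample_predictions)
print(results)
```

With real data, the call would presumably look like `predictions_to_frame(test_texts, predictor.predict_batch(test_texts))`. A label that a row does not contain shows up as NaN in that row.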