segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.

Scoring metric, does definition make sense? #104

Closed asusdisciple closed 1 year ago

asusdisciple commented 1 year ago

I looked more closely into the scoring metric and noticed something. You score based on the indices of the predicted sentences. However, if you split two sentences and predict two correct (true) indices, say [23, 83], the scoring is only based on the index 23. Why is that? Because we score the splits: two sentences correspond to one split, so while 23 marks the split, 83 only marks the end of the text. This makes sense in a way... or maybe not, I am not sure. Even if the algorithm does not recognize the last symbol as the end of a sentence, it will still produce the index 83, since that index is derived from the lengths of the predicted sentences.

Now assume you have three sentences with the true indices [23, 83, 140, 158], and for some reason wtpsplit cannot recognize the middle sentence. It would return [23, 140, 158] and a lower F1 score. However, if I fed the sentences in separately as [23, 83] and [140, 158], the F1 score would be 1, because 83 and 158 are never considered for scoring. This makes the score dependent on the number of sentences scored at once: if I score a dataset by aggregating two lines (each representing a sentence) per loop iteration, the results would be much better than if I did it with 5 or even 10 lines. There is also a risk of losing data, unless you carry the last sentence of each iteration over into the next.

Sorry for the text blob, but maybe you know a best practice for such a problem :)
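To make the concern concrete, here is a minimal sketch of the kind of boundary-index F1 described above. The function `boundary_f1` and the convention of dropping the final index are my own illustration of the issue, not wtpsplit's actual evaluation code:

```python
# Hypothetical sketch (assumption, not the wtpsplit implementation):
# boundaries are character offsets, and the last offset (the end of the text)
# is dropped before scoring, since it is always "correct" by construction.

def boundary_f1(true_indices, pred_indices):
    # Drop the final index on both sides: it only marks the end of the text,
    # not a real split decision.
    true_set = set(true_indices[:-1])
    pred_set = set(pred_indices[:-1])
    tp = len(true_set & pred_set)
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(true_set) if true_set else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Gold boundaries [23, 83, 140, 158]; the split at 83 is missed:
print(boundary_f1([23, 83, 140, 158], [23, 140, 158]))  # 0.8 (penalized)

# Scoring the same text as two separate chunks: each chunk's final index
# (83 and 158) is dropped, so the miss is never counted against the model.
print(boundary_f1([23, 83], [23, 83]))                   # 1.0
print(boundary_f1([140, 158], [140, 158]))               # 1.0
```

Under these assumptions, the same missed split is penalized when the text is scored as one unit but invisible when the text is scored in smaller chunks, which is the dependence on chunk size described above.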

bminixhofer commented 1 year ago

Hi, sorry for the late reply.

Also, fyi, it's hard for me to parse text blobs like this; it would be helpful to structure them a bit more.

To answer your questions: