jbrry / Irish-BERT

Repository to store helper scripts for creating an Irish BERT model.

Mismatch between implementation and description of punctuation filter #98

Closed jowagner closed 2 years ago

jowagner commented 2 years ago

https://github.com/jbrry/Irish-BERT/blob/f2ed03f272cc39906374b91f42b7dd4393a50b55/filters/customfilters.py#L14 says the character percentage is returned. However, the code calculates the token ratio and only accepts a token as punctuation if it is a single character. Which one is intended? Tokens vs. characters, and percentage vs. ratio (different denominator)?

jbrry commented 2 years ago

Without opening OpusFilter just yet, it should be characters and ratio. Assuming (for the moment) that sent is a string, the code should return the punctuation-to-character ratio. As far as I know, the code doesn't calculate the token ratio, unless I've missed something? Does the example below show that it is characters and ratio?

import string
punct = set(string.punctuation)

#sent = "Paris is the capital of?"
sent = ".!#@abcd"

# Split on whitespace; the nested comprehensions below iterate over every character of every word.
words = sent.split()
if len(words) >= 1:
    # All non-whitespace characters in the sentence.
    num_chars = len([c for w in words for c in w])
    # Only those characters that are punctuation.
    num_punct_chars = len([c for w in words for c in w if c in punct])
    punct_ratio = num_punct_chars / num_chars
    print(f"there are {num_chars} chars, {num_punct_chars} are punctuation. The punct-to-char ratio is: {punct_ratio}")

jowagner commented 2 years ago

Re-examining the code, I agree that it counts characters. When I created this issue, I had not understood that for w in words for c in w is a nested loop over the characters of each word. Thanks for testing the code.
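
For anyone landing here with the same question, the two readings discussed in this thread can be compared side by side. This is only a minimal sketch, not the code in customfilters.py: the function names char_punct_ratio and token_punct_ratio are hypothetical, and the token-based version is just one guess at what "token ratio" would mean (single-character punctuation tokens divided by all whitespace-separated tokens).

```python
import string

PUNCT = set(string.punctuation)


def char_punct_ratio(sent: str) -> float:
    """Punctuation characters divided by all non-whitespace characters."""
    # Nested comprehension: for each word, for each character in that word.
    chars = [c for w in sent.split() for c in w]
    if not chars:
        return 0.0
    return sum(c in PUNCT for c in chars) / len(chars)


def token_punct_ratio(sent: str) -> float:
    """Single-character punctuation tokens divided by all tokens (hypothetical reading)."""
    tokens = sent.split()
    if not tokens:
        return 0.0
    return sum(len(t) == 1 and t in PUNCT for t in tokens) / len(tokens)


for sent in (".!#@abcd", "Paris is the capital of ?"):
    print(f"{sent!r}: char ratio={char_punct_ratio(sent)}, token ratio={token_punct_ratio(sent)}")
```

For ".!#@abcd" the character-based ratio is 0.5 while the token-based ratio is 0.0, since the whole string is a single token longer than one character. A percentage would simply be either ratio multiplied by 100.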