Closed jowagner closed 2 years ago
Without opening OpusFilter just yet, I think it should be characters and a ratio. Assuming (for the moment) that `sent` is a string, the code should return the punctuation-to-character ratio. As far as I know, the code doesn't calculate a token ratio, unless I've missed something? Does the example below confirm that it is characters and a ratio?
import string

punct = set(string.punctuation)
#sent = "Paris is the capital of?"
sent = ".!#@abcd"
words = sent.split()
if len(words) >= 1:
    # Count every character of every word (split() drops the whitespace).
    num_chars = len([c for w in words for c in w])
    # Count only the characters that are punctuation.
    num_punct_chars = len([c for w in words for c in w if c in punct])
    punct_ratio = num_punct_chars / num_chars
    print(f"there are {num_chars} chars, {num_punct_chars} are punctuation. The punct-to-char ratio is: {punct_ratio}")
Re-examining the code, I agree that it counts characters. When I created this issue, I did not realise that `for w in words for c in w` is a nested loop. Thanks for testing the code.
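For anyone else who trips over the same construct: a comprehension with two `for` clauses reads left to right like nested loops, so it iterates over characters, not tokens. Below is a minimal sketch using the test string from the snippet above. The function names `count_punct_comprehension` and `count_punct_loops` are made up for illustration and do not appear in customfilters.py.

```python
import string

PUNCT = set(string.punctuation)


def count_punct_comprehension(words):
    # Same shape as the expression in the test snippet:
    # outer clause walks the words, inner clause walks each word's characters.
    return len([c for w in words for c in w if c in PUNCT])


def count_punct_loops(words):
    # Explicit nested loops, written out in the same order as the clauses above.
    count = 0
    for w in words:
        for c in w:
            if c in PUNCT:
                count += 1
    return count


words = ".!#@abcd".split()
# Both functions count punctuation characters, so the two numbers agree.
print(count_punct_comprehension(words), count_punct_loops(words))
```

Both functions count individual characters, which is why the single token `.!#@abcd` contributes four punctuation characters rather than zero or one punctuation tokens.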
https://github.com/jbrry/Irish-BERT/blob/f2ed03f272cc39906374b91f42b7dd4393a50b55/filters/customfilters.py#L14 says the character percentage is returned. However, the code calculates the token ratio and only accepts a token as punctuation if it is a single character. Which one is intended? Tokens vs. characters, and percentage vs. ratio (different denominator).
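To make the two readings in this question concrete, here is a small sketch that computes both side by side. It is not the code from customfilters.py: `char_punct_ratio` follows the character-level logic of the test snippet in this thread, while `token_punct_ratio` is a hypothetical implementation of the token-level reading described above (a token counts only if it is a single punctuation character). The example sentence is also an assumption chosen for illustration.

```python
import string

PUNCT = set(string.punctuation)


def char_punct_ratio(sent):
    # Character-level reading: punctuation characters / all non-space characters.
    chars = [c for w in sent.split() for c in w]
    if not chars:
        return 0.0
    return sum(1 for c in chars if c in PUNCT) / len(chars)


def token_punct_ratio(sent):
    # Token-level reading (hypothetical): single-character punctuation tokens / all tokens.
    tokens = sent.split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if len(t) == 1 and t in PUNCT) / len(tokens)


sent = "Paris is the capital of ?"
char_ratio = char_punct_ratio(sent)    # 1 punctuation char out of 20 chars
token_ratio = token_punct_ratio(sent)  # 1 punctuation token out of 6 tokens
print(f"char ratio: {char_ratio:.3f} ({char_ratio * 100:.1f}%)")
print(f"token ratio: {token_ratio:.3f} ({token_ratio * 100:.1f}%)")
```

The two readings differ in their denominators (characters vs. tokens), whereas a percentage is just the corresponding ratio multiplied by 100, so any filter threshold has to match whichever form the function actually returns.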