some words getting detected for no reason

iamkittykat commented 4 months ago

For example, "its h" gets detected as sh_t. The same happens for n and more!

iamkittykat commented 4 months ago

monyasau commented 3 months ago

You're right, I tried it out too and got the same outcome.

monyasau commented 3 months ago

@iamkittykat @joschan21, I think it is because the model splits the input into 3-character pieces, then tests those pieces against known profane words, and it ignores whitespace. For example, "its h" would get split into something like this: [ its, ith, ist, ish, iht, ihs, tis, tih, tsi, tsh, thi, ths, sit, sih, sti, sth, shi, sht, hit, his, hti, hts, hsi, hst ]. Then it tests all the items in the array against permutations of known profane words, in this case "shit", and when it sees that most of the items in the array match those of "shit", it marks the input as profane.
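
To make the guess above concrete, here is a minimal Python sketch of that hypothesis. It is only an illustration of the idea described in this comment, not the actual profanity.dev model; the function names `three_letter_permutations` and `overlap_ratio` are made up for this example, and the real scoring may work quite differently.

```python
from itertools import permutations


def three_letter_permutations(text: str) -> set[str]:
    """Drop whitespace, then build every ordered 3-letter pick from the letters."""
    # Assumption: whitespace is ignored, as guessed in the comment above.
    letters = [ch for ch in text.lower() if not ch.isspace()]
    return {"".join(p) for p in permutations(letters, 3)}


def overlap_ratio(text: str, profane_word: str) -> float:
    """Fraction of the profane word's 3-letter pieces that also occur in the text."""
    candidate = three_letter_permutations(text)
    reference = three_letter_permutations(profane_word)
    if not reference:
        return 0.0
    return len(candidate & reference) / len(reference)


# "its h" without the space has the same letters as the profane word,
# so every piece matches and the input looks profane under this scheme.
print(sorted(three_letter_permutations("its h")))
print(overlap_ratio("its h", "shit"))
print(overlap_ratio("hello", "shit"))
```

If the model really behaves like this, any input that shares its letters with a known profane word would be flagged no matter where the spaces are, which would explain the false positive reported here.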

monyasau commented 3 months ago

I would have fixed it, but the dataset for the model is too large for me to download.

iamkittykat commented 2 months ago

okay

joschan21 / profanity.dev

some words getting detected for no reason #40