joschan21 / profanity.dev

1.37k stars 104 forks

some words getting detected for no reason #40

Open iamkittykat opened 4 months ago

iamkittykat commented 4 months ago

For example, "its h" gets flagged as sh_t; the same happens for n and more!

iamkittykat commented 4 months ago

[image]

monyasau commented 3 months ago

You're right, I tried it out too, and I got the same outcome.

monyasau commented 3 months ago

@iamkittykat @joschan21, I think it is because the model splits the input into 3-character chunks, ignoring whitespace, and then tests those chunks against known profane words. For example, "its h" would get split into something like this: [ its, ith, ist, ish, iht, ihs, tis, tih, tsi, tsh, thi, ths, sit, sih, sti, sth, shi, sht, hit, his, hti, hts, hsi, hst ]. Then it will test all the items in the array against permutations of known profane words, in this case "shit", and when it sees that most of the items in the array match those of "shit", it marks the input as profane.
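
To make the guess above concrete, here is a rough Python sketch of the behaviour being described. It only illustrates the hypothesis in this comment and is not taken from the profanity.dev code: the function names (`three_char_chunks`, `overlap_ratio`, `looks_profane`) and the 0.5 threshold are made up for the example.

```python
from itertools import permutations


def three_char_chunks(text):
    """Every ordered 3-letter arrangement of the input's letters, whitespace ignored."""
    letters = "".join(text.split())  # drop all whitespace: "its h" -> "itsh"
    return {"".join(p) for p in permutations(letters, 3)}


def overlap_ratio(text, word):
    """Fraction of the input's chunks that also occur among the word's chunks."""
    chunks = three_char_chunks(text)
    if not chunks:
        return 0.0
    return len(chunks & three_char_chunks(word)) / len(chunks)


def looks_profane(text, blocklist, threshold=0.5):
    """Hypothetical rule: flag the input if most of its chunks match a blocked word."""
    return any(overlap_ratio(text, word) > threshold for word in blocklist)


if __name__ == "__main__":
    # "its h" and "shit" use the same four letters, so their chunk sets are identical.
    print(len(three_char_chunks("its h")))
    print(overlap_ratio("its h", "shit"))
    print(looks_profane("its h", ["shit"]))
    print(looks_profane("hello", ["shit"]))
```

If the real matching works anything like this, an input that merely shares its letters with a blocked word would be flagged, which would explain the false positive reported here. Respecting word boundaries or letter order before comparing would be one way to avoid it.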

monyasau commented 3 months ago

I would have fixed it, but the dataset for the model is too large for me to download.

iamkittykat commented 2 months ago

okay