unitaryai / detoxify

Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ Pytorch Lightning and 🤗 Transformers. For access to our API, please email us at contact@unitary.ai.
https://www.unitary.ai/
Apache License 2.0

false positive #56

Open ghost opened 2 years ago

ghost commented 2 years ago

wtf? For some reason this message is flagged as toxic: "who selling lup pots". Can you fix this? I'm using the original dataset.

anitavero commented 2 years ago

Thanks for reporting this example. If you notice any pattern in the examples the models falsely flag as toxic, it would be very useful if you could share it. To help us improve the models, the most useful information would be which model version you used and the raw scores it returned for the offending text.
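For example, a minimal snippet along these lines (the exact setup will vary, but the calls follow the library's README) captures both the model version and the raw scores:

from detoxify import Detoxify

# Text that was unexpectedly flagged as toxic
flagged_text = ["who selling lup pots"]

# Note which checkpoint you used: 'original', 'unbiased' or 'multilingual'
model = Detoxify('original')

# These per-label scores are what we need in order to reproduce the issue
print(model.predict(flagged_text))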

smasterparth commented 2 years ago

Hey, I have also come across this false positive issue. I was using the model to detect offensive text in a dataset. For example, some records contained the string 'Shital', which is a name, not an offensive word. A few of those records were classified as toxic while the rest were not. The same happened with records containing the word 'Nishit', which is also a name.

I tried to find a pattern explaining why some records were classified as toxic and the rest as non-toxic, but I couldn't spot one.

Let me know if there's any workaround you have come up with or are working on.

anitavero commented 2 years ago

It matters a lot which version of the model you use: "original", "unbiased" or "multilingual".

import pandas as pd
from detoxify import Detoxify

input_text = ['Shital', 'Nishit', "who selling lup pots"]

# Load each released checkpoint
model_u = Detoxify('unbiased')
model_o = Detoxify('original')
model_m = Detoxify('multilingual')

results_u = model_u.predict(input_text)
results_o = model_o.predict(input_text)
results_m = model_m.predict(input_text)

# Show the per-label scores for each model, rounded to two decimals
print("Original")
print(pd.DataFrame(results_o, index=input_text).round(2))
print("Multilingual")
print(pd.DataFrame(results_m, index=input_text).round(2))
print("Unbiased")
print(pd.DataFrame(results_u, index=input_text).round(2))

This outputs:

Original
                      toxicity  severe_toxicity  obscene  threat  insult  identity_attack
Shital                    0.82             0.01     0.57    0.00    0.05             0.00
Nishit                    0.71             0.04     0.52    0.01    0.39             0.24
who selling lup pots      0.00             0.00     0.00    0.00    0.00             0.00

Multilingual
                      toxicity  severe_toxicity  obscene  identity_attack  insult  threat  sexual_explicit
Shital                    0.82             0.00     0.54             0.00    0.41    0.00             0.01
Nishit                    0.87             0.01     0.82             0.00    0.14    0.00             0.02
who selling lup pots      0.01             0.00     0.00             0.00    0.00    0.00             0.00

Unbiased
                      toxicity  severe_toxicity  obscene  identity_attack  insult  threat  sexual_explicit
Shital                    0.67             0.00     0.21             0.00    0.03    0.00             0.52
Nishit                    0.06             0.00     0.01             0.00    0.01    0.00             0.00
who selling lup pots      0.01             0.00     0.00             0.00    0.00    0.00             0.00

Let us know if you find any other issues! If you could attach model outputs similar to the ones above, that would be really helpful!
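If it's easier, something like the following (the file name is just an example) turns the predictions into a CSV you can attach directly:

import pandas as pd
from detoxify import Detoxify

input_text = ["who selling lup pots"]
results = Detoxify('original').predict(input_text)

# Save the per-label scores so they can be attached to the issue
pd.DataFrame(results, index=input_text).round(2).to_csv("detoxify_scores.csv")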

ogencoglu commented 12 months ago

@anitavero The original model also outputs a very high false-positive toxicity value for the following text: "They had great sex!"

{'toxicity': 0.88951826, 'severe_toxicity': 0.0110040745, 'obscene': 0.4631456, 'threat': 0.0027411387, 'insult': 0.021174002, 'identity_attack': 0.0034398066}
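For comparison, the same sentence can be run through the other checkpoints using the approach from earlier in this thread (only the calls are shown here, not the scores):

from detoxify import Detoxify

text = ["They had great sex!"]

# Compare how each released checkpoint scores the same sentence
for version in ("original", "unbiased", "multilingual"):
    scores = Detoxify(version).predict(text)
    print(version, {label: round(value[0], 3) for label, value in scores.items()})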

ogencoglu commented 11 months ago

Also for this one: "Sucking power of this vacuum cleaner is great!"