Open Piskvor opened 6 years ago
@Piskvor Thanks for the ticket. Currently the compare function compares words to a set list of bad words. These are forked from https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words.
The compare function uses the language-specific list when a suffix is used: name:en -> only compared with English bad words; name:es -> Spanish bad words. But it compares against all the main languages defined here if it is a name tag without a language suffix, and flags false positives due to this.
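The selection rule described above can be sketched roughly as follows. This is a Python illustration only, not osm-compare's actual code; `WORD_LISTS`, `MAIN_LANGUAGES` and `lists_for_tag` are hypothetical names, the list contents are placeholders, and the handling of an unknown suffix (no list at all) is an assumption.

```python
# Hypothetical stand-ins for the per-language bad-word lists.
WORD_LISTS = {
    "en": {"badword"},
    "es": {"malapalabra"},
    "ru": {"plokhoeslovo"},
}
# Assumed set of "main" languages; the real set is defined in the project.
MAIN_LANGUAGES = ("en", "es", "ru")


def lists_for_tag(key):
    """Return the language codes whose lists a tag key would be checked against."""
    if key == "name":
        # No suffix: every main language is checked, hence the false positives.
        return list(MAIN_LANGUAGES)
    if key.startswith("name:"):
        lang = key.split(":", 1)[1]
        # Assumption: a suffix without a matching list means nothing is checked.
        return [lang] if lang in WORD_LISTS else []
    return []
```

With these placeholders, `lists_for_tag("name:es")` yields only the Spanish list, while `lists_for_tag("name")` yields all three.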
Like you said, it would be good to have a description of exactly which word got flagged for profanity. This needs support added in the OSMCha database and API, and then the osm-compare compare functions need to return descriptions. Tagging in @willemarcel to write his thoughts.
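One possible shape for such a description, as a minimal Python sketch: the result carries the tag key, the matched word and the language of the list that matched. The names `ProfanityMatch` and `find_profanity` are hypothetical, and the whole-token matching is a simplification for illustration, not how the real comparator matches.

```python
from typing import NamedTuple


class ProfanityMatch(NamedTuple):
    tag_key: str   # which tag triggered the flag
    word: str      # which list entry matched
    language: str  # which language list the entry came from


def find_profanity(tags, word_lists):
    """Return the first match with its details, or None if nothing matched."""
    for key, value in tags.items():
        tokens = value.lower().split()
        for language, words in word_lists.items():
            for token in tokens:
                if token in words:
                    return ProfanityMatch(key, token, language)
    return None
```

A reviewer looking at a flagged changeset could then be shown something like "name: matched 'badword' (en)" instead of a bare flag.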
Thanks for the explanation.
Specifically in OSM, this looks easier than a general (contextless, text-only) matching:
The method for comparing against main (sic!) languages is probably a useful heuristic even for the OSM name corpus, but since we do have geographical data both for the feature at hand, and for countries, it would IMNSHO make sense to check for the local language as well.
Usually, the name tag contains the name in the local language. This might be repeated in a name:* tag, but it's not guaranteed (as seen at top: single name tag in local language only). Many countries do have an official (or de facto) local language(s - the Swiss have 4). This would help with choosing the most likely language.
In pseudocode: which country bounding box(es) contain (any nodes of) the feature (changeset?)? If any, add their language(s), if any, to the (front of?) checkable list.
I'm aware that this only looks easy as pseudocode, but "let's check for Spanish and Russian obscenities" is not very useful in locations where Spanish or Russian name= is probably an error anyway (for unrelated reasons - e.g. personal mapping).
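The pseudocode above could look roughly like this in Python. Everything here is an assumption for illustration: the bounding boxes are coarse approximations, the two-country tables are placeholders, and `languages_to_check` and `DEFAULT_LANGUAGES` are hypothetical names rather than anything in osm-compare.

```python
# (min_lon, min_lat, max_lon, max_lat); coarse approximations, illustration only.
COUNTRY_BBOXES = {
    "CZ": (12.0, 48.5, 19.0, 51.1),
    "CH": (5.9, 45.8, 10.5, 47.9),
}
COUNTRY_LANGUAGES = {"CZ": ["cs"], "CH": ["de", "fr", "it", "rm"]}
# Assumed default set of languages that is always checked.
DEFAULT_LANGUAGES = ["en", "es", "ru"]


def languages_to_check(lon, lat):
    """Put the local language(s) at the front of the default list."""
    local = []
    for country, (min_lon, min_lat, max_lon, max_lat) in COUNTRY_BBOXES.items():
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            for lang in COUNTRY_LANGUAGES.get(country, []):
                if lang not in local:
                    local.append(lang)
    # Defaults follow the local languages, without duplicates.
    return local + [lang for lang in DEFAULT_LANGUAGES if lang not in local]
```

Bounding boxes overlap near borders, so a point may pick up more than one country's languages; that is consistent with "add their language(s), if any", but real country polygons would be more precise.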
> Usually, the name tag contains the name in the local language.
I think this is an excellent point and an excellent suggestion - making the assumption that name without a language suffix is English has problems, as we can see, as it is very common for name to be the name in the local language of the area.
To expand on the above pseudo-code a little bit:
1. For the center point of the feature, look up the "country" the feature belongs to. This could be an external geocoding lookup, or a lookup against a set of country polygons.
2. Create a mapping of country to language.
3. Look up the local language for where the feature is.
4. Use the local-language list of "bad words" for the profanity check.
My guess is there may be points on the earth where you cannot get the country, or you do not have a valid language mapping, and there it's probably fine to default to English.
I guess the problem would be for egregious insults in English or another language to show up in a country that has a different local language, and hence that profanity is not detected. Here one suggestion might be to have a smaller list of "really bad words across languages" that one may want to flag for review regardless of local language.
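Put together, the steps, the English fallback and the small cross-language list might be sketched like this in Python. The function name `words_to_check` and all of its parameters are hypothetical; `country_of` stands in for whichever lookup is used (geocoder or country polygons) and is assumed to return None when no country is found.

```python
def words_to_check(lon, lat, country_of, country_language, word_lists,
                   universal_words, default_language="en"):
    """Pick the word list for a feature's center point.

    Returns (language, words), where words is the chosen language's list
    plus the small list that is flagged regardless of local language.
    """
    country = country_of(lon, lat)  # may be None if the lookup fails
    language = country_language.get(country, default_language)
    if language not in word_lists:
        # No list for the local language: fall back to the default.
        language = default_language
    return language, word_lists[language] | universal_words
```

Passing the lookup in as a function keeps the sketch independent of how the country is actually determined.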
To another point here: it would be helpful to have better descriptions of the various detectors, and a way for users on the osmcha website to be able to see these descriptions, and also be able to read the code that is running them. Also, to clarify: of course these detectors will never be (even close to) 100% "correct" - the idea is just to make it easier for humans to filter down things to inspect, and to give an interface for that.
Thank you again for the ticket @Piskvor
+cc @willemarcel
Sure, that's why I suggested "add the local language(s), if any, with the list of languages that are checked by default."
rm is one of their languages)
default_language tagged (not quite the same as official language - e.g. the U.S. of A. do not have an official language, even though English is the default language; official language is irrelevant here however). This gets complicated with multilanguage countries, and the tag is not required, e.g. Belgium doesn't have it.
To vastly reduce the false positive rate of the profanity comparator, only the first entry from the zh.json needs to be removed.
It reads "13.",
This entry causes every changeset [1] to be flagged where the number 13 is included in the tag values checked by the comparator.
I stumbled upon this problem via MapRoulette and the "Profanity OSMCha detections" challenge (more info also in my other comment on a related issue). This single-line fix would probably reduce the changesets tagged as profanity by ~70%.
[1] To be precise, tags where 13 is the last two characters of the tag value are not flagged, due to how the comparator works (regex check).
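The behaviour described above is what one would expect if the entry is used as a regular expression, where the dot matches any character. A small Python illustration follows; the real comparator is not Python and its exact matching options are not shown in this thread, so the function names here are hypothetical and the escaped variant is just one possible alternative to removing the entry, not the fix proposed above.

```python
import re

ENTRY = "13."  # the list entry quoted above


def matches_as_regex(value):
    # "." is a wildcard: "13" followed by any one character matches.
    return re.search(ENTRY, value) is not None


def matches_as_literal(value):
    # Escaping the entry makes the dot match only a literal ".".
    return re.search(re.escape(ENTRY), value) is not None
```

Under these assumptions, a value such as "Route 135" matches as a regex, while "Platform 13" does not, because nothing follows the 13 for the dot to consume.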
Fix merged, but apparently bug persists: another changeset flagged as "profanity" even though completely clean - a way matches the "13." if interpreted as regex - https://osmcha.org/changesets/102995578?aoi=e76ef7d4-5ae9-4c96-ad14-98ab138d19d2
If a feature is flagged as indecent, it would be helpful to see what exactly triggered the flag and in what language - otherwise the users are left scratching their heads over an edit that looks completely benign in the language(s) they know.
Example: https://osmcha.mapbox.com/changesets/60306121/ , specifically https://www.openstreetmap.org/relation/1293828 is tagged "Profanity tag", why? All the texts are official stop names - not rude, not naughty, not even a double entendre or a pun - yet there it is. It is quite possible that "U Waltrovky" is considered the rudest possible curse in some language (infinite monkeys and birthday paradox and whatnot), but there is no indication which.