Closed by V-G-spec 1 month ago
Create a list of terms that we should remove. There should be a list of such terms on GitHub or something similar. Check how NER fits into this. Remove sentences with people's names, or replace the names with a placeholder.
I would rather replace names with [MASK] or [NAME] or something.
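A minimal sketch of the placeholder idea, assuming the NER step already returns character spans for person entities. The function name `mask_spans` and the hard-coded spans are hypothetical; a real pipeline would take the spans from whichever NER model we end up using.

```python
def mask_spans(text, spans, placeholder="[NAME]"):
    """Replace each (start, end) character span in `text` with `placeholder`."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        if start < cursor:
            continue  # skip spans that overlap one already masked
        pieces.append(text[cursor:start])
        pieces.append(placeholder)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


# Spans as an NER model might report them for person entities (hard-coded here).
sentence = "Alice met Bob in Paris."
person_spans = [(0, 5), (10, 13)]
masked = mask_spans(sentence, person_spans)
```

Switching to [MASK] is just `mask_spans(sentence, person_spans, placeholder="[MASK]")`, so the choice of token does not need to be settled before writing the filter.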
NER could be used to detect landmarks and similar entities, which could then be linked to countries (-> nationalities) or religious places (-> religions).
Small issue: certain people are named after festivities (namely, Easter). Should we only count double names (first name plus surname) as a person? I find that extremely limiting, since most text will likely omit the surname depending on how formal it is, and some people have more than one first name. Or shall I write a filter that treats every word classified as both a festivity and a personal name as just a festivity?
Vote with a like for the first option and a <3 for the second option!
Good catch. But I don't see a drawback in treating such examples as both a name and a festivity. We would then filter these samples out whenever we want to avoid either names or festivities.
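A rough sketch of the "keep both labels" idea, assuming a token can carry a set of labels instead of a single one. The two lexicons and the helper names (`labels_for`, `should_drop`) are hypothetical stand-ins for whatever the NER model and term lists actually provide.

```python
import re

# Hypothetical lexicons; real ones would come from NER output or curated lists.
PERSON_NAMES = {"easter", "alice"}
FESTIVITIES = {"easter", "christmas"}


def labels_for(token):
    """Return every label that applies to the token (possibly more than one)."""
    labels = set()
    lowered = token.lower()
    if lowered in PERSON_NAMES:
        labels.add("NAME")
    if lowered in FESTIVITIES:
        labels.add("FESTIVITY")
    return labels


def should_drop(sentence, blocked):
    """True if any token in the sentence carries at least one blocked label."""
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return any(labels_for(token) & set(blocked) for token in tokens)
```

With this, "Easter" is caught whether we are filtering names or festivities, so no tie-breaking rule is needed.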
String matching + NER. Create 3 sets of filters for religion, race and gender.
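A possible shape for the three filter sets, as a sketch only. The term lists are tiny hypothetical placeholders, not the real lists, and the NER side is reduced to an optional `ner_categories` argument, assuming some earlier step has already mapped entity labels to one of our three categories.

```python
import re

# Hypothetical, deliberately tiny term lists; the real ones still need collecting.
FILTER_TERMS = {
    "religion": {"church", "mosque", "synagogue", "temple"},
    "race": {"ethnic", "racial"},
    "gender": {"he", "she", "man", "woman"},
}

# One case-insensitive, whole-word pattern per category.
FILTER_PATTERNS = {
    category: re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in sorted(terms)) + r")\b",
        re.IGNORECASE,
    )
    for category, terms in FILTER_TERMS.items()
}


def matched_categories(text, ner_categories=()):
    """Return the categories hit by string matching, plus any flagged by NER."""
    hits = {
        category
        for category, pattern in FILTER_PATTERNS.items()
        if pattern.search(text)
    }
    # Keep only NER categories we actually have a filter set for.
    hits.update(c for c in ner_categories if c in FILTER_TERMS)
    return hits
```

Whole-word matching keeps "he" from firing inside "The" or "weather"; a sample would be dropped (or masked) when `matched_categories` returns a category we want to exclude.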