tamas-visy / cs4nlp-plmrb


Data cleaning #2

Closed V-G-spec closed 1 month ago

V-G-spec commented 6 months ago

Use string matching plus named-entity recognition (NER). Create three sets of filters: one each for religion, race, and gender.
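A minimal sketch of the string-matching half of this, assuming one term list per filter. The term lists, `FILTER_TERMS`, `compile_filter`, and `matched_categories` are hypothetical placeholders for illustration, not the lists or code the project actually uses.

```python
import re

# Hypothetical, deliberately tiny term lists; a real run would load fuller lists.
FILTER_TERMS = {
    "religion": ["church", "mosque", "temple"],
    "race": ["ethnicity", "race"],
    "gender": ["he", "she", "woman", "man"],
}


def compile_filter(terms):
    # Whole-word, case-insensitive match, so "he" does not fire inside "the".
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


FILTERS = {name: compile_filter(terms) for name, terms in FILTER_TERMS.items()}


def matched_categories(sentence):
    """Return the names of all filters whose terms occur in the sentence."""
    return {name for name, rx in FILTERS.items() if rx.search(sentence)}
```

For example, `matched_categories("She went to the church.")` reports both the gender and the religion filter, so the caller can decide per experiment which categories to drop.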

tamas-visy commented 6 months ago

Create a list of terms that we should remove. There should be an existing list on GitHub or somewhere similar. Check how NER fits into this. Should we remove sentences containing people's names, or replace the names with an empty string ("")?

tamas-visy commented 6 months ago

I would rather replace names with [MASK] or [NAME] or something.

NER could also be used to detect landmarks and similar entities that are connected to countries (-> nationalities) or religious places (-> religions).
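A rough sketch of the replacement idea, assuming some NER tool has already produced character spans with labels. The `(start, end, label)` tuple format, the `"PERSON"` label, and the `mask_entities` helper are assumptions made for illustration; adapt them to whatever the chosen NER library returns.

```python
def mask_entities(text, entities, label="PERSON", placeholder="[NAME]"):
    """Replace every span tagged `label` with `placeholder`.

    `entities` is a list of (start, end, label) character offsets,
    in the style a NER tool might return (assumed format).
    """
    spans = sorted((s, e) for s, e, l in entities if l == label)
    out, cursor = [], 0
    for start, end in spans:
        if start < cursor:
            # Skip spans that overlap one we already masked.
            continue
        out.append(text[cursor:start])
        out.append(placeholder)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


# Usage: only PERSON spans are masked; the location span is left alone.
example = mask_entities(
    "Anna visited Rome with Ben.",
    [(0, 4, "PERSON"), (13, 17, "GPE"), (23, 26, "PERSON")],
)
```

Masking instead of deleting keeps the sentence structure intact, and the untouched non-person spans (landmarks, places) remain available for the nationality and religion filters.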

SusannaDiV commented 5 months ago

Small issue: some people are named after festivities (for example, Easter). I see two options. First, we only treat an entity as a person when it is a full name (first name plus surname). I find this extremely limiting, since most text will likely omit the surname depending on the level of formality, and some people have more than one first name. Second, I write a filter that treats every word classified as both a festivity and a personal name as a festivity only.

Vote with like for the first option and a <3 for the second option!

tamas-visy commented 5 months ago

Good catch. But I don't see a drawback in treating such examples as both a name and a festivity. We would then filter these samples out whenever we want to avoid either names or festivities.
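One way to read this suggestion, sketched with hypothetical lookup sets (`PERSON_NAMES`, `FESTIVITIES`) standing in for real NER output: each token keeps every label that applies, and a sample is dropped if any of its labels is in the set we want to avoid.

```python
import re

# Hypothetical lookup sets for illustration; real labels would come from NER.
PERSON_NAMES = {"easter", "anna"}
FESTIVITIES = {"easter", "christmas"}


def token_labels(token):
    """Return all labels that apply to a token; a token may carry several."""
    t = token.lower()
    labels = set()
    if t in PERSON_NAMES:
        labels.add("name")
    if t in FESTIVITIES:
        labels.add("festivity")
    return labels


def should_drop(sentence, avoid):
    """True if any token carries at least one label from `avoid`."""
    tokens = re.findall(r"[A-Za-z]+", sentence)
    return any(token_labels(tok) & avoid for tok in tokens)
```

With this scheme an ambiguous word such as "Easter" is removed both when avoiding names and when avoiding festivities, so no disambiguation step is needed.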