michmech / irish-word-frequency

About 6,500 Irish lemmas ordered by corpus frequency, with noise removed.
Open Data Commons Open Database License v1.0
30 stars, 6 forks

proper name removal? #2

Open. eoghanmurray opened this issue 5 years ago

eoghanmurray commented 5 years ago

I was wondering why 'dobhar' appears so high up in the list. After puzzling over the dictionary entries on focloir & teanglann, I remembered that Gaoth Dobhair is likely a common Gaeltacht placename in the source texts. I just want to flag this as an issue in case others use this repository, and to ask whether proper names were correctly identified. (I know Gaillimh is in the list and kept capitalized, which is fine.)

michmech commented 1 year ago

Yes, this is a problem that happens a lot when lemmatizing Irish-language texts. Irish-language placenames often consist of normal, perfectly meaningful words. It is difficult to (automatically) separate the occurrences of such words inside placenames from their occurrences outside placenames. This skews the frequency statistics a bit, inflating the counts of words that appear in frequent placenames.
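
For anyone who wants to experiment with this, one possible workaround is to mask known multiword placenames before counting. The sketch below is only an illustration and not how this list was built: the `PLACENAMES` gazetteer and the `count_outside_placenames` function are hypothetical names, and it assumes the text is already tokenized and has not yet been lemmatized.

```python
from collections import Counter

# Hypothetical gazetteer: multiword placenames as lowercased token tuples.
PLACENAMES = {("gaoth", "dobhair"), ("baile", "átha", "cliath")}


def count_outside_placenames(tokens, placenames=PLACENAMES):
    """Count lowercased tokens, skipping tokens inside a listed placename."""
    max_len = max((len(p) for p in placenames), default=0)
    counts = Counter()
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate window first.
        for n in range(max_len, 0, -1):
            window = tuple(t.lower() for t in tokens[i:i + n])
            if len(window) == n and window in placenames:
                matched = n
                break
        if matched:
            i += matched  # skip the whole placename
        else:
            counts[tokens[i].lower()] += 1
            i += 1
    return counts


tokens = ["Gaoth", "Dobhair", "agus", "dobhar", "i", "mBaile", "Átha", "Cliath"]
counts = count_outside_placenames(tokens)
print(counts)
```

In this toy input the free-standing "dobhar" is counted while "Gaoth Dobhair" is skipped. The placename after "i" is not caught, because the eclipsed form "mBaile" does not match the gazetteer entry. That is a small example of why exact matching alone is not enough for Irish, where initial mutations and inflected forms would also need handling.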