michmech / irish-word-frequency

About 6,500 Irish lemmas ordered by corpus frequency, with noise removed.
Open Data Commons Open Database License v1.0

`cál` too high up? #3

Open eoghanmurray opened 5 years ago

eoghanmurray commented 5 years ago

Sorry, I just wanted to register a further issue, although I know this is an old repository.
I'm wondering why cál is so high up the list, as 'kale/cabbage' doesn't seem to merit such a high position.

Anyhow, it's probably time I dived into creating a similar word frequency list myself from the source texts, as then I'll be able to investigate cases like this directly!

eoghanmurray commented 5 years ago

Another one is comhalta, which I presume is so high because the corpus contained a large number of legislative/legal texts.

I've since acquired Liostaí Bhreacadh (a book, https://www.breacadh.ie/), which covers the top 500 words and divides them up by spoken vs. written language.

Maybe a link to that would be appropriate on the front page?

michmech commented 2 years ago

"Cál" is so high up probably because the New Corpus for Ireland has incorrectly lemmatized some occurrences of "cáil" as a form of "cál". Most of the time, "cáil" is actually either its own lemma (a noun meaning "reputation", "fame") or a non-standard compound of "cá bhfuil" ("cáil tú?" = "cá bhfuil tú?" = "where are you?").
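
To illustrate the effect described above, here is a minimal Python sketch of how a wrong form-to-lemma mapping can inflate a lemma's count. Everything in it is hypothetical: the token list is toy data, not taken from the New Corpus for Ireland, and the two lookup tables (`faulty_lemmas`, `corrected_lemmas`) and the helper `lemma_counts` are names invented for this example, not how the corpus or this list was actually built.

```python
from collections import Counter

# Hypothetical toy token list; NOT real corpus data.
tokens = ["cáil", "cáil", "cál", "cáil", "comhalta", "cáil"]

# Hypothetical lemmatizer tables (assumptions for illustration only):
# the first wrongly treats "cáil" as a form of "cál",
# the second treats "cáil" as its own lemma.
faulty_lemmas = {"cáil": "cál", "cál": "cál", "comhalta": "comhalta"}
corrected_lemmas = {"cáil": "cáil", "cál": "cál", "comhalta": "comhalta"}


def lemma_counts(tokens, lemma_of):
    """Count tokens per lemma, falling back to the surface form itself."""
    return Counter(lemma_of.get(token, token) for token in tokens)


faulty = lemma_counts(tokens, faulty_lemmas)
corrected = lemma_counts(tokens, corrected_lemmas)

print("faulty:   ", faulty.most_common())
print("corrected:", corrected.most_common())
```

With the faulty table, every "cáil" token is credited to "cál", so "cál" climbs the ranking; with the corrected table those counts go to "cáil" instead.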

The high score of "comhalta" can probably be explained as you say.