michmech / irish-word-frequency

About 6,500 Irish lemmas ordered by corpus frequency, with noise removed.
Open Data Commons Open Database License v1.0

bigrams? #1

Closed by eoghanmurray 5 years ago

eoghanmurray commented 5 years ago

Hi, I'm interested in the script / methodology used to construct this list.

Specifically, 'coinne' comes up quite high in the frequency list, but I imagine that's because of its use in phrases such as 'i gcoinne' (against), 'gan choinne' (unexpectedly) & 'os coinne' (in front of/opposite).

From a language-learning point of view, I'd like to learn these phraselets separately, so my idea is to allow bigrams alongside high-frequency words. E.g. given the corpus frequency of 8507 for 'coinne', maybe the above three phrases have frequencies of (say) 4000, 3000, and 1000. In that case they would appear in the top 6,500 list and bump the plain 'coinne' off it (it would now have a frequency score of only 507 after subtracting the bigram frequencies).
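To make the arithmetic concrete, here is a minimal sketch of that redistribution. The function name `redistribute` and the input shapes are my own assumptions for illustration, not how the list was built, and the phrase counts are the made-up figures from the paragraph above rather than real corpus numbers.

```python
from collections import Counter

def redistribute(word_counts, phrase_counts):
    """Add phrases as their own entries and subtract their counts from the head lemma.

    word_counts:   {lemma: corpus frequency}
    phrase_counts: {phrase: (head_lemma, phrase frequency)}
    Both shapes are hypothetical; the head lemma is given explicitly because
    the phrase may contain a mutated form ('gcoinne', 'choinne').
    """
    combined = Counter(word_counts)
    for phrase, (head, count) in phrase_counts.items():
        combined[phrase] = count
        # Never let the remaining standalone count go below zero.
        combined[head] = max(combined[head] - count, 0)
    return combined

# Illustrative figures only (the "say" numbers from the comment above).
words = {"coinne": 8507}
phrases = {
    "i gcoinne": ("coinne", 4000),
    "gan choinne": ("coinne", 3000),
    "os coinne": ("coinne", 1000),
}

combined = redistribute(words, phrases)
for entry, freq in combined.most_common():
    print(entry, freq)
```

The merged counter can then be cut off at the top 6,500 entries in the usual way, with phrases and single words competing on the same scale.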

Is the source code for how this list was created available?

With thanks!

michmech commented 5 years ago

Hi,

Your assumption about coinne is exactly right. Also, your idea of producing a frequency list that mixes multi-word units and individual words is a good one. Go ahead and do it!

I'm afraid I have no source code and no methodology to share for my list, though. It was a nixer I did many moons ago and, frankly, I don't remember any more how exactly I went about it.

eoghanmurray commented 5 years ago

That's unfortunate, but no problem! I'm currently wondering how wise it is to dive into another open-ended problem space :D Your linked sources should definitely be enough to get started!

One question, in case you do recall: did you do any stemming or similar normalisation for verb forms, lenition, etc., e.g. 'chuirfidh' -> 'cuir'? And are there any known resources for that?
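For context, this is the kind of naive normalisation I have in mind: a rough sketch that only undoes initial mutations (lenition and eclipsis) on consonants. The function name `strip_initial_mutation` is made up for illustration. It does not handle verb endings, so it gets 'chuirfidh' only as far as 'cuirfidh'; going on to 'cuir' would need a real lemmatiser. It will also mangle words that legitimately start with these letter pairs (e.g. 'thar'), and it ignores prefixes on vowel-initial words.

```python
# Eclipsis prefixes mapped back to the unmutated initial consonant.
ECLIPSIS = {
    "bhf": "f",
    "mb": "b",
    "gc": "c",
    "nd": "d",
    "ng": "g",
    "bp": "p",
    "dt": "t",
}

# Consonants that take lenition (written as consonant + 'h').
LENITABLE = "bcdfgmpst"

def strip_initial_mutation(word):
    """Naively undo eclipsis or lenition at the start of a word."""
    w = word.lower()
    # Check eclipsis first so that 'bhf' is not mistaken for a lenited 'b'.
    for prefix, base in ECLIPSIS.items():
        if w.startswith(prefix):
            return base + w[len(prefix):]
    if len(w) > 1 and w[0] in LENITABLE and w[1] == "h":
        return w[0] + w[2:]
    return w

for form in ["gcoinne", "choinne", "coinne", "chuirfidh"]:
    print(form, "->", strip_initial_mutation(form))
```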

Many thanks for making this available. In case you don't know, it forms the basis of the following deck, which I'm hoping to improve on: https://ankiweb.net/shared/info/1975966926

eoghanmurray commented 5 years ago

I see in your blog article that you mention how Irish is 'Highly Periphrastic'. https://michmech.github.io/awesome-irish/ Thanks for making all these great resources available :D