isharah-project / isharah-api

Ruby on Rails backend for isharah.com
https://www.isharah.com/

Seed words #35

Open Youssefares opened 5 years ago

Youssefares commented 5 years ago

Using the freely available book A Frequency Dictionary of Arabic, we're going to seed our dictionary with the most common Arabic words.

The dictionary has 5000 words. We'll skip the dialect-specific ones, such as كويس, أيوا, and اللي.

The first milestone will be to go through the top 2000 words of the book, adding the فصحى (Modern Standard Arabic) ones to this file. We only need the word string and its part of speech for now, so ignore all the other information in the book (plurals, other forms, etc.).

Notes:

| Abbreviation | Part of speech |
| --- | --- |
| conj | conjunction |
| interj | interjection |
| interrog | interrogative |
| num | number |
| part | particle |
| prep | preposition |
| pron | pronoun |
| adj | adjective |
| adv | adverb |
| elat | elative |
| n | noun |
| v | verb |

For now, we will copy the abbreviation into our seed file as-is and figure out the mapping from these abbreviations to Arabic parts of speech later. An example row in the file will look like this: https://github.com/Youssefares/egsl-website-api/blob/master/db/words.csv#L9-L14
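Since the abbreviations are typed by hand, a small check could catch typos before seeding. Below is a minimal sketch, not the project's actual seed code. It assumes each row is `word,abbreviation` with no header line; the helper name `parse_seed_rows` and the sample rows are made up for illustration.

```ruby
require "csv"

# Abbreviations copied from the notes table above.
POS_ABBREVIATIONS = %w[conj interj interrog num part prep pron adj adv elat n v].freeze

# Hypothetical helper: parses seed CSV text and returns [rows, errors].
# Assumption: every row is `word,abbreviation` and there is no header line.
def parse_seed_rows(csv_text)
  rows = []
  errors = []
  CSV.parse(csv_text).each_with_index do |fields, index|
    next if fields.compact.empty? # skip blank lines

    word, pos = fields.map { |field| field.to_s.strip }
    if word.empty? || !POS_ABBREVIATIONS.include?(pos)
      errors << "line #{index + 1}: unexpected row #{fields.inspect}"
    else
      rows << { word: word, part_of_speech: pos }
    end
  end
  [rows, errors]
end

# Illustrative input only; the second row uses "noun" instead of "n".
rows, errors = parse_seed_rows("في,prep\nكتاب,noun\nكان,v\n")
puts "#{rows.length} valid rows, #{errors.length} rejected"
```

Rejected rows are reported with their line number rather than silently dropped, so whoever owns that range of the file can fix them.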

Youssefares commented 5 years ago

We'll split the words as follows:
@Youssefares 1-500
@TarekAlQaddy 501-1000
@yara11 1001-1500
@miralelnahas 1501-2000

Each of us has a separate section inside the file, so we can work in parallel without losing the word order. When we're done, we'll remove the section boundaries and comments.
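Removing the boundaries at the end could be scripted rather than done by hand. This is only a sketch: it assumes boundary and comment lines start with `#`, which this thread does not specify, and `strip_section_markers` is a hypothetical helper name.

```ruby
# Hypothetical cleanup step: drops blank lines and comment/boundary lines
# (assumed to start with "#"), keeping the remaining rows in file order.
def strip_section_markers(text)
  text.each_line
      .map(&:strip)
      .reject { |line| line.empty? || line.start_with?("#") }
end

# Illustrative input only, not the real file contents.
sample = <<~SEED
  # 1-500
  في,prep

  # 501-1000
  كان,v
SEED

puts strip_section_markers(sample)
```

Because rows are only filtered and never reordered, the frequency order is preserved as long as each section stays in its assigned position in the file.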