openeventdata / synset_validator


MANUAL FOR SETTING UP YOUR OWN VTA DICTIONARY DEVELOPMENT

We developed a tool to facilitate the translation of the English CAMEO dictionary into other languages. The translation proceeds in three phases:

Step 1: Data Collection

Step 2: Feedback Collection from human coders.

Step 3: Statistical Compilation of the translated version

Step 1: Data Collection

We started by extracting the verbs that are associated with at least one rule in the CAMEO dictionary. Then we collected the synonym sets for each of these verbs from WordNet. Each synonym set contains synonyms in English and in the language we are translating into (e.g., Spanish or Arabic). An example follows:

{
    "conceptID": "abandon_1",
    "concept": "",
    "gloss": "stop maintaining or insisting on; of ideas or claims; \"He abandoned the thought of asking for her hand in marriage\"; \"Both sides have to give up some claims in these negotiations\"",
    "sets": [{
            "lang": "en",
            "words": ["abandon", "give up"]
    }, {
            "lang": "es",
            "words": ["abandonar", "renunciar"]
    }, {
            "lang": "gr",
            "words": [list-of-synonyms-in Greek]
    }]
}

Here the sets attribute contains the synonyms in different languages (English and Spanish here, with a placeholder for Greek). The verb is listed under the concept attribute. The same verb can have multiple entries; we distinguish them by conceptID. Each entry has a gloss attribute, which helps the coder identify the context associated with the translated verbs.
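As a minimal sketch of how such an entry could be read, the snippet below parses a trimmed copy of the example and groups the synonyms by language code. The helper name words_by_language is hypothetical and not part of the tool; the Greek placeholder set and the empty concept field are left out so that the sample is valid JSON.

```python
import json

# Trimmed copy of the example entry (assumption: the real data is valid JSON
# with the same attribute names).
entry_json = """
{
    "conceptID": "abandon_1",
    "gloss": "stop maintaining or insisting on; of ideas or claims",
    "sets": [
        {"lang": "en", "words": ["abandon", "give up"]},
        {"lang": "es", "words": ["abandonar", "renunciar"]}
    ]
}
"""


def words_by_language(entry):
    """Map each language code in the entry to its list of synonyms."""
    return {s["lang"]: list(s["words"]) for s in entry["sets"]}


entry = json.loads(entry_json)
by_lang = words_by_language(entry)
print(entry["conceptID"], by_lang["es"])
```

Keying the result by language code makes it easy to show a coder only the words for the language they are validating.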

We also extracted the rules from CAMEO.2.0.txt, which is hosted in the GitHub repository of the PETRARCH2 project.

Step 2: Feedback Collection from human coders.

After collecting the data, we displayed it through the web UI for human validation. For the rules, we displayed the English rules alongside the associated translated rules in the foreign language. Coders can verify the translations suggested by previous coders and add their own.

For verb translation, we asked the coders to provide feedback in two stages. First, they select the synonym sets of the verb that match the context of the CAMEO code. Then, within the selected synonym sets, they mark each translated word as correct, incorrect, or ambiguous. After this step we have the translated verbs approved by the coder for a given CAMEO code and CAMEO rule.

Step 3: Statistical Compilation of the translated version

After we collected all the feedback, we listed the translated rules in the translated version of the dictionary. For synonyms, we first collected feedback on synonym sets: based on the content of the "gloss", coders verify whether a synonym set is appropriate in the context of the CAMEO code and rule (feedback at the synonym-set level). Once a synonym set is marked as appropriate, we consider feedback on the words it contains (feedback at the word level). Finally, we consider the feedback from all coders for a particular synonym set and make a majority-based decision on whether it is suitable for inclusion as the translated verb for a CAMEO code and rule pair.

Note: We collected the gloss-based feedback on synonym sets from the Spanish coders. For Arabic and later languages, we reuse that feedback to identify the appropriate synonym sets and show the words in the respective language for feedback collection at the word level.
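The two-level, majority-based decision described above could look roughly like the following sketch. The word labels ("correct", "incorrect", "ambiguous") come from the description of Step 2; the synset labels ("appropriate", "inappropriate"), the function names, the strict-majority threshold, and the handling of ties are assumptions made for illustration.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by a strict majority of coders, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else None


def accepted_words(synset_votes, word_votes):
    """Apply the synset-level decision first, then the word-level decision."""
    # A synset that is not judged appropriate contributes no words.
    if majority_label(synset_votes) != "appropriate":
        return []
    # Keep only the words that a majority of coders marked as correct.
    return [w for w, votes in word_votes.items()
            if majority_label(votes) == "correct"]


synset_votes = ["appropriate", "appropriate", "inappropriate"]
word_votes = {
    "abandonar": ["correct", "correct", "ambiguous"],
    "renunciar": ["correct", "incorrect", "ambiguous"],
}
print(accepted_words(synset_votes, word_votes))
```

In this sketch a word with no strict majority (such as a three-way split) is left out rather than resolved arbitrarily.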

Inclusion of new language:

We are building the system so that new languages can be integrated easily for translation and dictionary generation. For the time being, the system is limited to the languages supported by WordNet. These are the steps to get a translated version of the dictionary in another language:

  1. A verified user submits a request for the language in which he/she wants the translated version of the dictionary.
  2. The system collects data from WordNet in the background.
  3. The system updates the database with the new entries and links them with existing entities.
  4. New coders start working on the downloaded dataset and provide their feedback.

Once the feedback collection process is complete, an offline dictionary-creation tool downloads the data and populates the dictionary.

Linking with existing entity:

Linking with existing entities is a very important step, as it ensures that information is shown correctly to the user. To illustrate this, we use the example of a synonym set. This information is stored in the database in three different entities:

  1. Word - contains the English word
  2. SynsetEntry - contains the gloss and examples of a synonym set for a particular English word
  3. SynsetWord - contains the words, along with their language codes, that are part of a particular synonym set
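The relationship between the three entities can be sketched with plain Python dataclasses. The field names follow the column names that appear in the query output in this section; the field types, the lookup dictionaries, and the helper english_word_for are assumptions for illustration and do not reflect how the datastore models are actually defined.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    id: int
    text: str


@dataclass
class SynsetEntry:
    id: int
    gloss: str
    idWord: int                          # references Word.id
    source: str
    submissionId: Optional[int] = None   # type is an assumption


@dataclass
class SynsetWord:
    id: int
    idSynsetEntry: int                   # references SynsetEntry.id
    languageCode: str
    word: str
    submissionId: Optional[int] = None   # type is an assumption


def english_word_for(synset_word, entries, words):
    """Follow SynsetWord -> SynsetEntry -> Word and return the English verb."""
    entry = entries[synset_word.idSynsetEntry]
    return words[entry.idWord].text


words = {4712951634198528: Word(4712951634198528, "stand")}
entries = {
    4504727190503424: SynsetEntry(
        4504727190503424,
        'put into an upright position; "Can you stand the bookshelf up?"',
        4712951634198528,
        "wordnet",
    )
}
sw = SynsetWord(4816410148601856, 4504727190503424, "en", "place upright")
print(english_word_for(sw, entries, words))
```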

Example from the database is as follows -

SynsetEntry:

Consider one entry from the SynsetEntry table.

GQL Query: select * from SynsetEntry where __key__ = KEY(SynsetEntry, 4504727190503424)

Output:

Name/ID gloss idWord source submissionId
id=4504727190503424 put into an upright position; "Can you stand the bookshelf up?" 4712951634198528 wordnet null

Finding the corresponding English word:

GQL Query: select * from Word where __key__ = KEY(Word, 4712951634198528)

Output:

Name/ID text
id=4712951634198528 stand

Here are all the words that are part of the synonym set:

GQL Query: select * from SynsetWord where idSynsetEntry = 4504727190503424

Output:

Name/ID idSynsetEntry languageCode submissionId word
id=4704686271627264 4504727190503424 ar null نهض
id=4722041060065280 4504727190503424 ar null قوم
id=4816410148601856 4504727190503424 en null place upright
id=4845423759982592 4504727190503424 ar null قاوم_البلى
id=4862778548420608 4504727190503424 ar null أبحر_في_إتجاه_معين
id=5126898736693248 4504727190503424 ar null وقف
id=5144253525131264 4504727190503424 ar null رشح
id=5267636225048576 4504727190503424 ar null قام
id=5284991013486592 4504727190503424 ar null رجع
id=5408373713403904 4504727190503424 ar null وقف_منتصبا
id=5425728501841920 4504727190503424 ar null اقم
id=5689848690114560 4504727190503424 ar null نصب
id=5707203478552576 4504727190503424 ar null حمل
id=5830586178469888 4504727190503424 ar null كان_في_موقف
id=5847940966907904 4504727190503424 ar null وجه
id=5942310055444480 4504727190503424 en null stand
id=5971323666825216 4504727190503424 ar null تزج
id=5988678455263232 4504727190503424 ar null بعد
id=6026654354767872 4504727190503424 en null stand up
id=6252798643535872 4504727190503424 ar null اطق
id=6270153431973888 4504727190503424 ar null إتخذ_موقف
id=6534273620246528 4504727190503424 ar null صطف
id=6551628408684544 4504727190503424 ar null ظل_قائما

So whenever we collect words translated into a new language, we have to maintain the reference links depicted here. All of this is handled inside our system, so that the end user gets a simplified experience. For example, suppose we want to add German synonym set words for the English word "stand". We first download all the synsets for the word "stand" with German translations. We then match the gloss of each downloaded synset against the existing SynsetEntry records and use the id of the matching entry as the idSynsetEntry of the SynsetWord entries for the new German words.
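A minimal sketch of this gloss-matching step is shown below. It assumes that glosses are compared for equality after lowercasing and collapsing whitespace, which may be stricter or looser than what the system actually does. The function names, the shape of the downloaded synsets, and the German sample words are illustrative assumptions, not data taken from WordNet or from the database.

```python
def normalize_gloss(gloss):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(gloss.lower().split())


def link_new_words(existing_entries, downloaded_synsets, language_code):
    """Build SynsetWord rows for new-language words by matching glosses.

    existing_entries: dict mapping SynsetEntry id -> gloss
    downloaded_synsets: list of {"gloss": str, "words": [str, ...]}
    Returns (rows, unmatched), where unmatched holds synsets with no entry.
    """
    gloss_to_id = {normalize_gloss(g): entry_id
                   for entry_id, g in existing_entries.items()}
    rows, unmatched = [], []
    for synset in downloaded_synsets:
        entry_id = gloss_to_id.get(normalize_gloss(synset["gloss"]))
        if entry_id is None:
            unmatched.append(synset)
            continue
        for word in synset["words"]:
            rows.append({"idSynsetEntry": entry_id,
                         "languageCode": language_code,
                         "word": word})
    return rows, unmatched


existing = {
    4504727190503424:
        'put into an upright position; "Can you stand the bookshelf up?"',
}
downloaded = [
    # Illustrative German words; not taken from WordNet.
    {"gloss": 'Put into an upright position;  "Can you stand the bookshelf up?"',
     "words": ["aufstellen", "hinstellen"]},
    {"gloss": "a gloss with no existing entry", "words": ["beispiel"]},
]
rows, unmatched = link_new_words(existing, downloaded, "de")
print(rows)
```

Returning the unmatched synsets separately lets the caller decide whether to create new SynsetEntry records for them or to review them by hand.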