suhasbhairav / JuliaMachineLearning

A program that makes use of AdaGram and generates results

import data from serelex database #4

Open alexanderpanchenko opened 9 years ago

alexanderpanchenko commented 9 years ago

Install English JoBimText distributional thesaurus to Serelex search engine

Background

A distributional thesaurus (DT) is a word similarity graph in which each node is a lexical item, such as a noun, verb or multiword expression. Each edge links semantically similar words, such as synonyms, hypernyms and associations. The edge weight represents a measure of semantic similarity/relatedness between the two words. Standard distributional thesauri are sparse, scale-free graphs with up to millions of nodes. Most often, such graphs are obtained automatically with the help of corpus analysis algorithms known as distributional models [1].

Visualisation of DT graphs is helpful for many reasons:

The JoBimText project has produced a number of high-quality distributional thesauri for the English language. Serelex.org is a lexico-semantic search engine that lets users query and visualise such thesauri. Your goal is to import the JoBimText thesauri into the Serelex format [1].

Task description

  1. Install Serelex locally. See instructions here: https://github.com/PomanoB/lsse. The original Serelex database data are available here: http://panchenko.me/data/serelex/serelex-dump.sql.gz.
  2. Download the English DT data in CSV format here: http://sourceforge.net/projects/jobimtext/files/data/models/google_books__sim.gz/download
  3. Use the script to import the file to the MySQL database: https://github.com/PomanoB/lsse/blob/master/import_v2_mysql.js

If this doesn't work, write a Python script that loads the JoBimText thesaurus into the Serelex MySQL database (English language). The script should take a DT as input and generate an SQL file that can be loaded into the MySQL database.

  4. Check that the JoBimText DT is available in the local copy of Serelex.
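
The fallback Python script mentioned above could look roughly like the sketch below. It is only a sketch under assumptions: the DT file is assumed to be gzip-compressed, tab-separated text with one `word, related word, score` triple per line, and the table name `relations`, its columns `model_id`, `word`, `related_word`, `weight`, and the function names are hypothetical placeholders, not the actual Serelex schema. The real table layout should be taken from serelex-dump.sql.gz before using anything like this.

```python
import csv
import gzip
import math
from typing import Iterable, Iterator, Sequence, TextIO


def sql_quote(text: str) -> str:
    """Quote a string as a MySQL literal (assumes default sql_mode,
    where backslash is an escape character)."""
    return "'" + text.replace("\\", "\\\\").replace("'", "''") + "'"


def dt_to_sql(rows: Iterable[Sequence[str]], model_id: int,
              table: str = "relations",  # hypothetical table name
              batch_size: int = 1000) -> Iterator[str]:
    """Turn (word, related_word, score) rows into batched INSERT statements."""
    # Hypothetical column names; adapt them to the real Serelex schema.
    head = f"INSERT INTO {table} (model_id, word, related_word, weight) VALUES\n"
    batch = []
    for row in rows:
        if len(row) < 3:
            continue  # skip short or empty lines
        try:
            weight = float(row[2])
        except ValueError:
            continue  # skip a header line or a malformed score
        if not math.isfinite(weight):
            continue  # NaN and infinity are not valid SQL numbers
        batch.append(
            f"({int(model_id)}, {sql_quote(row[0])}, {sql_quote(row[1])}, {weight!r})"
        )
        if len(batch) >= batch_size:
            yield head + ",\n".join(batch) + ";"
            batch = []
    if batch:
        yield head + ",\n".join(batch) + ";"


def export_dt(dt_path: str, model_id: int, out: TextIO) -> None:
    """Read a gzip-compressed, tab-separated DT file and write SQL to `out`."""
    with gzip.open(dt_path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for statement in dt_to_sql(reader, model_id):
            out.write(statement + "\n")
```

Calling `export_dt` with the path of the downloaded DT file, a model id that is not yet used in the database, and an open output file would produce an SQL file that can then be loaded with the `mysql` command-line client. Batching the rows into multi-row INSERT statements keeps the output smaller and usually loads faster than one statement per edge.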
alexanderpanchenko commented 9 years ago

Import a new model.

suhasbhairav commented 9 years ago

Loaded Serelex database and imported the new model as well. Currently, there are three models in the database.

alexanderpanchenko commented 9 years ago

Why 3, not 4? You should have 3 base models + one new loaded by you.

suhasbhairav commented 9 years ago

Two models came from loading the original Serelex database dump, and the third one was added by running the .js script on the separately downloaded file.

suhasbhairav commented 9 years ago

Did I miss something?

alexanderpanchenko commented 9 years ago

Normally, the dump contains three models: French, English and Russian. Please check whether this is so. You are supposed to add the 4th model. It may be that the dump contains only two models. If that is the case, then everything is OK.

suhasbhairav commented 9 years ago

I'll re-run the script on a fresh database to see whether 3 models (French, English and Russian) are being created.

alexanderpanchenko commented 9 years ago

ok

suhasbhairav commented 9 years ago

I loaded the Serelex dump into a fresh database. Only two models were loaded. Then I ran the script on the CSV file to load another model. Currently, there are three models in the database.