sehsanm / embedding-benchmark

Word Embedding benchmark project by Shahid Beheshti University NLP Lab
GNU General Public License v3.0

Write code to run Analogy task over data sets #16

Open sehsanm opened 5 years ago

sehsanm commented 5 years ago

There must be an option to set the following options:

abb4s commented 5 years ago

Hi, I created a sample format for the test script: analogy_test.py.zip. However, I need a sample corpus, its corresponding semantic vector model, and an analogy dataset. If you define and create the project architecture, it would be clearer. I assumed that we have a datasets package for loading corpora and datasets, and a models package for loading models.
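A minimal sketch of the two packages being assumed here (the module layout, function names, and signatures are placeholders for illustration, not an agreed project API) could look like this:

# Hypothetical package layout (all names are placeholders, not a decided API):
#
#   datasets/__init__.py  -- load corpora and analogy datasets
#   models/__init__.py    -- load trained embedding models into memory

# datasets/__init__.py
def loadAnalogyDataset(path):
    """Return a list of analogy datasets found under `path`.

    Each dataset is expected to be an iterable of rows with fields
    a, b, c, d and category; the file format is still to be defined.
    """
    raise NotImplementedError

# models/__init__.py
def loadmodel(path):
    """Return an in-memory embedding model exposing getVec and getKNear (see #15)."""
    raise NotImplementedError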

sehsanm commented 5 years ago

Hi. Some points here: I'm assuming what you have sent is pseudo-code for what needs to be done. What you have to do is create a package containing the base methods for the analogy task.

The final output, from my perspective, is that you load all the analogy datasets (there may be multiple), run them, and finally create a CSV file containing the results as well as per-category scores for each dataset. So you are not dependent on a corpus; you depend on a memory-loaded model (see #15).

So the pseudo-code will be something like:


import datasets
import models

# load the analogy datasets (there may be more than one)
analogy_datasets = datasets.loadAnalogyDataset('/data/analogy')
# load the in-memory model (see #15)
model = models.loadmodel('/data/models/model_khafan.bin')

threshold = 10  # how many nearest neighbours to check (placeholder value)

for dataset in analogy_datasets:
    totals = {}    # questions seen per category
    corrects = {}  # questions answered correctly per category
    for row in dataset:
        r1 = model.getVec(row.a)
        r2 = model.getVec(row.b)
        r3 = model.getVec(row.c)
        # the analogy target d should be near c + b - a
        words = model.getKNear(r3 + r2 - r1, threshold, 'Cosine_Distance')
        totals[row.category] = totals.get(row.category, 0) + 1
        if row.d in words:
            corrects[row.category] = corrects.get(row.category, 0) + 1
    write_result_to_file(dataset, totals, corrects)
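
The write_result_to_file call at the end of the pseudo-code is not defined anywhere yet. One possible sketch, assuming the totals/corrects dictionaries are keyed by category and one CSV is written per dataset (the signature, file name, and column layout are assumptions, not a decided format):

import csv

def write_result_to_file(dataset_name, totals, corrects, out_path='analogy_results.csv'):
    """Write per-category counts and accuracy, plus an overall row, to a CSV file."""
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['dataset', 'category', 'total', 'correct', 'accuracy'])
        for category in sorted(totals):
            total = totals[category]
            correct = corrects.get(category, 0)
            writer.writerow([dataset_name, category, total, correct,
                             correct / total if total else 0.0])
        # overall score across all categories of this dataset
        all_total = sum(totals.values())
        all_correct = sum(corrects.values())
        writer.writerow([dataset_name, 'ALL', all_total, all_correct,
                         all_correct / all_total if all_total else 0.0])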

So what you have to do:

abb4s commented 5 years ago

Hi, thank you for the instructions. I tried to implement the requirements, but I can't test it completely because we don't have a model yet. The result file is attached: scripts.zip