timoschick / form-context-model

This repository contains the code for the Form-Context Model and its Attentive Mimicking variant.
Apache License 2.0

Embd File #2

Closed Apoorva99 closed 5 years ago

Apoorva99 commented 5 years ago

Hello there, I was hoping to use your repo for one of my projects, where I need embeddings for rare words. I was trying to run your Attentive Mimicking code, but in the output I am getting embeddings for entire sentences instead of words. Is it supposed to be like this, or is there some issue? It would be really great if you could help me out on this. Cheers, Apoorva

timoschick commented 5 years ago

Hi Apoorva,

Attentive Mimicking can indeed be used to obtain embeddings for individual words. Given such a word, the algorithm assumes that you have several sentences in which it occurs. However, it also works for words for which you have no contexts at all.

You should be able to get embeddings for words from a trained model if you follow these two steps:

Step 1 Write all words for which you want to get embeddings into a single file (newline-separated). For each such word, also provide all contexts that you have available. For example, let's assume that you want to infer embeddings for the words apples and oranges, and you have two contexts for oranges (let's say, i like oranges and i bought two oranges) and no context for apples. Then your input file (let's call it input.txt) should look like this:

apples
oranges<TAB>i like oranges<TAB>i bought two oranges

Note that <TAB> should be replaced by an actual tab character.
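In case it helps, the format described above could be generated with a small script like this (the word-to-context mapping here is just made-up example data, not part of the repo):

```python
# Sketch: write an Attentive Mimicking input file from a dict mapping
# each target word to its (possibly empty) list of contexts.
contexts = {
    "apples": [],  # no contexts available for this word
    "oranges": ["i like oranges", "i bought two oranges"],
}

with open("input.txt", "w", encoding="utf-8") as f:
    for word, ctxs in contexts.items():
        # the word and its contexts are separated by real tab characters
        f.write("\t".join([word] + ctxs) + "\n")
```

Each word goes on its own line; a word with no contexts is simply a line containing only that word.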

Step 2 The actual inference can then be done using the fcm/infer_vectors.py script:

python3 fcm/infer_vectors.py -m MODEL_PATH -i input.txt -o output.txt

Afterwards, the file output.txt contains embeddings for apples and oranges; its content should look like this:

apples 0.12345 0.23456 -0.12345 ...
oranges 0.23554 -0.12345 0.34343 ...
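Assuming the output format shown above (one word per line, followed by its space-separated vector components), the embeddings could be read back in with something like this:

```python
# Sketch: load the output file into a dict mapping each word to its
# embedding as a list of floats.
def load_embeddings(path):
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue  # skip empty lines
            word, vector = parts[0], [float(x) for x in parts[1:]]
            embeddings[word] = vector
    return embeddings
```

This assumes the word itself contains no spaces, which holds for newline-separated single words as in Step 1.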

Best regards, Timo