Hello!
First of all, thank you for your work, and especially for the Attentive Mimicking mechanism; it gives good results in my research. Unfortunately, I can't reproduce the results from the paper. I just want to get a better embedding for the word 'unicycle' with the simple command you advised:
python3 ota.py --word unicycle --output_file inference_ota_embeds.txt --model_cls bert --model bert-base-uncased --iterations 4000
But the OTA embedding seems to have nothing in common with the embeddings for the words 'unicycle' and 'bicycle' from the original BERT model; the cosine similarity score is under 0.1. From the logs: the loss decreases to a negligibly small value, and the cosine is always -1.
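In case it helps, this is roughly how I compare the embeddings. Note that the output-file format (one line with the word followed by its vector values) is my assumption about what ota.py writes, and for words split into several wordpieces I simply average their input embeddings:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
input_embeddings = model.get_input_embeddings().weight  # (vocab_size, 768)

# Assumed format: "word v1 v2 ... v768" on one line of the output file.
with open("inference_ota_embeds.txt") as f:
    word, *values = f.readline().split()
    ota_vec = torch.tensor([float(v) for v in values])

for ref_word in ["unicycle", "bicycle"]:
    # Average the wordpiece input embeddings if the word is split into several tokens.
    ids = tokenizer.encode(ref_word, add_special_tokens=False)
    ref_vec = input_embeddings[ids].mean(dim=0)
    sim = torch.nn.functional.cosine_similarity(ota_vec, ref_vec, dim=0)
    print(f"cosine(OTA '{word}', '{ref_word}') = {sim.item():.4f}")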
Could you please help me? Maybe I am doing something incorrectly.
Thank you!