Closed by alexanderpanchenko 6 years ago
File with all the scores:
https://docs.google.com/spreadsheets/d/1uj7WNrBeavbFfOF487k5cUB51gUx8RDLvNZsHxSPDCE/edit?usp=sharing
I have entered the scores according to the new idea in a sheet named "new nodes" in the same file.
Nice! We got some improvements in terms of the F-score. Could you also compute it for the environment domain, please?
Yes, I'll do that. I also have some cleaning up to do in the new code. After that's done, I'll send a pull request.
Great! Thanks
While sending the pull request, should I keep the code for both the clustering and the new-nodes versions in separate files, or should I remove the clustering code and keep only the new-nodes version? If we remove the clustering code, it can always be recovered from the previous commit.
And could you also upload the embeddings to the server so that I can include the download link in the README?
The embeddings are in the following folder on the ltgpu1 server:
/home/5aly/taxi/distributed_semantics/embeddings
Please update using this URL:
http://ltdata1.informatik.uni-hamburg.de/taxi/
Embeddings:
- embeddings_poincare_wordnet
- own_embeddings_w2v
- own_embeddings_w2v.trainables.syn1neg.npy
- own_embeddings_w2v.wv.vectors.npy
Build the taxonomy as usual, based on a lookup in the resources (databases of hypernyms) and substring matching. This is the input to the algorithm.
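The substring-matching part of this seeding step can be sketched as follows. This is a toy illustration, not the actual TAXI code: the term list and the function name are made up, and the heuristic here only checks whether the candidate hypernym is the trailing word of the candidate hyponym.

```python
# Toy sketch of the substring-matching heuristic used to seed the taxonomy:
# if a candidate hyponym ends with its candidate hypernym as a separate word
# (e.g. "environmental science" -> "science"), propose a hypernymy edge.

def substring_hypernym_edges(terms):
    """Return (hyponym, hypernym) pairs where the hypernym is the
    trailing word of the hyponym."""
    edges = []
    for hypo in terms:
        for hyper in terms:
            if hypo != hyper and hypo.endswith(" " + hyper):
                edges.append((hypo, hyper))
    return edges

terms = ["science", "environmental science", "computer science",
         "pollution", "air pollution"]
print(substring_hypernym_edges(terms))
# -> [('environmental science', 'science'), ('computer science', 'science'),
#     ('air pollution', 'pollution')]
```

In the real pipeline these candidate edges are combined with the edges found by lookup in the hypernym databases before any refinement happens.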
Try to improve the taxonomy using the gensim library with the fastText.cc vectors for English, trained on Wikipedia and/or Common Crawl. Pre-trained word vectors learned on different sources can be downloaded here: https://fasttext.cc/docs/en/english-vectors.html

- wiki-news-300d-1M.vec.zip: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
- wiki-news-300d-1M-subword.vec.zip: 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
- crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens).

gensim FastText documentation:
https://radimrehurek.com/gensim/models/fasttext.html
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.load_fasttext_format
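The .vec files above use the plain-text word2vec format: a header line with vocabulary size and dimensionality, then one word per line followed by its vector. In practice one would load them with gensim, e.g. `gensim.models.KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")`. The dependency-free sketch below, with a tiny made-up embedding file inlined, just illustrates the format and a cosine-similarity lookup:

```python
import io
import math

# Tiny made-up embedding file in word2vec text format:
# header "<vocab_size> <dim>", then "<word> <v1> ... <vdim>" per line.
SAMPLE_VEC = """3 4
science 0.1 0.2 0.3 0.4
pollution 0.4 0.3 0.2 0.1
emission 0.4 0.2 0.2 0.1
"""

def load_vec(stream):
    """Parse a word2vec-format text stream into a {word: vector} dict."""
    n_words, dim = map(int, stream.readline().split())
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    assert len(vectors) == n_words
    assert all(len(v) == dim for v in vectors.values())
    return vectors

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

vecs = load_vec(io.StringIO(SAMPLE_VEC))
print(round(cosine(vecs["pollution"], vecs["emission"]), 3))  # -> 0.986
```

Note that `load_fasttext_format` (linked above) is for the binary .bin model, which also carries the subword n-gram information; the .vec text files only contain the final word vectors.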
Removing children which are not a good fit with the other children.
Place the removed 'orphan' children in a more appropriate place in the taxonomy (a more appropriate 'family').
Use https://github.com/nlpub/chinese-whispers-python for the clustering.
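The two improvement steps above can be sketched with toy vectors. Everything here is an illustrative assumption, not the project's actual code: the embeddings, the 0.7 threshold, and the helper names are made up, and in the real pipeline the similarities would come from the fastText/Poincaré embeddings, with Chinese Whispers providing the clusters.

```python
import math

# Toy 2-d embeddings; the real pipeline uses fastText / Poincare vectors.
EMB = {
    "dog": [1.0, 0.1], "cat": [0.9, 0.2], "wolf": [0.95, 0.15],
    "tulip": [0.1, 1.0], "rose": [0.2, 0.9],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def detach_outliers(children, threshold=0.7):
    """Step 1: detach children whose mean similarity to their siblings
    falls below the (illustrative) threshold."""
    kept, orphans = [], []
    for child in children:
        sims = [cosine(EMB[child], EMB[s]) for s in children if s != child]
        (kept if sum(sims) / len(sims) >= threshold else orphans).append(child)
    return kept, orphans

def reattach(orphan, parents):
    """Step 2: move an orphan under the parent whose children fit it best."""
    def fit(parent):
        sims = [cosine(EMB[orphan], EMB[s]) for s in parents[parent]]
        return sum(sims) / len(sims)
    return max(parents, key=fit)

kept, orphans = detach_outliers(["dog", "cat", "wolf", "tulip"])
print(kept, orphans)  # "tulip" does not fit its siblings and is detached
parents = {"canine": ["dog", "wolf"], "flower": ["rose"]}
print({o: reattach(o, parents) for o in orphans})  # tulip moves under "flower"
```

A design note: the outlier test compares each child only against its siblings, so a single bad child can drag down everyone's mean when families are small; the threshold would need tuning per dataset.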