algorithm for improving a noisy taxonomy with distributional semantics - Githubissues

shan18 / taxi

Taxonomy refinement method to improve domain-specific taxonomy systems.

Apache License 2.0

algorithm for improving a noisy taxonomy with distributional semantics #8

Closed alexanderpanchenko closed 6 years ago

alexanderpanchenko commented 6 years ago

Build the taxonomy as usual, based on a lookup from the resources (databases of hypernyms) and substring matching. This is the input to the algorithm.
Try to improve the taxonomy by:

loading word embeddings with the gensim library, using the fastText.cc vectors for English trained on Wikipedia and/or Common Crawl.

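import gensim

# Path to the fastText vectors in word2vec text format (e.g. an unzipped .vec file)
w2v_fpath = "fasttext.model.txt"
# The .vec/.txt files are plain text, so binary must be False
w2v = gensim.models.KeyedVectors.load_word2vec_format(w2v_fpath, binary=False, unicode_errors='ignore')
# Optional: L2-normalize vectors in place to save memory (deprecated and unnecessary in gensim 4+)
w2v.init_sims(replace=True)
# Print the nearest neighbours of a query word with their cosine similarities
for word, score in w2v.most_similar("tree"):
    print(word, score)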

Pre-trained word vectors learned on different sources can be downloaded from: https://fasttext.cc/docs/en/english-vectors.html

wiki-news-300d-1M.vec.zip: 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
wiki-news-300d-1M-subword.vec.zip: 1 million word vectors trained with subword information on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).
crawl-300d-2M.vec.zip: 2 million word vectors trained on Common Crawl (600B tokens).

https://radimrehurek.com/gensim/models/fasttext.html

https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText.load_fasttext_format

removing children which are not a good fit to the other children.
placing the removed 'orphan' children in a more appropriate place in the taxonomy (a more appropriate 'family').
using https://github.com/nlpub/chinese-whispers-python
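
The two refinement steps above (detach children that do not fit their siblings, then re-attach the orphans to a better family) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this repository: the function names, the 0.5 threshold and the toy 2-d vectors are all made up, and it uses a plain average-similarity rule instead of Chinese Whispers clustering. With real data the vectors would come from the loaded w2v model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_sim(word, others, vectors):
    """Average cosine similarity between `word` and every other word in `others`."""
    others = [o for o in others if o != word]
    if not others:
        return 0.0
    return sum(cosine(vectors[word], vectors[o]) for o in others) / len(others)


def refine(taxonomy, vectors, threshold=0.5):
    """Detach poorly fitting children, then re-attach each one to its best family.

    `taxonomy` maps a parent term to its list of children.
    `threshold` is a hypothetical cut-off chosen for this toy example.
    """
    refined = {parent: list(children) for parent, children in taxonomy.items()}
    orphans = []
    # Step 1: a child whose average similarity to its siblings is low becomes an orphan.
    for parent, children in taxonomy.items():
        for child in children:
            if len(children) > 1 and mean_sim(child, children, vectors) < threshold:
                refined[parent].remove(child)
                orphans.append(child)
    # Step 2: attach every orphan to the parent whose remaining children are most similar.
    for orphan in orphans:
        best = max(refined, key=lambda p: mean_sim(orphan, refined[p], vectors))
        refined[best].append(orphan)
    return refined


# Toy 2-d vectors (assumption): trees point along the x axis, fish along the y axis.
vectors = {
    "oak": [1.0, 0.1], "pine": [0.9, 0.2], "birch": [0.95, 0.15],
    "salmon": [0.1, 1.0], "trout": [0.2, 0.9], "cod": [0.15, 0.95],
}
# A noisy input taxonomy with two misplaced children.
taxonomy = {"tree": ["oak", "pine", "salmon"], "fish": ["trout", "cod", "birch"]}

refined = refine(taxonomy, vectors)
print(refined)
```

In this toy input, 'salmon' and 'birch' are the misplaced children. A family with only two dissimilar children would lose both under this rule, so a real implementation would need a more careful criterion for small families.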

shan18 commented 6 years ago

File with all the scores:

https://docs.google.com/spreadsheets/d/1uj7WNrBeavbFfOF487k5cUB51gUx8RDLvNZsHxSPDCE/edit?usp=sharing

shan18 commented 6 years ago

I have entered the scores according to the new idea in a sheet named new nodes in the same file.

alexanderpanchenko commented 6 years ago

Nice! We got some improvements in terms of the F-score. Could you also compute the scores for the environment domain, please?
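
For readers unfamiliar with the metric, here is a minimal sketch of how an F-score over taxonomy edges can be computed. The thread does not say how the scores in the spreadsheet were obtained, so the edge-level comparison against a gold standard, the function name and the toy edges below are all assumptions for illustration.

```python
def edge_f_score(predicted, gold):
    """Precision, recall and F1 of predicted (hyponym, hypernym) edges against gold edges."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy gold standard and system output (hypothetical edges).
gold = {("oak", "tree"), ("pine", "tree"), ("trout", "fish"), ("cod", "fish")}
predicted = {("oak", "tree"), ("pine", "tree"), ("trout", "fish"), ("salmon", "tree")}

p, r, f = edge_f_score(predicted, gold)
print(p, r, f)
```

Three of the four predicted edges are in the gold standard, and three of the four gold edges are recovered, so precision, recall and F1 all come out at 0.75 here.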

shan18 commented 6 years ago

Yes, I'll do that. I also have some cleaning up to do in the new code. After that's done, I'll send a pull request.

alexanderpanchenko commented 6 years ago

Great! Thanks

shan18 commented 6 years ago

When sending the pull request, should I keep the code for both the clustering and the new nodes versions in separate files, or should I remove the clustering code and keep only the new nodes version? If we remove the clustering code, it can always be accessed by going back to the previous commit.

shan18 commented 6 years ago

And can you also upload the embeddings to the server so that I can include the download link in the README? The embeddings are in the following folder on the ltgpu1 server:
/home/5aly/taxi/distributed_semantics/embeddings

Embeddings:

embeddings_poincare_wordnet
own_embeddings_w2v
own_embeddings_w2v.trainables.syn1neg.npy
own_embeddings_w2v.wv.vectors.npy

alexanderpanchenko commented 6 years ago

please update using this URL

http://ltdata1.informatik.uni-hamburg.de/taxi/
