makcedward / nlp

:memo: This repository recorded my NLP journey.
https://makcedward.github.io/

Doc2Vec #2

Closed saravananpsg closed 5 years ago

saravananpsg commented 5 years ago

@makcedward I am trying to retrieve documents similar to a given document. Here is the code snippet:

    x_train_t = doc2vec_embs.encode(documents=x_train)
    x_test_t = doc2vec_embs.encode(documents=x_test)

    def similiar_docs(doc2vec_embs, test_sample):
        sims = doc2vec_embs.model.docvecs.most_similar([test_sample], topn=1)
        for s in sims:
            print(x_train[s[0]])

    test_sample = x_test_t[0]
    print(x_test[0])
    similiar_docs(doc2vec_embs, test_sample)

However, the retrieved docs aren't similar. Am I missing something here?

pblin commented 5 years ago

I encountered some problems running nlp-embeddings-document-doc2vec.ipynb:

1) SSLCertVerificationError:

    SSLCertVerificationError Traceback (most recent call last)
    /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
       1316 h.request(req.get_method(), req.selector, req.data, headers,
    -> 1317 encode_chunked=req.has_header('Transfer-encoding'))
       1318 except OSError as err: # timeout error

2) NameError:

    NameError Traceback (most recent call last)
    in
          1 doc2vec_embs = Doc2VecEmbeddings()
    ----> 2 x_train_tokens = doc2vec_embs.build_vocab(documents=x_train)
          3 doc2vec_embs.train(x_train_tokens)

    NameError: name 'x_train' is not defined

makcedward commented 5 years ago

> I encountered some problems running nlp-embeddings-document-doc2vec.ipynb
>
> 1. SSLCertVerificationError
>
>        SSLCertVerificationError Traceback (most recent call last)
>        /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
>           1316 h.request(req.get_method(), req.selector, req.data, headers,
>        -> 1317 encode_chunked=req.has_header('Transfer-encoding'))
>           1318 except OSError as err: # timeout error
>
> 2. NameError
>
>        NameError Traceback (most recent call last)
>        in
>              1 doc2vec_embs = Doc2VecEmbeddings()
>        ----> 2 x_train_tokens = doc2vec_embs.build_vocab(documents=x_train)
>              3 doc2vec_embs.train(x_train_tokens)
>
>        NameError: name 'x_train' is not defined

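In my notebook, x_train is defined by this split, so the NameError suggests that this cell did not run (or failed) before build_vocab was called:

    x_train, x_val, y_train, y_val = train_test_split(np.array(train_raw_df.data), train_raw_df.target, test_size=0.1)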
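To illustrate why the order of cells matters, here is a minimal sketch of a 90/10 split written with the standard library only. It is a stand-in for scikit-learn's `train_test_split`, not the notebook's code; the helper `split_train_val` and the toy `documents` and `labels` are hypothetical.

```python
import math
import random


def split_train_val(data, target, test_size=0.1, seed=0):
    """Shuffle indices and hold out a fraction of the samples for validation."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    n_val = math.ceil(len(data) * test_size)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    x_train = [data[i] for i in train_idx]
    x_val = [data[i] for i in val_idx]
    y_train = [target[i] for i in train_idx]
    y_val = [target[i] for i in val_idx]
    return x_train, x_val, y_train, y_val


# Hypothetical toy data standing in for train_raw_df.data and train_raw_df.target.
documents = [f"document {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

# x_train only exists after this line runs; using it earlier raises NameError.
x_train, x_val, y_train, y_val = split_train_val(documents, labels, test_size=0.1)
print(len(x_train), len(x_val))
```

Any later step that reads `x_train`, such as `build_vocab(documents=x_train)`, has to come after the split in execution order, not just in the notebook's layout.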

makcedward commented 5 years ago

> @makcedward I am trying to retrieve similar documents from the given document. Here is the code snippet:
>
>     x_train_t = doc2vec_embs.encode(documents=x_train)
>     x_test_t = doc2vec_embs.encode(documents=x_test)
>
>     def similiar_docs(doc2vec_embs, test_sample):
>         sims = doc2vec_embs.model.docvecs.most_similar([test_sample], topn=1)
>         for s in sims:
>             print(x_train[s[0]])
>
>     test_sample = x_test_t[0]
>     print(x_test[0])
>     similiar_docs(doc2vec_embs, test_sample)
>
> However, the retrieved docs aren't similar. Am I missing something here?

The similarity score depends on the training data and the features. Many people say that deep learning needs no feature engineering. That is true to some extent, but you still need to tell the neural network how to extract features. For example, you could add part-of-speech tags or character-level features.
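To check the retrieval step independently of the model, one option is to rank the training vectors by cosine similarity directly. The following is a minimal sketch in plain Python, assuming `encode` returns one fixed-length vector per document in the same order as `x_train`. The toy documents and vectors, and the helper names `cosine` and `most_similar_index`, are hypothetical and not part of the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_u * norm_v)


def most_similar_index(train_vectors, query_vector):
    """Return (index, score) of the training vector closest to the query."""
    scores = [cosine(vec, query_vector) for vec in train_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Hypothetical stand-ins for x_train and doc2vec_embs.encode(documents=x_train).
x_train = ["cats and dogs", "stock markets fell", "rain expected tomorrow"]
x_train_t = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 0.8]]

# Hypothetical encoded test document.
test_sample = [0.05, 0.9, 0.2]

index, score = most_similar_index(x_train_t, test_sample)
print(x_train[index], round(score, 3))
```

If this direct ranking returns sensible neighbors but `most_similar` does not, it may be worth checking that the tags stored in the model line up with positions in `x_train`, since `most_similar` returns (tag, similarity) pairs rather than list indices. If both give poor neighbors, the vectors themselves are the more likely cause, which points back to the training data and features.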