guenthermi / postgres-word2vec

utils to use word embedding models like word2vec vectors in a PostgreSQL database
MIT License

install Issue #12

Closed angelo337 closed 5 years ago

angelo337 commented 5 years ago

Hi there, I managed to install all dependencies and load your extension in a Docker container. However, when I reach the last step in the process (Statistics), I get this error:

SELECT create_statistics('google_vecs_norm', 'word', 'coarse_quantization_ivpq');
ERROR:  function get_vecs_name_ivpq_quantization() does not exist
LINE 1: SELECT get_vecs_name_ivpq_quantization()
               ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
QUERY:  SELECT get_vecs_name_ivpq_quantization()
CONTEXT:  PL/pgSQL function create_statistics(character varying,character varying,character varying) line 10 at EXECUTE

Could you please help me or point me to some sort of solution? Thanks very much.

guenthermi commented 5 years ago

Hi,

In order to create the statistics, the names of the ivpq index tables must be known. I think this is the problem here.

The order of steps in the README seems to be slightly wrong. Before the statistics can be created, you have to call the init function first, or execute CREATE EXTENSION freddy again if you used the same names for the index tables as described in the README.

SELECT init('google_vecs', 'google_vecs_norm', 'pq_quantization', 'pq_codebook', 'fine_quantization', 'coarse_quantization', 'residual_codebook', 'fine_quantization_ivpq', 'codebook_ivpq', 'coarse_quantization_ivpq')

I will change this in the README soon...

angelo337 commented 5 years ago

Guenthermi: thanks for your fast answer. Could you point me to some resources to follow once I have that working? I already use Gensim vectors: with the .bin file, I compare documents against that model. How should I do that with your implementation? With word2vec, I request vectors from the index, produce one large vector for each sentence, and classify that sentence with a Keras model or an SVM. However, I don't know the right path to follow in your implementation. Thanks, angelo

guenthermi commented 5 years ago

I don't think I understand what you actually want to do.

There are several ways to compare documents or text values consisting of several tokens by using word embeddings. One very simple method is to represent a larger text value by the centroid (average value) of the word embedding vectors of all terms occurring in it. This could be done with the insert_batch function of the extension, which calculates this vector and adds it to the index structures. However, if you want to do something more complex, you have to implement it yourself.
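To make the centroid idea concrete, here is a minimal sketch in plain Python. It is not the extension's insert_batch implementation; the function name `centroid`, the whitespace tokenization, and the toy 2-dimensional embeddings are all assumptions made for illustration.

```python
from typing import Dict, List, Optional


def centroid(text: str, embeddings: Dict[str, List[float]]) -> Optional[List[float]]:
    """Average the vectors of all tokens in `text` that have an embedding."""
    # Naive whitespace tokenization: an assumption, not what the extension does.
    vectors = [embeddings[tok] for tok in text.lower().split() if tok in embeddings]
    if not vectors:
        return None  # no known token, so no centroid can be computed
    dim = len(vectors[0])
    # Component-wise mean over all collected word vectors.
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy 2-dimensional embeddings, made up for illustration.
toy = {"red": [1.0, 0.0], "apple": [0.0, 1.0]}
print(centroid("red apple pie", toy))  # "pie" is unknown and is skipped
```

Tokens without an embedding are simply skipped here; whether that is the right policy depends on the vocabulary and the data.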

This extension mainly focuses on fast semantic search. If you want to do classification, you might want to use something else. However, you could use the kNN and kNN-Join functions as a kNN classifier if this makes sense in your case.
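As a rough illustration of what using nearest neighbours as a classifier means, the sketch below does a brute-force kNN vote in plain Python. It does not call the extension's kNN functions; `knn_classify`, the cosine similarity measure, and the labeled toy vectors are assumptions for the example.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_classify(query: Sequence[float],
                 labeled: List[Tuple[Sequence[float], str]],
                 k: int = 3) -> str:
    """Return the majority label among the k vectors most similar to `query`."""
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up labeled vectors, e.g. centroids of already classified sentences.
labeled = [
    ([1.0, 0.0], "fruit"),
    ([0.9, 0.1], "fruit"),
    ([0.0, 1.0], "vehicle"),
    ([0.1, 0.9], "vehicle"),
]
print(knn_classify([0.8, 0.2], labeled))
```

In a database setting, the sorting step would be replaced by an index-backed kNN query; only the majority vote would remain on the client side.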

angelo337 commented 5 years ago

Guenthermi: I am trying to find similar words in a corpus. I have already trained an embedding on Wikipedia in my language, and I am looking for similar words in that embedding. After that, I would like to build a search with Elasticsearch that mimics semantic search by launching a search for every word similar to the original one, up to a certain distance (90% similarity or so). At the moment that's my main idea. Thanks.
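The query expansion described above could be sketched roughly as follows. This is plain Python over an in-memory dictionary, not the extension's API and not an Elasticsearch client; `expand_query`, the 0.9 threshold, and the toy vectors are assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def expand_query(word: str,
                 embeddings: Dict[str, Sequence[float]],
                 min_similarity: float = 0.9) -> List[str]:
    """Return `word` followed by all words at least `min_similarity` similar to it."""
    if word not in embeddings:
        return [word]  # unknown word: nothing to expand with
    target = embeddings[word]
    scored = [(cosine(target, vec), other)
              for other, vec in embeddings.items() if other != word]
    # Most similar words first, keeping only those above the threshold.
    similar = [other for score, other in sorted(scored, reverse=True)
               if score >= min_similarity]
    return [word] + similar


# Made-up vectors standing in for a trained Wikipedia embedding.
toy = {"car": [1.0, 0.0], "auto": [0.98, 0.2], "banana": [0.0, 1.0]}
print(expand_query("car", toy))
```

Each returned term would then be sent to the search engine as its own query, or combined into a single query, as described in the comment.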