predict-idlab / pyRDF2Vec

🐍 Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License
246 stars 51 forks

Generate embeddings of new points #57

Closed rhoi2021 closed 2 years ago

rhoi2021 commented 2 years ago

Hello, I have two questions: 1- When we use .fit to generate the embeddings, how can we save the model and use it to create the embeddings of new entities of the KG? 2- When I used this library before, I did not get any error for label_predicates, but now it returns an unexpected keyword argument: kg = KG("dataset.xml", label_predicates=[rdflib.URIRef(x) for x in label_predicates])

rememberYou commented 2 years ago

Good evening,

For your first question, you have to use the save method of the RDF2VecTransformer class. This method allows you to serialize an RDF2VecTransformer object as well as the attributes it has (e.g., embedder and walkers).
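Under the hood, this kind of save/load is plain object serialization. A minimal sketch of the pattern, using a hypothetical stand-in class rather than the real RDF2VecTransformer:

```python
import pickle


class Transformer:
    """Hypothetical stand-in for RDF2VecTransformer, for illustration only."""

    def __init__(self, embedder, walkers):
        self.embedder = embedder
        self.walkers = walkers

    def save(self, path="transformer_data"):
        # Serialize the whole object, including its attributes
        # (e.g., embedder and walkers).
        with open(path, "wb") as f:
            pickle.dump(self, f)

    @staticmethod
    def load(path="transformer_data"):
        # Restore the object with all of its attributes.
        with open(path, "rb") as f:
            return pickle.load(f)
```

With this pattern, Transformer("word2vec", ["random"]).save("model") followed by Transformer.load("model") gives you back an object whose embedder and walkers are intact, which is the behavior the save method of RDF2VecTransformer provides.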

For your second question, you should use a newer version of the library. pyRDF2Vec 0.2.3 has replaced the label_predicates parameter with skip_predicates and no longer requires wrapping predicates with rdflib.URIRef.

You can find all this information in the README, along with example use cases.

rhoi2021 commented 2 years ago

I'm lost. What is the function to predict the embeddings of new points? I used fit_transform to train and then saved the model, and I thought the transform function was for predicting the embedding of a new entity. I get errors! Am I doing something wrong?

rememberYou commented 2 years ago

The fit_transform function trains the model and generates embeddings for the entities passed for training and testing. It avoids calling the fit and transform functions consecutively.

To generate these embeddings, pyRDF2Vec uses:

- a walking strategy, to extract walks from the KG for the provided entities;
- a sampling strategy, to prioritize certain edges during these walks.

The combination of these two strategies extracts walks, which are then taken as input by an embedding technique (e.g., Word2Vec) to generate the embeddings of the provided training and testing entities. Once these embeddings are generated, you then pass them to a model for the ML problem you want to solve (e.g., an SVM for classification).

The generated embeddings (with or without literals) allow you to enrich the data provided so that your ML model can make better predictions. Don't hesitate to have a look at the diagram in the README on how pyRDF2Vec works. This will give you a visual of how all these components communicate with each other.
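As a toy illustration of the "embeddings as features" step (plain Python, no pyRDF2Vec; in a real pipeline the vectors would come from fit_transform and the classifier would be, say, scikit-learn's SVC):

```python
# Toy: classify entities by their (pretend) RDF2Vec embeddings using a
# nearest-centroid rule. The vectors and labels here are made up.
def centroid(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def nearest_centroid_predict(x, centroids):
    # Return the label whose centroid is closest (squared Euclidean).
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(centroids, key=lambda label: dist2(x, centroids[label]))

# Pretend 2-d embeddings for two classes of entities.
train = {
    "city": [[0.9, 0.1], [0.8, 0.2]],
    "person": [[0.1, 0.9], [0.2, 0.8]],
}
centroids = {label: centroid(vecs) for label, vecs in train.items()}
print(nearest_centroid_predict([0.85, 0.15], centroids))  # -> city
```

The point is only that the embeddings are the feature vectors; any downstream classifier can consume them.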

bsteenwi commented 2 years ago

To comment on this: RDF2Vec is transductive, which means that all embeddings (both train and test) are generated in one batch (unsupervised, of course).

In other words: when new entities arrive (new nodes from which you would like to generate the embedding), RDF2Vec requires you to regenerate all embeddings (all available nodes + new nodes).
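The consequence for unseen entities can be sketched with a toy vocabulary-based model (plain Python, not the actual pyRDF2Vec internals): the vocabulary is fixed at fit time, so anything not seen then cannot be transformed later.

```python
# Toy model with a vocabulary fixed at fit time, mimicking why a
# transductive embedder rejects entities it has never seen.
class ToyEmbedder:
    def __init__(self):
        self.wv = {}  # entity -> vector, filled during fit

    def fit(self, entities):
        for i, entity in enumerate(entities):
            self.wv[entity] = [float(i)]  # placeholder "embedding"
        return self

    def transform(self, entities):
        # Same kind of check as in the ValueError discussed below.
        if not all(entity in self.wv for entity in entities):
            raise ValueError("entities must be provided to fit() first")
        return [self.wv[entity] for entity in entities]

model = ToyEmbedder().fit(["Belgium", "France"])
print(model.transform(["Belgium"]))  # known entity: fine
try:
    model.transform(["Hallstatt_culture"])  # unseen: requires refitting
except ValueError as err:
    print("refit needed:", err)
```

Regenerating all embeddings corresponds to calling fit again with both the old and the new entities.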

We, in pyRDF2Vec, found this to be a limiting factor and therefore started to explore online-learning techniques to avoid this retraining phase. You can find an example here.

rhoi2021 commented 2 years ago

I am doing the same but when I load the model and use fit_transform for a new point, it throws the following error: ValueError: The entities must have been provided to fit() first before they can be transformed into a numerical vector.

rememberYou commented 2 years ago

@rhoi2021 Could you paste the buggy code snippet here or via gist?

rhoi2021 commented 2 years ago
db_transformer = RDF2VecTransformer(
    walkers=[RandomWalker(8, 20, with_reverse=True, n_jobs=1)],
    embedder=Word2Vec(size=100),
).load("model")

db_transformer.fit_transform(
    db_kg,
    input_entity,
    is_update=True,
)

WARNING - under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-21-35fbed516445> in <module>()
      5     db_kg,
      6     input_entity,
----> 7     is_update=True,
      8 )
      9 

2 frames

/usr/local/lib/python3.7/dist-packages/pyrdf2vec/embedders/word2vec.py in transform(self, entities)
     69         if not all([entity in self._model.wv for entity in entities]):
     70             raise ValueError(
---> 71                 "The entities must have been provided to fit() first "
     72                 "before they can be transformed into a numerical vector."
     73             )

ValueError: The entities must have been provided to fit() first before they can be transformed into a numerical vector.

rememberYou commented 2 years ago

Could you paste your entire code, including how you train your model with RDF2VecTransformer and what value is assigned to the input_entity variable?

rhoi2021 commented 2 years ago

train the model:

db_transformer = RDF2VecTransformer(
    walkers=[RandomWalker(8, 20, with_reverse=True, n_jobs=1)],
    embedder=Word2Vec(size=100),
)

db_walk_embeddings, _ = db_transformer.fit_transform(
    db_kg,
    db_entities,
)

db_transformer.save("model")

rhoi2021 commented 2 years ago

New point: input_entity = ['http://dbpedia.org/resource/Hallstatt_culture']

db_entities includes some entities from DBpedia

rhoi2021 commented 2 years ago

I load the data with pandas. I guess we can use remote KGs, db_kg = KG("https://dbpedia.org/sparql", is_remote=True), am I wrong? I followed this repository https://github.com/IBCNServices/pyRDF2Vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint to set up my own local DBpedia SPARQL endpoint.

rememberYou commented 2 years ago

I removed my comment to avoid confusion, but yes you can add an entity directly by providing a list (without going through a file) as you just did. You don't need to use is_remote=True with the new version of pyRDF2Vec. Unfortunately, without your entire code, I can't test and debug it. What I can say is that I don't see an immediate problem in your code.

bsteenwi commented 2 years ago

import os

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

RANDOM_STATE = 22
if __name__ == '__main__':
    db_kg = KG("https://dbpedia.org/sparql", skip_verify=True)
    print(db_kg)

    db_transformer = RDF2VecTransformer(
        walkers=[RandomWalker(1, None, n_jobs=2, random_state=RANDOM_STATE)],
        embedder=Word2Vec(size=100),
        verbose=1,
    )
    db_entities = ['http://dbpedia.org/resource/Belgium', 'http://dbpedia.org/resource/France']
    db_walk_embeddings, _ = db_transformer.fit_transform(
        db_kg,
        db_entities,
    )

    db_transformer.save("model")

    db_transformer = RDF2VecTransformer(
        walkers=[RandomWalker(1, None, n_jobs=2, random_state=RANDOM_STATE)],
        embedder=Word2Vec(size=100),
    ).load("model")

    input_entity = ['http://dbpedia.org/resource/Hallstatt_culture']
    db_transformer.fit_transform(
        db_kg,
        input_entity,
        is_update=True,
    )

This works on my system without errors (pyrdf2vec version 0.2.3) (depth is 1 to speed up this test)

rememberYou commented 2 years ago

It works for me too. I only had to replace the size attribute with vector_size to be compliant with gensim 4.1.2.

rhoi2021 commented 2 years ago

Many thanks for your replies :) I am not sure, but I guess it is because of with_reverse. Once I specify the model with with_reverse=True, it does not work (at least for me), but with the default (False) it works well.

bsteenwi commented 2 years ago

Hmm ok fair point, seems like a bug indeed...

I've investigated this problem a little more in depth together with @GillesVandewiele, and we concluded the following. Our walk hashing procedure within pyRDF2Vec is as follows:

vertex.name if i == 0 or i % 2 == 1 or self.md5_bytes is None else hash...

More generally, we do not hash the first node within a walk, as this node is in most cases the root node (the one for which you want to generate an embedding).

We check whether the root node is within the wv vocab during the transform function:

if not all([entity in self._model.wv for entity in entities]):
    raise ValueError(
        "The entities must have been provided to fit() first "
        "before they can be transformed into a numerical vector."
    )

But with the with_reverse option set to true, the walks are constructed differently:

if self.with_reverse:
    return [
        r_walk[:-1] + walk
        for walk in fct_search(kg, entity)
        for r_walk in fct_search(kg, entity, is_reverse=True)
    ]

Here you can see that the reverse paths are prefixed to the normal paths, resulting in a node at index 0 different from the root node. The root node will now be hashed somewhere within this path, and cannot be found in the wv vocab in its original form.
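The mismatch can be illustrated in plain Python with md5 hashing (hypothetical node names; the exact digest truncation in pyRDF2Vec may differ):

```python
import hashlib

def md5_token(name, nbytes=8):
    # Sketch of hashing a node label, loosely like pyRDF2Vec's md5_bytes.
    return hashlib.md5(name.encode()).digest()[:nbytes].hex()

def hash_walk(walk):
    # The rule quoted above: keep index 0 and odd indices (the predicates)
    # as-is, hash every other hop.
    return [
        v if i == 0 or i % 2 == 1 else md5_token(v)
        for i, v in enumerate(walk)
    ]

root = "http://dbpedia.org/resource/Hallstatt_culture"

# Forward-only walk: the root sits at index 0 and survives unhashed.
forward = hash_walk([root, "p1", "obj1"])
print(forward[0] == root)  # True

# with_reverse=True prefixes a reverse walk (minus its last hop), so the
# root shifts to index 2 and is hashed away.
combined = hash_walk(["subj0", "p0", root, "p1", "obj1"])
print(root in combined)  # False
```

After prefixing, the only unhashed entity at index 0 is the start of the reverse walk, so the root node's original form never enters the Word2Vec vocabulary.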

As @GillesVandewiele suggested, we might better hash all subjects and predicates within pyRDF2Vec and check whether the hashed root nodes are available in the wv vocab.

GillesVandewiele commented 2 years ago

Should be fixed with latest push to master!