Open ChrisDelClea opened 2 years ago

🚀 Feature

Hi guys, first of all, thanks for this amazing package, I love it. While doing some stuff with it, I asked myself what the best walking/sampling strategy would be. To evaluate that, two things would be missing:

* A `GridSearch` kind of class to run all of the different options.

Solution

A `GridSearch` class à la sklearn would be awesome, is that possible?
Hi Chris,
Thank you for your suggestion, I agree that it would be a nice addition! However, it might be really difficult/expensive to run this, especially given that some of these strategies have their own hyper-parameters as well. The hyper-parameters of the embedding techniques (e.g. Word2Vec) might change per strategy too (and even the hyper-parameters of the downstream ML model), so the number of possible combinations of walking + sampling strategies quickly grows.
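In the meantime, a manual loop gets you part of the way. Below is a minimal sketch, assuming the `RDF2VecTransformer`/`Word2Vec`/`RandomWalker` API from the README and placeholder data you would replace with your own; signatures can differ per version, so treat it as an outline rather than the library's API.

```python
# Minimal manual "grid search" sketch over walker hyper-parameters.
# Assumptions: the pyRDF2Vec README API, and that fit_transform returns
# (embeddings, literals) as in recent versions. Replace the placeholders.
from itertools import product

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

kg = KG("https://dbpedia.org/sparql")  # your knowledge graph
entities = [...]  # placeholder: URIs of your labelled entities
labels = [...]    # placeholder: downstream labels, one per entity

best_score, best_cfg = -1.0, None
for max_depth, max_walks in product([2, 4], [50, 100]):
    transformer = RDF2VecTransformer(
        Word2Vec(epochs=10),
        walkers=[RandomWalker(max_depth, max_walks)],
    )
    embeddings, _ = transformer.fit_transform(kg, entities)
    # Score each configuration on the downstream task it is meant to serve.
    score = cross_val_score(SVC(), embeddings, labels, cv=5).mean()
    if score > best_score:
        best_score, best_cfg = score, (max_depth, max_walks)

print(f"best (max_depth, max_walks): {best_cfg}, CV accuracy {best_score:.3f}")
```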
I see, thanks for that.
I have a few other questions/remarks/wishes:
* Could you also implement or integrate other non-Word2Vec-based KG embedding algorithms in this framework? I know this request might sound strange at first, but I could not find any framework that takes an RDF graph as input to generate embeddings for e.g. RotatE, and I love your API.
Yes, we tried to support that through the `Embedder` interface. You can implement your own embedding model there (see e.g. the FastText implementation we provide).
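For illustration, here is a hedged sketch of what plugging in your own model can look like. The `fit`/`transform` method names and the walk-corpus nesting mirror the built-in Word2Vec embedder as I understand it, but they are assumptions to verify against the `Embedder` ABC in your installed version; the random-vector "model" is only a stand-in for the real training (e.g. RotatE) you would do there.

```python
# Hedged sketch: a custom embedder plugged into pyRDF2Vec's Embedder interface.
# Method names and corpus structure are assumptions; check your version.
import numpy as np

from pyrdf2vec.embedders import Embedder


class ToyEmbedder(Embedder):
    """Stand-in model: assigns each token a fixed random vector.

    A real implementation would train an actual model (RotatE, TransE, ...)
    in fit() and look up its learned vectors in transform().
    """

    def __init__(self, size: int = 50):
        self.size = size
        self._vectors = {}

    def fit(self, walks, is_update=False):
        rng = np.random.default_rng(42)
        # Assumed nesting: a list (per entity) of walks, each walk being a
        # sequence of string tokens.
        for entity_walks in walks:
            for walk in entity_walks:
                for token in walk:
                    self._vectors.setdefault(token, rng.normal(size=self.size))
        return self

    def transform(self, entities):
        return [self._vectors[entity] for entity in entities]
```

It would then be passed to the transformer in place of Word2Vec, e.g. `RDF2VecTransformer(ToyEmbedder(), walkers=[...])`.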
* Also, I wrote a function to get the k most similar entities out of the graph given an input entity. Would it be nice to have something built in? `transformer.most_similar(e1) => e2, e3, e4`
I agree that this would indeed be nice! However, how would one define `most_similar`? Cosine distance comes to mind as a metric, but perhaps other people would want other metrics, so this should be made configurable. Nevertheless, it is an interesting suggestion!
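Something like the following user-side helper shows how the metric could stay configurable (nothing of the sort exists in pyRDF2Vec today; `most_similar` and its parameters are made-up names):

```python
# User-side sketch of a configurable most_similar; not part of pyRDF2Vec.
import numpy as np
from scipy.spatial.distance import cdist


def most_similar(query_vec, entity_vecs, entity_names, k=3, metric="cosine"):
    """Return the k entities whose vectors lie closest to query_vec.

    `metric` is forwarded to scipy's cdist, so "cosine", "euclidean" and
    many others work without extra code.
    """
    dists = cdist([query_vec], np.asarray(entity_vecs), metric=metric)[0]
    order = np.argsort(dists)[:k]
    return [(entity_names[i], float(dists[i])) for i in order]


# e.g. most_similar(embeddings[0], embeddings, entities, k=3)
```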
* Also, automatic plotting of the embeddings with TensorBoard or something similar out of the box would be really great.
Do you mean plotting during the training procedure? That should be possible with gensim models indeed.
* In the blog post you wrote that numerical data is an issue; do you have any suggestions for how to fix it? I am asking since I am trying to embed a KG where I have a lot of numerical data, e.g. age, zip code, etc.
We don't have an optimal fix yet; however, you can specify the paths along which to extract this numerical information from the KG. You could then concatenate this numerical data with your embeddings before fitting a downstream ML model on it (or take these numbers into account for your similarity search as well).
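Concretely, declaring such paths looks roughly like the snippet below; the predicate URIs are made-up stand-ins, and the `literals` keyword follows the current README, so double-check it against your installed version.

```python
# Hedged sketch: declaring literal paths on the KG so numerical values are
# extracted alongside the walks. The predicate URIs are hypothetical.
from pyrdf2vec.graphs import KG

kg = KG(
    "https://dbpedia.org/sparql",
    literals=[
        # Each entry is a path of predicates leading to the literal value.
        ["http://example.org/ontology/age"],
        ["http://example.org/ontology/zipCode"],
    ],
)
```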
* Another question regarding your last point: `RDF2Vec cannot deal with volatile data`. In machine learning we normally want to embed an unseen data point in the vector space and find the most similar entities. So I was wondering what would happen if I call `walk_embeddings = transformer.transform(unseen_entity)`?
A mechanism for updating the model has been implemented: if you set `is_update=True` in the `fit` method, it should update the model without re-training it entirely from scratch. Nevertheless, RDF2Vec is inherently a transductive (i.e. non-inductive) technique, and these mechanisms are far from an optimal solution.
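A rough sketch of that flow, with the caveat that whether `is_update` is accepted by the transformer's `fit` (rather than only by the embedder's) may depend on the version you have installed:

```python
# Hedged sketch of the online-update flow; signatures assumed, check your
# installed pyRDF2Vec version.
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

kg = KG("https://dbpedia.org/sparql")
transformer = RDF2VecTransformer(Word2Vec(), walkers=[RandomWalker(4, 10)])

seen_entities = [...]    # placeholder: entities known at training time
unseen_entities = [...]  # placeholder: entities arriving later

transformer.fit(kg, seen_entities)                    # initial training
transformer.fit(kg, unseen_entities, is_update=True)  # incremental update
walk_embeddings = transformer.transform(kg, unseen_entities)
```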
* The blog post was super helpful for understanding the package. Just some things do not seem up to date anymore, e.g. the `skip_predicates` vs. the `scip_predicates` naming. Also, I was wondering if I have to list all `literals` as parameters?
No, these are optional, but they might be useful to extract numerical information (per your comment above).
Best regards, Chris
Yes, Cosine or Euclidean would be great distance measures to start with.
In other frameworks they directly offer a method (e.g. `embeddings.to_tensorboard()`) that exports the embeddings and related metadata in TensorBoard format. A similar function would be helpful here too, of course!
I still don't fully understand what the `literals` parameter is helpful for, or when to use it. Could you explain a bit more about it?
Could you provide a full example of how to use `is_update=True`? I wonder what input I have to pass: all of the data, or only the novel data?
The feature requests are noted. I'll take a look if I ever find bandwidth :) Any PRs are more than welcome as well, of course.
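As a stop-gap for the TensorBoard request, embeddings can already be dumped to the two TSV files that TensorBoard's Embedding Projector (https://projector.tensorflow.org) accepts; this is plain user code, not a pyRDF2Vec API:

```python
# User-side stop-gap: export embeddings for TensorBoard's Embedding Projector.
import csv


def export_for_projector(embeddings, entity_names, prefix="rdf2vec"):
    """Write <prefix>_vectors.tsv / <prefix>_metadata.tsv for the Projector."""
    with open(f"{prefix}_vectors.tsv", "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(embeddings)
    with open(f"{prefix}_metadata.tsv", "w", newline="") as f:
        f.writelines(name + "\n" for name in entity_names)
```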
Literals are needed to use numerical information in your ML model. Word2Vec internally uses a BoW representation for its tokens; `9` and `10` are, in other words, COMPLETELY different tokens, and the numerical relation between them (the fact that they are actually quite similar) is lost. As such, you can get the embeddings and literals together, concatenate them, and feed them to your ML model. Example here.
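The gist of that concatenation step, as a hedged sketch reusing `transformer`, `kg` and `entities` from the snippets above (`labels` is a placeholder, and non-numeric literals would need encoding first):

```python
# Hedged sketch of "get embeddings and literals, concatenate, feed to a model".
# Assumes fit_transform returns (embeddings, literals) as in the README.
import numpy as np
from sklearn.svm import SVC

embeddings, literals = transformer.fit_transform(kg, entities)
X = np.hstack((
    np.asarray(embeddings),
    np.asarray(literals, dtype=float),  # assumes purely numerical literals
))
clf = SVC().fit(X, labels)  # `labels` = your downstream targets
```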