uma-pi1 / kge

LibKGE - A knowledge graph embedding library for reproducible research
MIT License

How to stack a user-defined embedder on the existing decoder models #233

Closed jwzhi closed 2 years ago

jwzhi commented 2 years ago

Hi, thanks again for solving the issue of adding embedder classes. I am trying to extend LibKGE by adding new embedders and using them with the existing decoder-only models (TransE, DistMult, etc.). One issue I am running into is that I do not know how to stack my newly defined embedder on a decoder model. For example, if I want to use gcn (embedder) + distmult (decoder), what is the correct way to write the config? In distmult.yaml, I changed the entity embedder as follows:


distmult:
  class_name: DistMult
  entity_embedder:
    type: gcn_embedder
    +++: +++
  relation_embedder:
    type: lookup_embedder
    +++: +++ 

But this raises the following error:

[b26706d1] Failed to create model distmult (class DistMult).
[b26706d1] Failed to create model reciprocal_relations_model (class ReciprocalRelationsModel).
Traceback (most recent call last):
  File "/datadrive/data/KG-GNN/kge/kge/config.py", line 56, in get
    result = result[name]
KeyError: 'type'

My embedder is defined as follows. In gcn_embedder.yaml:

import: [lookup_embedder]

gcn_embedder:
  class_name: GCNEmbedder
  base_embedder:
    type: lookup_embedder
    +++: +++                  # -1 means: same as base_embedder
  initialize: normal_          # xavier, uniform, normal
  initialize_args:
    +++: +++
  dropout: 0.                 # dropout used for embeddings
  regularize: 'lp'              # '', 'lp'
  regularize_weight: 0.0
  regularize_args:
    p: 2.0
  # see dgl.graphconv
  in_dim: 256                    # the input dimension
  dim: 256        # the dimension of the gcn layers
  dropout: 0.3
  activation: relu             # relu or gelu

  num_layers: 3                # the number of sub-encoder-layers in the encoder

In gcn_embedder.py,

import torch.nn.functional
from kge.model import KgeEmbedder

from dgl.nn.pytorch import GraphConv
import pdb

class GCNEmbedder(KgeEmbedder):
    """Adds a linear projection layer to a base embedder."""
    """ gcn embedder """

    def __init__(
        self, config, dataset, configuration_key, vocab_size, init_for_load_only=False
    ):
        super().__init__(
            config, dataset, configuration_key, init_for_load_only=init_for_load_only
        )

        # initialize base_embedder
        if self.configuration_key + ".base_embedder.type" not in config.options:
            config.set(
                self.configuration_key + ".base_embedder.type",
                self.get_option("base_embedder.type"),
            )
        self.base_embedder = KgeEmbedder.create(
            config, dataset, self.configuration_key + ".base_embedder", vocab_size
        )

        self.g = dataset.construct_graph().to('cuda:0')
        # self.g = dataset.construct_graph()
        self.emb_dim = self.get_option("entity_embedder.dim")
        dropout = self.get_option("encoder.dropout")
        if dropout < 0.0:
            if config.get("train.auto_correct"):
                config.log(
                    "Setting {}.encoder.dropout to 0., "
                    "was set to {}.".format(configuration_key, dropout)
                )
                dropout = 0.0
        self.encoder = torch.nn.ModuleList()
        self.num_layers = self.get_option("encoder.num_layers")
        self.in_dim = self.get_option("encoder.in_dim")
        self.dim = self.get_option("encoder.dim")
        self.encoder.append(GraphConv(self.in_dim, self.dim))
        for i in range(self.num_layers-1):
            self.encoder.append(GraphConv(self.dim, self.dim))
        #Dropout after compGCN layers
        self.dropouts = torch.nn.ModuleList()
        for i in range(self.num_layers):
            self.dropouts.append(
                torch.nn.Dropout(dropout)
            )   
        print("initialized GCN Embedder")  

    def _embed(self, embeddings, indexes):
        n_feats = embeddings
        for layer, dropout in zip(self.encoder, self.dropouts):
            n_feats = layer(self.g, n_feats)
            n_feats = dropout(n_feats)
        embeddings = n_feats[indexes]
        return embeddings

    def embed(self, indexes):
        return self._embed(self.base_embedder.embed_all(), indexes)

    def embed_all(self):
        return self._embed(self.base_embedder.embed_all(), indexes)

    def penalty(self, **kwargs):
        # TODO factor out to a utility method
        if self.regularize == "" or self.get_option("regularize_weight") == 0.0:
            result = []
        elif self.regularize == "lp":
            p = self.get_option("regularize_args.p")
            result = [
                (
                    f"{self.configuration_key}.L{p}_penalty",
                    self.get_option("regularize_weight")
                    * self.projection.weight.norm(p=p).sum(),
                )
            ]
        else:
            raise ValueError("unknown penalty")

        return super().penalty(**kwargs) + result + self.base_embedder.penalty(**kwargs)

Thanks in advance

rgemulla commented 2 years ago

Do you use the most recent version of LibKGE? If so, you should see a more informative error message, which should make it more clear what's going on:

https://github.com/uma-pi1/kge/blob/db908a99df5efe20f960dc3cf57eb57206c2f36c/kge/config.py#L55-L58

rgemulla commented 2 years ago

BTW: you do not need (and probably shouldn't) change distmult.yaml. Instead, set distmult.entity_embedder.type in your training config.
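
To make the suggestion concrete: a dotted key such as distmult.entity_embedder.type addresses a nested entry in the job configuration. The sketch below is plain Python and only illustrates that mapping. `set_dotted` is a hypothetical helper written for this illustration, not part of LibKGE, and gcn_embedder is the embedder name used in this thread.

```python
def set_dotted(options: dict, key: str, value) -> None:
    """Set options[a][b][c] = value for key "a.b.c", creating levels as needed."""
    *parents, leaf = key.split(".")
    node = options
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value


# Override kept in the training config; distmult.yaml itself stays unchanged.
job_options = {}
set_dotted(job_options, "distmult.entity_embedder.type", "gcn_embedder")
print(job_options)
```

The resulting nesting (distmult, then entity_embedder, then type) is the same structure that appears in the job configuration posted later in this thread.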

jwzhi commented 2 years ago

> Do you use the most recent version of LibKGE? If so, you should see a more informative error message, which should make it more clear what's going on:
>
> https://github.com/uma-pi1/kge/blob/db908a99df5efe20f960dc3cf57eb57206c2f36c/kge/config.py#L55-L58

Yes. It says:

  File "/datadrive/data/KG-GNN/kge/kge/config.py", line 58, in get
    raise KeyError(f"Error accessing {name} for key {key}")
KeyError: 'Error accessing type for key gcn_embedder.type'

But I am a bit confused. I think I should set the type for gcn_embedder in gcn_embedder.base_embedder.type, right? At least, that is how projection_embedder is written.

This is what I did in gcn_embedder.yaml

  class_name: GCNEmbedder
  base_embedder:
    type: lookup_embedder
    +++: +++

jwzhi commented 2 years ago

> BTW: you do not need (and probably shouldn't) change distmult.yaml. Instead, set distmult.entity_embedder.type in your training config.

OK, thanks for telling me that. I just changed it, but I still cannot make it work (sadly).

rgemulla commented 2 years ago

Well, I cannot see gcn_embedder.type in your gcn_embedder.yaml file. Can you provide the complete stack trace as well as the job configuration?

jwzhi commented 2 years ago

Yes. Here is the stack trace

[ea1d10df] Failed to create model distmult (class DistMult).
[ea1d10df] Failed to create model reciprocal_relations_model (class ReciprocalRelationsModel).
Traceback (most recent call last):
  File "/datadrive/data/KG-GNN/kge/kge/config.py", line 56, in get
    result = result[name]
KeyError: 'type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/datadrive/data/KG-GNN/kge/kge/config.py", line 93, in get_default
    parent_type = self.get(parent + "." + "type")
  File "/datadrive/data/KG-GNN/kge/kge/config.py", line 58, in get
    raise KeyError(f"Error accessing {name} for key {key}")
KeyError: 'Error accessing type for key gcn_embedder.type'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/azureuser/.pyenv/versions/kge/bin/kge", line 11, in <module>
    load_entry_point('libkge', 'console_scripts', 'kge')()
  File "/datadrive/data/KG-GNN/kge/kge/cli.py", line 279, in main
    job = Job.create(config, dataset)
  File "/datadrive/data/KG-GNN/kge/kge/job/job.py", line 83, in create
    config, dataset, parent_job=parent_job, model=model
  File "/datadrive/data/KG-GNN/kge/kge/job/train.py", line 136, in create
    forward_only=forward_only,
  File "/datadrive/data/KG-GNN/kge/kge/misc.py", line 38, in init_from
    return getattr(module, class_name)(*args, **kwargs)
  File "/datadrive/data/KG-GNN/kge/kge/job/train_negative_sampling.py", line 20, in __init__
    config, dataset, parent_job, model=model, forward_only=forward_only
  File "/datadrive/data/KG-GNN/kge/kge/job/train.py", line 71, in __init__
    self.model: KgeModel = KgeModel.create(config, dataset)
  File "/datadrive/data/KG-GNN/kge/kge/model/kge_model.py", line 497, in create
    init_for_load_only=init_for_load_only,
  File "/datadrive/data/KG-GNN/kge/kge/misc.py", line 38, in init_from
    return getattr(module, class_name)(*args, **kwargs)
  File "/datadrive/data/KG-GNN/kge/kge/model/reciprocal_relations_model.py", line 38, in __init__
    init_for_load_only=init_for_load_only,
  File "/datadrive/data/KG-GNN/kge/kge/model/kge_model.py", line 497, in create
    init_for_load_only=init_for_load_only,
  File "/datadrive/data/KG-GNN/kge/kge/misc.py", line 38, in init_from
    return getattr(module, class_name)(*args, **kwargs)
  File "/datadrive/data/KG-GNN/kge/kge/model/distmult.py", line 43, in __init__
    init_for_load_only=init_for_load_only,
  File "/datadrive/data/KG-GNN/kge/kge/model/kge_model.py", line 388, in __init__
    init_for_load_only=init_for_load_only,
  File "/datadrive/data/KG-GNN/kge/kge/model/kge_model.py", line 281, in create
    init_for_load_only=init_for_load_only,
  File "/datadrive/data/KG-GNN/kge/kge/misc.py", line 38, in init_from
    return getattr(module, class_name)(*args, **kwargs)
  File "/datadrive/data/KG-GNN/kge/kge/model/embedder/gcn_embedder.py", line 31, in __init__
    self.emb_dim = self.get_option("entity_embedder.dim")
  File "/datadrive/data/KG-GNN/kge/kge/config.py", line 619, in get_option
    return self.config.get_default(self.configuration_key + "." + name)
  File "/datadrive/data/KG-GNN/kge/kge/config.py", line 103, in get_default
    raise e
  File "/datadrive/data/KG-GNN/kge/kge/config.py", line 83, in get_default
    return self.get(key)
  File "/datadrive/data/KG-GNN/kge/kge/config.py", line 58, in get
    raise KeyError(f"Error accessing {name} for key {key}")
KeyError: 'Error accessing entity_embedder for key reciprocal_relations_model.base_model.entity_embedder.entity_embedder.dim'

The job configuration is as follows

dataset:
  name: fb15k-237
distmult:
  entity_embedder:
    type: gcn_embedder
    dropout: 0.4196834675552332
    regularize_weight: 2.816637953889144e-09
  relation_embedder:
    dropout: 0.40971036404279193
    regularize_weight: 8.19925611568694e-15
eval:
  batch_size: 256
  metrics_per:
    relation_type: true
  trace_level: example
import:
- distmult
- reciprocal_relations_model
lookup_embedder:
  dim: 256
  initialize: uniform_
  initialize_args:
    normal_:
      mean: 0.0
      std: 0.04037805388365049
    uniform_:
      a: -0.9352212163936202
    xavier_normal_:
      gain: 1.0
    xavier_uniform_:
      gain: 1.0
  regularize_args:
    p: 3
    weighted: true
model: reciprocal_relations_model
negative_sampling:
  implementation: batch
  num_samples:
    o: 402
    p: -1
    s: 255
reciprocal_relations_model:
  base_model:
    type: distmult

train:
  auto_correct: true
  batch_size: 1024
  loss_arg: 1.0
  lr_scheduler: ReduceLROnPlateau
  lr_scheduler_args:
    factor: 0.95
    mode: max
    patience: 6
    threshold: 0.0001
  max_epochs: 400
  optimizer_args:
    lr: 0.15953749294870845
  type: negative_sampling
valid:
  early_stopping:
    min_threshold:
      epochs: 50
      metric_value: 0.05
    patience: 10

rgemulla commented 2 years ago

The error is here:

  File "/datadrive/data/KG-GNN/kge/kge/model/embedder/gcn_embedder.py", line 31, in __init__
    self.emb_dim = self.get_option("entity_embedder.dim")

There is no key gcn_embedder.entity_embedder.dim. You probably want to use "dim" instead of "entity_embedder.dim".
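
Why "dim" works and "entity_embedder.dim" does not can be seen from how the lookup appears to proceed in the stack trace above: get_option prepends the embedder's configuration key to the option name, and when the full key is missing, get_default reads the parent's "type" and retries under that type. The following is a simplified sketch of that behaviour on nested dicts. It is an assumption reconstructed from the trace, not LibKGE's actual code; the function names only mirror those in the trace, and the configuration key is shortened to distmult.entity_embedder for readability.

```python
def get(options: dict, key: str):
    """Look up a dotted key in nested dicts; raise KeyError if any part is missing."""
    node = options
    for name in key.split("."):
        if not isinstance(node, dict) or name not in node:
            raise KeyError(f"Error accessing {name} for key {key}")
        node = node[name]
    return node


def get_default(options: dict, key: str):
    """Like get(), but fall back to the defaults stored under the parent's type."""
    try:
        return get(options, key)
    except KeyError as err:
        if "." not in key:
            raise
        parent, field = key.rsplit(".", 1)
        while True:
            try:
                parent_type = get(options, parent + ".type")
            except KeyError:
                # No type at this level: move one level up and try again.
                if "." not in parent:
                    raise err  # nothing left to try, report the original key
                parent, last = parent.rsplit(".", 1)
                field = last + "." + field
                continue
            new_key = parent_type + "." + field
            try:
                return get(options, new_key)
            except KeyError:
                parent, field = new_key.rsplit(".", 1)


# Reduced version of the configuration in this thread.
options = {
    "distmult": {"entity_embedder": {"type": "gcn_embedder"}},
    "gcn_embedder": {"dim": 256, "base_embedder": {"type": "lookup_embedder"}},
    "lookup_embedder": {"dim": 256},
}
key = "distmult.entity_embedder"  # stands in for self.configuration_key

print(get_default(options, key + ".dim"))  # resolved via gcn_embedder.dim
try:
    get_default(options, key + ".entity_embedder.dim")
except KeyError as e:
    print("lookup failed:", e)
```

In this sketch the second lookup ends up searching for gcn_embedder.entity_embedder.dim, which does not exist, and finally fails when it asks for gcn_embedder.type. That is consistent with the intermediate "Error accessing type for key gcn_embedder.type" message in the trace.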

jwzhi commented 2 years ago

That solved the problem. Thank you!
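
For later readers: the same kind of mismatch can be checked for every option name at once. The sketch below compares the option names requested in the gcn_embedder.py posted above with the keys in the posted gcn_embedder.yaml. It is only an illustration under assumptions: `has_key` is a hypothetical helper, the `+++` placeholders are left out, and type-based fallbacks are ignored.

```python
def has_key(options: dict, dotted: str) -> bool:
    """Return True if the dotted key exists in the nested dict."""
    node = options
    for name in dotted.split("."):
        if not isinstance(node, dict) or name not in node:
            return False
        node = node[name]
    return True


# Keys under gcn_embedder in the posted gcn_embedder.yaml (placeholders omitted).
gcn_embedder = {
    "class_name": "GCNEmbedder",
    "base_embedder": {"type": "lookup_embedder"},
    "initialize": "normal_",
    "initialize_args": {},
    "dropout": 0.3,
    "regularize": "lp",
    "regularize_weight": 0.0,
    "regularize_args": {"p": 2.0},
    "in_dim": 256,
    "dim": 256,
    "activation": "relu",
    "num_layers": 3,
}

# Option names passed to self.get_option(...) in the posted gcn_embedder.py.
requested = [
    "base_embedder.type",
    "entity_embedder.dim",
    "encoder.dropout",
    "encoder.num_layers",
    "encoder.in_dim",
    "encoder.dim",
    "regularize_weight",
    "regularize_args.p",
]

missing = [name for name in requested if not has_key(gcn_embedder, name)]
print(missing)
```

With the files exactly as posted, the names prefixed with encoder. would be reported as missing along with entity_embedder.dim, so they would presumably need the same change (for example "dropout" instead of "encoder.dropout").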