pykeen / benchmarking

📊 Results from the reproducibility and benchmarking studies presented in "Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework" (http://arxiv.org/abs/2006.13365)
MIT License

Lower results than in the paper for a model (probably doing something wrong) #27

Open Filco306 opened 1 year ago

Filco306 commented 1 year ago

Hello!

Thank you for a nice study and a nice repository! :D I am currently trying to re-use some of the hyperparameters from the study, e.g., those for ComplEx on the YAGO3-10 dataset. However, when I use the config files with the current version of PyKEEN, I get an error that owa is not a valid training loop; only ['lcwa', 'slcwa'] are accepted. I saw that you renamed OWA to SLCWA, so I switched from owa to slcwa accordingly.
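
For illustration, a minimal sketch of that rename (the filename complex_yago310.json is made up; other legacy keys, e.g. automatic_memory_optimization, also need to be moved, as the full script below does):

import json

from pykeen.pipeline import pipeline_from_config

# Load the published config (hypothetical local filename)
with open("complex_yago310.json") as f:
    config = json.load(f)

# "owa" was renamed to "slcwa" in newer PyKEEN versions
if config["pipeline"]["training_loop"] == "owa":
    config["pipeline"]["training_loop"] = "slcwa"

result = pipeline_from_config(config)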

However, training locally with PyKEEN 1.9.0 and slcwa gives me very different results: on the validation set the metrics are extremely low, while on the test set they are decent (but still far from the metrics reported in the benchmark database). For this specific run, I got the following:

# Results from me re-running the best experiment config found below
 'testing.both.realistic.inverse_harmonic_mean_rank': 0.3114,
 'testing.both.realistic.hits_at_1': 0.2213, 
 'testing.both.realistic.hits_at_3': 0.3609,
 'testing.both.realistic.hits_at_5': 0.419,
 'testing.both.realistic.hits_at_10': 0.4864,

 'validation.both.realistic.inverse_harmonic_mean_rank': 0.08471,
 'validation.both.realistic.hits_at_1': 0.0234,
 'validation.both.realistic.hits_at_3': 0.07252,
 'validation.both.realistic.hits_at_5': 0.136,
 'validation.both.realistic.hits_at_10': 0.2665,

# Results from benchmark database
 'results.metrics.inverse_harmonic_mean_rank.both.realistic': 0.46196680766490855,
 'results.metrics.hits_at_k.both.realistic.1': 0.3727418707346447,
 'results.metrics.hits_at_k.both.realistic.3': 0.5171617824167001,
 'results.metrics.hits_at_k.both.realistic.5': 0.5700521878763549,
 'results.metrics.hits_at_k.both.realistic.10': 0.6230429546366921,

I attach my training script below; I am most likely doing something wrong or missing some specific setting that changed in the more recent version of PyKEEN. Thanks again for a nice tool! :)

The results from the database can be seen below.

Config file (originally this one):

{
    "metadata": 
    {
        "best_trial_evaluation": 0.6191241462434712, 
        "best_trial_number": 3, 
        "git_hash": "UNHASHED", 
        "version": "0.1.2-dev"
    }, 
    "pipeline": 
    {
        "dataset": "yago310", 
        "dataset_kwargs": {
            "create_inverse_triples": false
        }, 
        "evaluation_kwargs": {
            "batch_size": null
        }, 
        "evaluator": "rankbased", 
        "evaluator_kwargs": 
        {
            "filtered": true
        }, 
        "loss": "softplus", 
        "model": "complex", 
        "model_kwargs": 
        {
            "automatic_memory_optimization": true, "embedding_dim": 256
        }, 
        "negative_sampler": "basic", 
        "negative_sampler_kwargs": {"num_negs_per_pos": 32}, 
        "optimizer": "adam", 
        "optimizer_kwargs": {
            "lr": 0.001723135381847608, "weight_decay": 0.0
        }, 
        "regularizer": "no", 
        "training_kwargs": {
            "batch_size": 8192, "label_smoothing": 0.0, "num_epochs": 131
        }, 
    "training_loop": "owa"}
}

Running the following prints the stored run from the benchmark database, with much better metrics:

(kgvenv)filco:~/$ python3 ablation/search.py --dataset yago310 --model complex
============================== 0 ==============================
{'create_inverse_triples': False,
 'dataset': 'yago310',
 'evaluator': 'rankbased',
 'hpo.metadata.title': 'HPO Over YAGO3-10 for ComplEx',
 'hpo.optuna.direction': 'maximize',
 'hpo.optuna.metric': 'hits@10',
 'hpo.optuna.n_trials': 100,
 'hpo.optuna.pruner': 'nop',
 'hpo.optuna.sampler': 'random',
 'hpo.optuna.storage': 'sqlite:////home/lauve/dataintegration/POEM_benchmarking_results/pykeen_experimental_results/ablation/config/adam/complex/yago310/random/owa/2020-05-21-02-47_1218c513-997d-483e-8d3f-3d6c144d8fdd/0001_yago310_complex/optuna_results.db',
 'hpo.optuna.timeout': 86400,
 'hpo.pipeline.dataset': 'yago310',
 'hpo.pipeline.dataset_kwargs.create_inverse_triples': False,
 'hpo.pipeline.evaluation_kwargs.batch_size': None,
 'hpo.pipeline.evaluator': 'RankBasedEvaluator',
 'hpo.pipeline.evaluator_kwargs.filtered': True,
 'hpo.pipeline.loss': 'SoftplusLoss',
 'hpo.pipeline.model': 'ComplEx',
 'hpo.pipeline.model_kwargs.automatic_memory_optimization': True,
 'hpo.pipeline.model_kwargs_ranges.embedding_dim.high': 8,
 'hpo.pipeline.model_kwargs_ranges.embedding_dim.low': 6,
 'hpo.pipeline.model_kwargs_ranges.embedding_dim.scale': 'power_two',
 'hpo.pipeline.model_kwargs_ranges.embedding_dim.type': 'int',
 'hpo.pipeline.negative_sampler': 'BasicNegativeSampler',
 'hpo.pipeline.negative_sampler_kwargs_ranges.num_negs_per_pos.high': 50,
 'hpo.pipeline.negative_sampler_kwargs_ranges.num_negs_per_pos.low': 1,
 'hpo.pipeline.negative_sampler_kwargs_ranges.num_negs_per_pos.q': 1,
 'hpo.pipeline.negative_sampler_kwargs_ranges.num_negs_per_pos.type': 'int',
 'hpo.pipeline.optimizer': 'adam',
 'hpo.pipeline.optimizer_kwargs.weight_decay': 0.0,
 'hpo.pipeline.optimizer_kwargs_ranges.lr.high': 0.1,
 'hpo.pipeline.optimizer_kwargs_ranges.lr.low': 0.001,
 'hpo.pipeline.optimizer_kwargs_ranges.lr.scale': 'log',
 'hpo.pipeline.optimizer_kwargs_ranges.lr.type': 'float',
 'hpo.pipeline.regularizer': 'NoRegularizer',
 'hpo.pipeline.stopper': 'early',
 'hpo.pipeline.stopper_kwargs.delta': 0.002,
 'hpo.pipeline.stopper_kwargs.frequency': 10,
 'hpo.pipeline.stopper_kwargs.patience': 5,
 'hpo.pipeline.training_kwargs.label_smoothing': 0.0,
 'hpo.pipeline.training_kwargs.num_epochs': 1000,
 'hpo.pipeline.training_kwargs_ranges.batch_size.high': 13,
 'hpo.pipeline.training_kwargs_ranges.batch_size.low': 10,
 'hpo.pipeline.training_kwargs_ranges.batch_size.scale': 'power_two',
 'hpo.pipeline.training_kwargs_ranges.batch_size.type': 'int',
 'hpo.pipeline.training_loop': 'owa',
 'hpo.type': 'hpo',
 'loss': 'softplus',
 'metadata.best_trial_evaluation': 0.6191241462434712,
 'metadata.best_trial_number': 3,
 'metadata.git_hash': 'UNHASHED',
 'metadata.version': '0.1.2-dev',
 'metric': 'hits@10',
 'model': 'complex',
 'negative_sampler': 'basic',
 'optimizer': 'adam',
 'pipeline_config.metadata.best_trial_evaluation': 0.6191241462434712,
 'pipeline_config.metadata.best_trial_number': 3,
 'pipeline_config.metadata.git_hash': 'UNHASHED',
 'pipeline_config.metadata.version': '0.1.2-dev',
 'pipeline_config.pipeline.dataset': 'yago310',
 'pipeline_config.pipeline.dataset_kwargs.create_inverse_triples': False,
 'pipeline_config.pipeline.evaluation_kwargs.batch_size': None,
 'pipeline_config.pipeline.evaluator': 'rankbased',
 'pipeline_config.pipeline.evaluator_kwargs.filtered': True,
 'pipeline_config.pipeline.loss': 'softplus',
 'pipeline_config.pipeline.model': 'complex',
 'pipeline_config.pipeline.model_kwargs.automatic_memory_optimization': True,
 'pipeline_config.pipeline.model_kwargs.embedding_dim': 256,
 'pipeline_config.pipeline.negative_sampler': 'basic',
 'pipeline_config.pipeline.negative_sampler_kwargs.num_negs_per_pos': 32,
 'pipeline_config.pipeline.optimizer': 'adam',
 'pipeline_config.pipeline.optimizer_kwargs.lr': 0.001723135381847608,
 'pipeline_config.pipeline.optimizer_kwargs.weight_decay': 0.0,
 'pipeline_config.pipeline.regularizer': 'no',
 'pipeline_config.pipeline.training_kwargs.batch_size': 8192,
 'pipeline_config.pipeline.training_kwargs.label_smoothing': 0.0,
 'pipeline_config.pipeline.training_kwargs.num_epochs': 131,
 'pipeline_config.pipeline.training_loop': 'owa',
 'pykeen_git_hash': 'UNHASHED',
 'pykeen_version': '0.1.2-dev',
 'regularizer': 'no',
 'replicate': 0,
...
 'results.metrics.hits_at_k.both.realistic.1': 0.3727418707346447,
 'results.metrics.hits_at_k.both.realistic.10': 0.6230429546366921,
 'results.metrics.hits_at_k.both.realistic.3': 0.5171617824167001,
 'results.metrics.hits_at_k.both.realistic.5': 0.5700521878763549,
 'results.metrics.inverse_harmonic_mean_rank.both.realistic': 0.46196680766490855,
...
 'searcher': 'random',
 'training_loop': 'owa'}

Version:

>>> pykeen.get_version()
'1.9.0'
My training script:

import json
import os
import wandb
from pykeen.trackers import WANDBResultTracker, CSVResultTracker
from pykeen import pipeline
from pykeen import datasets
from utils import flatten_dict
import argparse

PROJECT_NAME = "Pykeen Knowledge Graph Embeddings"

DSETNAME2DSET = {
    "kinships": "Kinships",
    "fb15k": "FB15k",
    "fb15k237": "FB15k237",
    "wn18": "WN18",
    "wn18rr": "WN18RR",
    "yago310": "YAGO310",
}

def run_transductive(config: dict, use_wandb: bool):
    if use_wandb:
        print("Using wandb tracker")
        wandb.init(
            project=PROJECT_NAME,
            entity=ENTITYNAME,
            name=f"{config['pipeline']['model']}-{config['pipeline']['dataset']}",
        )
        tracker = WANDBResultTracker(
            project=PROJECT_NAME,
            entity=ENTITYNAME,
            group=None,
            settings=wandb.Settings(start_method="fork"),
        )
        # tracker.wandb.config.update(flatten_dict(config))
        tracker.wandb.name = (
            f"{config['pipeline']['model']}-{config['pipeline']['dataset']}"
        )
    else:
        tracker = CSVResultTracker()
    # Instantiate the dataset class by name, with inverse triples as configured
    dataset = getattr(datasets, DSETNAME2DSET[config["pipeline"]["dataset"]])(
        create_inverse_triples=config["pipeline"]["dataset_kwargs"]["create_inverse_triples"]
    )
    if "callbacks" not in config["pipeline"]["training_kwargs"]:
        config["pipeline"]["training_kwargs"]["callbacks"] = ["evaluation-loop"]
    if "callback_kwargs" not in config["pipeline"]["training_kwargs"]:
        config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
            "prefix": "validation"
        }

    if "automatic_memory_optimization" in config["pipeline"]["model_kwargs"]:
        optimize_memory = config["pipeline"]["model_kwargs"].pop(
            "automatic_memory_optimization"
        )
        config["pipeline"]["training_loop_kwargs"] = {}
        config["pipeline"]["training_loop_kwargs"][
            "automatic_memory_optimization"
        ] = optimize_memory
        config["pipeline"]["evaluator_kwargs"][
            "automatic_memory_optimization"
        ] = optimize_memory
    if config["pipeline"]["training_loop"] == "owa":
        config["pipeline"]["training_loop"] = "slcwa" # Change to renamed training loop

    config["pipeline"]["training_kwargs"]["callback_kwargs"][
        "factory"
    ] = dataset.validation # Add validation dataset to callback kwargs
    config["pipeline"]["result_tracker"] = tracker
    pipeline_results = pipeline.pipeline_from_config(config)

    if use_wandb:
        tracker.log_metrics(
            metrics=pipeline_results.metric_results.to_flat_dict(),
            prefix="test",
        )
        tracker.wandb.finish()
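
The driver code is omitted here; a hypothetical entry point (flag names made up) would look like:

# Hypothetical entry point; the original driver was not part of this snippet.
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", required=True, help="path to a pipeline config JSON")
    parser.add_argument("--wandb", action="store_true", help="log to Weights & Biases")
    args = parser.parse_args()
    with open(args.config) as f:
        run_transductive(config=json.load(f), use_wandb=args.wandb)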

Has something changed in the package since the study was run that causes this mismatch, or am I perhaps using the package incorrectly? Thank you for your time, and thank you for your package!

Filco306 commented 1 year ago

To further update: I realized I had constrained my search to sLCWA runs only, so the run above does not correspond to the one presented in the paper (Table 18). However, switching to the one from the paper gives me 0.94 instead of 0.98 for ComplEx, which I think is close enough, given that results can never be reproduced exactly. RotatE also seems to give decent results on Kinships now (0.98 hits@10). But I still do not know why my results are remarkably lower for the settings above.

The validation curves for the different models also look very strange, but I guess that is a consequence of the large-scale hyperparameter tuning :)

mberr commented 1 year ago

Hi @Filco306

if you are looking at the validation curves generated by the EvaluationLoopTrainingCallback,

if "callbacks" not in config["pipeline"]["training_kwargs"]:
    config["pipeline"]["training_kwargs"]["callbacks"] = ["evaluation-loop"]
 if "callback_kwargs" not in config["pipeline"]["training_kwargs"]:
    config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
        "prefix": "validation"
    }

you may be missing filtering against the training triples, too. To do so, you would need to pass the additional key additional_filter_triples to callback_kwargs, i.e.,

config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
    "prefix": "validation",
    "additional_filter_triples": dataset.training,
}

This is a bit hidden, since this parameter goes from the EvaluationLoopTrainingCallback.__init__ via kwargs through pykeen.evaluation.Evaluator.evaluate to pykeen.evaluation.evaluate 😅
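
To illustrate with a paraphrased, self-contained sketch (not the actual PyKEEN source): the callback simply stores any extra keyword arguments and later splats them into the evaluation call, so additional_filter_triples never appears in its signature.

# Paraphrased sketch of the kwargs pass-through, not the actual PyKEEN code.
class EvaluatorSketch:
    def evaluate(self, additional_filter_triples=None, **kwargs):
        print("evaluate() received additional_filter_triples:", additional_filter_triples)

class CallbackSketch:
    def __init__(self, prefix=None, factory=None, **kwargs):
        self.evaluation_loop = EvaluatorSketch()
        self.kwargs = kwargs  # additional_filter_triples lands here, unnamed

    def post_epoch(self):
        # ... and is forwarded unchanged into the evaluation call
        self.evaluation_loop.evaluate(**self.kwargs)

CallbackSketch(prefix="validation", additional_filter_triples=["..."]).post_epoch()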

Filco306 commented 1 year ago

Hi there,

Thank you for your reply! :D I will re-run the experiment in question with your comment in mind and see if that fixes the results. If not, I'll get back to you :)

Thanks! :D

mberr commented 1 year ago

One more thing I noticed: https://pykeen.readthedocs.io/en/stable/api/pykeen.training.callbacks.EvaluationLoopTrainingCallback.html also needs the factory on which to evaluate, i.e.,

config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
    "prefix": "validation",
    "factory": dataset.validation,
    "additional_filter_triples": dataset.training,
}

Filco306 commented 1 year ago

Hi again @mberr ,

If I add what you suggested,

config["pipeline"]["training_kwargs"]["callback_kwargs"] = {
    "prefix": "validation",
    "factory": dataset.validation,
    "additional_filter_triples": dataset.training,
}

I get the error:

  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/training/training_loop.py", line 378, in train
    result = self._train(
  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/training/training_loop.py", line 734, in _train
    callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss)
  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/training/callbacks.py", line 438, in post_epoch
    callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss, **kwargs)
  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/training/callbacks.py", line 325, in post_epoch
    result = self.evaluation_loop.evaluate(**self.kwargs)
  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/evaluation/evaluation_loop.py", line 196, in evaluate
    return _evaluate(
  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/torch_max_mem/api.py", line 293, in inner
    result, self.parameter_value[h] = wrapped(*args, **kwargs)
  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/torch_max_mem/api.py", line 193, in wrapper_maximize_memory_utilization
    func(*bound_arguments.args, **p_kwargs, **bound_arguments.kwargs),
  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/evaluation/evaluation_loop.py", line 82, in _evaluate
    loader = loop.get_loader(batch_size=batch_size, **kwargs)
  File "/home/users/filip/.conda/envs/kgexperiments/lib/python3.10/site-packages/pykeen/evaluation/evaluation_loop.py", line 149, in get_loader
    return DataLoader(
TypeError: DataLoader.__init__() got an unexpected keyword argument 'additional_filter_triples'

Would you know what the issue is here? Note that I still get the warning WARNING:pykeen.evaluation.evaluation_loop:Enabled filtered evaluation, but not additional filter triples are passed., so the argument does not seem to be passed through properly.

mberr commented 1 year ago

Okay, this seems to be a bug in EvaluationLoop, which does not properly forward this argument when instantiating the LCWAEvaluationDataset here.

I used this smaller snippet to reproduce your error:

from pykeen.pipeline import pipeline
from pykeen.datasets import get_dataset

dataset = get_dataset(dataset="nations")
result = pipeline(
    dataset=dataset,
    model="mure",
    training_kwargs=dict(
        num_epochs=5,
        callbacks="evaluation-loop",
        callback_kwargs=dict(
            frequency=1,
            prefix="validation",
            factory=dataset.validation,
            additional_filter_triples=dataset.training,
        ),
    ),
)

EDIT: I opened a ticket here: https://github.com/pykeen/pykeen/issues/1213

Filco306 commented 1 year ago

Hello again @mberr ,

Thank you for this! Yes, I believe it is a bug. Thank you for flagging it!