TransE - CUDA out of memory

bolak92 commented 8 months ago

Describe the bug

Unlike the other models, training a TransE model fails after a few epochs (around epoch 19) with torch.cuda.OutOfMemoryError. I tested this on several GPUs and machines, and the result is the same.

Training epochs on cuda:0:   6%| | 19/300 [12:49<3:09:42, 40.51s/epoch, loss=1.6
Traceback (most recent call last):
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/pipeline/api.py", line 1546, in pipeline
    stopper_instance, configuration, losses, train_seconds = _handle_training(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/pipeline/api.py", line 1190, in _handle_training
    losses = training_loop_instance.train(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/training_loop.py", line 378, in train
    result = self._train(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/training_loop.py", line 735, in _train
    callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/callbacks.py", line 443, in post_epoch
    callback.post_epoch(epoch=epoch, epoch_loss=epoch_loss, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/training/callbacks.py", line 367, in post_epoch
    if self.stopper.should_stop(epoch):
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/stoppers/early_stopping.py", line 230, in should_stop
    metric_results = self.evaluator.evaluate(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 213, in evaluate
    rv = evaluate(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 687, in evaluate
    relation_filter = _evaluate_batch(
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/evaluation/evaluator.py", line 760, in _evaluate_batch
    scores = model.predict(hrt_batch=batch, target=target, slice_size=slice_size, mode=mode)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 481, in predict
    return self.predict_h(hrt_batch, **kwargs, heads=ids)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 372, in predict_h
    scores = self.score_h_inverse(rt_batch=rt_batch, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/base.py", line 528, in score_h_inverse
    return self.score_t(hr_batch=t_r_inv, tails=heads, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/models/nbase.py", line 505, in score_t
    scores=self.interaction.score(h=h, r=r, t=t, slice_size=slice_size, slice_dim=1),
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/modules.py", line 265, in score
    return self(h=h, r=r, t=t)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/modules.py", line 412, in forward
    return self.__class__.func(**self._prepare_for_functional(h=h, r=r, t=t))
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/nn/functional.py", line 754, in transe_interaction
    return negative_norm_of_sum(h, r, -t, p=p, power_norm=power_norm)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/utils.py", line 652, in negative_norm_of_sum
    return negative_norm(tensor_sum(*x), p=p, power_norm=power_norm)
  File "/home/bkhalil/anaconda3/envs/env-118/lib/python3.9/site-packages/pykeen/utils.py", line 626, in tensor_sum
    return sum(_reorder(tensors=tensors))

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 14.90 GiB. GPU 0 has a total capacty of 23.70 GiB of which 7.60 GiB is free. Including non-PyTorch memory, this process has 16.10 GiB memory in use. Of the allocated memory 964.10 MiB is allocated by PyTorch, and 14.43 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
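
The error message above points at PYTORCH_CUDA_ALLOC_CONF and max_split_size_mb. As a minimal sketch of how that hint could be tried, the variable can be set before PyTorch first touches CUDA; doing it before importing torch is the safest option. The value 128 is an arbitrary illustrative assumption, not a tuned recommendation, and this setting only targets fragmentation, so it may not help if a single 14.90 GiB allocation is genuinely too large for the card.

```python
import os

# Assumption: 128 MiB is an illustrative value only; a suitable value
# depends on the workload and is not something this thread establishes.
ALLOC_CONF = "max_split_size_mb:128"

# Set this before the first CUDA allocation, ideally before `import torch`,
# so the caching allocator reads it when it is initialised.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = ALLOC_CONF

# Import torch and pykeen only after this point, then call pipeline(...) as usual.
```

The same effect can be had by exporting the variable in the shell before launching the script.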

How to reproduce

result = pipeline(
    training=train,
    testing=test,
    validation=valid,
    model="TransE",
    model_kwargs={"embedding_dim": 300, "scoring_fct_norm": 1},
    epochs=300,
    stopper="early",
    stopper_kwargs={"frequency": 10, "patience": 2},
    result_tracker="wandb",
    result_tracker_kwargs=dict(project="project_name"),
    device="cuda",
)
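
To put the failed 14.90 GiB allocation in perspective, here is a rough back-of-the-envelope sketch. It assumes (this is not confirmed in the thread) that scoring every entity as a candidate materialises a float32 tensor of shape (batch_size, num_entities, embedding_dim), which is what the broadcast sum in the traceback suggests. The helper names are hypothetical and not part of PyKEEN.

```python
GIB = 1024 ** 3


def broadcast_bytes(batch_size, num_entities, embedding_dim, bytes_per_element=4):
    """Bytes needed for one (batch, entities, dim) tensor; float32 is assumed."""
    return batch_size * num_entities * embedding_dim * bytes_per_element


def max_pairs(budget_bytes, embedding_dim, bytes_per_element=4):
    """How many (triple, candidate entity) pairs fit into a memory budget."""
    return budget_bytes // (embedding_dim * bytes_per_element)


# Numbers taken from the report: embedding_dim=300 and a 14.90 GiB allocation.
failed_allocation = int(14.90 * GIB)
pairs = max_pairs(failed_allocation, embedding_dim=300)
print(f"{pairs:,} (triple, entity) pairs fit into the failed allocation")
```

Under these assumptions each (triple, entity) pair costs 1,200 bytes, so the failed allocation corresponds to roughly 13.3 million pairs. For a hypothetical graph with 40,000 entities, that would be an evaluation batch of around 333 triples, which suggests the evaluation batch size is what drives the allocation.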

Environment

Unable to handle parameter in CooccurrenceFilteredModel: base

Key	Value
OS	posix
Platform	Linux
Release	3.10.0-1160.15.2.el7.x86_64
Time	Sun Jan 21 22:36:32 2024
Python	3.9.18
PyKEEN	1.10.1
PyKEEN Hash	UNHASHED
PyKEEN Branch
PyTorch	2.1.2
CUDA Available?	true
CUDA Version	11.8
cuDNN Version	8700

Additional information

No response

Issue Template Checks

[X] This is not a feature request (use a different issue template if it is)
[X] This is not a question (use the discussions forum instead)
[X] I've read the text explaining why including environment information is important and understand if I omit this information that my issue will be dismissed

lukas-schwab commented 7 months ago

I believe this is a bug for other models as well. I'm running a TextRepresentation + DistMult interaction model, and despite having 80G of VRAM, PyKEEN still tries to allocate 14.90G more than I have. Coincidentally, that's OOM by exactly the same margin as in your example.

mberr commented 7 months ago

Hi @bolak92,

Could you check whether https://github.com/pykeen/pykeen/pull/1261 has solved your issue? It is not yet in a release, but you can use it by installing from source:

pip install git+https://github.com/pykeen/pykeen.git

ddofer commented 4 months ago

I can confirm that I get the same issue on the latest version when using the Apple silicon "mps" device, i.e., it consistently crashes during evaluation due to OOM on "mps" (MacBook Pro M3).
