uma-pi1 / kge

LibKGE - A knowledge graph embedding library for reproducible research
MIT License

Negative sampling still does KvsAll #250

Closed Filco306 closed 2 years ago

Filco306 commented 2 years ago

Hello!

I have this .yaml-config file for doing an AxSearchJob:

# wnrr-rotate-negative_sampling-kl
job.type: search
search.type: ax
dataset.name: wnrr

# training settings (fixed)
train:
  max_epochs: 400
  auto_correct: True

# this is faster for smaller datasets, but does not work for some models (e.g.,
# rotate due to a pytorch issue) or for larger datasets. Change to spo in such
# cases (either here or in ax section of model config), results will not be
# affected.
negative_sampling.implementation: sp_po

# validation/evaluation settings (fixed)
valid:
  every: 5
  metric: mean_reciprocal_rank_filtered_with_test
  filter_with_test: True
  early_stopping:
    patience: 10
    min_threshold.epochs: 50
    min_threshold.metric_value: 0.05

eval:
  batch_size: 256
  metrics_per.relation_type: True

# settings for reciprocal relations (if used)
import: [rotate, reciprocal_relations_model]
reciprocal_relations_model.base_model.type: rotate

# ax settings: hyperparameter search space
ax_search:
  num_trials: 30
  num_sobol_trials: 30 
  parameters:
    # model
    - name: model
      type: choice
      values: [rotate, reciprocal_relations_model]

    # training hyperparameters
    - name: train.batch_size
      type: choice   
      values: [128, 256, 512, 1024]
      is_ordered: True
    - name: train.type
      type: fixed
      value: negative_sampling
    - name: train.optimizer
      type: choice
      values: [Adam, Adagrad]
    - name: train.loss
      type: fixed
      value: kl
    - name: train.optimizer_args.lr     
      type: range
      bounds: [0.0003, 1.0]
      log_scale: True
    - name: train.lr_scheduler
      type: fixed
      value: ReduceLROnPlateau
    - name: train.lr_scheduler_args.mode
      type: fixed
      value: max  
    - name: train.lr_scheduler_args.factor
      type: fixed
      value: 0.95  
    - name: train.lr_scheduler_args.threshold
      type: fixed
      value: 0.0001  
    - name: train.lr_scheduler_args.patience
      type: range
      bounds: [0, 10]  

    # embedding dimension
    - name: lookup_embedder.dim
      type: choice 
      values: [128, 256, 512]
      is_ordered: True

    # embedding initialization
    - name: lookup_embedder.initialize
      type: choice
      values: [xavier_normal_, xavier_uniform_, normal_, uniform_]  
    - name: lookup_embedder.initialize_args.normal_.mean
      type: fixed
      value: 0.0
    - name: lookup_embedder.initialize_args.normal_.std
      type: range
      bounds: [0.00001, 1.0]
      log_scale: True
    - name: lookup_embedder.initialize_args.uniform_.a
      type: range
      bounds: [-1.0, -0.00001]
    - name: lookup_embedder.initialize_args.xavier_uniform_.gain
      type: fixed
      value: 1.0
    - name: lookup_embedder.initialize_args.xavier_normal_.gain
      type: fixed
      value: 1.0

    # embedding regularization
    - name: lookup_embedder.regularize
      type: choice
      values: ['', 'l3', 'l2', 'l1']
      is_ordered: True
    - name: lookup_embedder.regularize_args.weighted
      type: choice
      values: [True, False]
    - name: rotate.entity_embedder.regularize_weight
      type: range
      bounds: [1.0e-20, 1.0e-01]
      log_scale: True
    - name: rotate.relation_embedder.regularize_weight
      type: range
      bounds: [1.0e-20, 1.0e-01]
      log_scale: True

    # embedding dropout
    - name: rotate.entity_embedder.dropout
      type: range
      bounds: [-0.5, 0.5]
    - name: rotate.relation_embedder.dropout
      type: range
      bounds: [-0.5, 0.5]

    # training-type specific hyperparameters
    - name: negative_sampling.num_negatives_s #train_type: negative_sampling
      type: range                             #train_type: negative_sampling
      bounds: [1, 1000]                       #train_type: negative_sampling
      log_scale: True                         #train_type: negative_sampling
    - name: negative_sampling.num_negatives_o #train_type: negative_sampling
      type: range                             #train_type: negative_sampling
      bounds: [1, 1000]                       #train_type: negative_sampling
      log_scale: True                         #train_type: negative_sampling
    - name: rotate.l_norm
      type: choice
      values: [1.0, 2.0]
      is_ordered: True
    - name: rotate.entity_embedder.normalize.p
      type: choice
      values: [-1.0, 2.0]
    - name: rotate.relation_embedder.normalize.p
      type: choice
      values: [-1.0, 2.0]
    - name: negative_sampling.implementation
      type: fixed
      value: spo

However, when I try to run it, I get out-of-memory (OOM) errors because it still seems to perform a KvsAll training job. How do I completely disable KvsAll jobs in the search?

Thanks!

Here is my log output:

2021-11-25 14:02:41.337593 Using folder: /home/filco306/lib-kge-fork/local/experiments/20211125-140241-ROTATE
2021-11-25 14:02:41.337666 Configuration:
2021-11-25 14:02:41.355569   1vsAll:
2021-11-25 14:02:41.355601     class_name: TrainingJob1vsAll
2021-11-25 14:02:41.355609   KvsAll:
2021-11-25 14:02:41.355626     class_name: TrainingJobKvsAll
2021-11-25 14:02:41.355634     label_smoothing: 0.0
2021-11-25 14:02:41.355642     query_types:
2021-11-25 14:02:41.355651       _po: true
2021-11-25 14:02:41.355659       s_o: false
2021-11-25 14:02:41.355667       sp_: true
2021-11-25 14:02:41.355675   ax_search:
2021-11-25 14:02:41.355684     class_name: AxSearchJob
2021-11-25 14:02:41.355692     num_sobol_trials: 30
2021-11-25 14:02:41.355700     num_trials: 30
2021-11-25 14:02:41.355708     parameter_constraints: []
2021-11-25 14:02:41.355716     parameters:
2021-11-25 14:02:41.355724     - name: model
2021-11-25 14:02:41.355732       type: choice
2021-11-25 14:02:41.355740       values:
2021-11-25 14:02:41.355749       - rotate
2021-11-25 14:02:41.355757       - reciprocal_relations_model
2021-11-25 14:02:41.355765     - is_ordered: true
2021-11-25 14:02:41.355773       name: train.batch_size
2021-11-25 14:02:41.355781       type: choice
2021-11-25 14:02:41.355789       values:
2021-11-25 14:02:41.355797       - 128
2021-11-25 14:02:41.355805       - 256
2021-11-25 14:02:41.355813       - 512
2021-11-25 14:02:41.355821       - 1024
2021-11-25 14:02:41.355830     - name: train.type
2021-11-25 14:02:41.355838       type: fixed
2021-11-25 14:02:41.355846       value: negative_sampling
2021-11-25 14:02:41.355854     - name: train.optimizer
2021-11-25 14:02:41.355862       type: choice
2021-11-25 14:02:41.355870       values:
2021-11-25 14:02:41.355878       - Adam
2021-11-25 14:02:41.355886       - Adagrad
2021-11-25 14:02:41.355894     - name: train.loss
2021-11-25 14:02:41.355902       type: fixed
2021-11-25 14:02:41.355910       value: kl
2021-11-25 14:02:41.355918     - bounds:
2021-11-25 14:02:41.355927       - 0.0003
2021-11-25 14:02:41.355935       - 1.0
2021-11-25 14:02:41.355943       log_scale: true
2021-11-25 14:02:41.355951       name: train.optimizer_args.lr
2021-11-25 14:02:41.355959       type: range
2021-11-25 14:02:41.355967     - name: train.lr_scheduler
2021-11-25 14:02:41.355975       type: fixed
2021-11-25 14:02:41.355983       value: ReduceLROnPlateau
2021-11-25 14:02:41.355992     - name: train.lr_scheduler_args.mode
2021-11-25 14:02:41.356000       type: fixed
2021-11-25 14:02:41.356008       value: max
2021-11-25 14:02:41.356016     - name: train.lr_scheduler_args.factor
2021-11-25 14:02:41.356024       type: fixed
2021-11-25 14:02:41.356032       value: 0.95
2021-11-25 14:02:41.356040     - name: train.lr_scheduler_args.threshold
2021-11-25 14:02:41.356048       type: fixed
2021-11-25 14:02:41.356057       value: 0.0001
2021-11-25 14:02:41.356065     - bounds:
2021-11-25 14:02:41.356075       - 0
2021-11-25 14:02:41.356084       - 10
2021-11-25 14:02:41.356092       name: train.lr_scheduler_args.patience
2021-11-25 14:02:41.356100       type: range
2021-11-25 14:02:41.356109     - is_ordered: true
2021-11-25 14:02:41.356117       name: lookup_embedder.dim
2021-11-25 14:02:41.356125       type: choice
2021-11-25 14:02:41.356134       values:
2021-11-25 14:02:41.356142       - 128
2021-11-25 14:02:41.356150       - 256
2021-11-25 14:02:41.356158       - 512
2021-11-25 14:02:41.356166     - name: lookup_embedder.initialize
2021-11-25 14:02:41.356187       type: choice
2021-11-25 14:02:41.356196       values:
2021-11-25 14:02:41.356204       - xavier_normal_
2021-11-25 14:02:41.356212       - xavier_uniform_
2021-11-25 14:02:41.356220       - normal_
2021-11-25 14:02:41.356228       - uniform_
2021-11-25 14:02:41.356237     - name: lookup_embedder.initialize_args.normal_.mean
2021-11-25 14:02:41.356245       type: fixed
2021-11-25 14:02:41.356253       value: 0.0
2021-11-25 14:02:41.356261     - bounds:
2021-11-25 14:02:41.356270       - 1.0e-05
2021-11-25 14:02:41.356278       - 1.0
2021-11-25 14:02:41.356287       log_scale: true
2021-11-25 14:02:41.356295       name: lookup_embedder.initialize_args.normal_.std
2021-11-25 14:02:41.356303       type: range
2021-11-25 14:02:41.356311     - bounds:
2021-11-25 14:02:41.356319       - -1.0
2021-11-25 14:02:41.356327       - -1.0e-05
2021-11-25 14:02:41.356338       name: lookup_embedder.initialize_args.uniform_.a
2021-11-25 14:02:41.356345       type: range
2021-11-25 14:02:41.356352     - name: lookup_embedder.initialize_args.xavier_uniform_.gain
2021-11-25 14:02:41.356359       type: fixed
2021-11-25 14:02:41.356366       value: 1.0
2021-11-25 14:02:41.356372     - name: lookup_embedder.initialize_args.xavier_normal_.gain
2021-11-25 14:02:41.356380       type: fixed
2021-11-25 14:02:41.356394       value: 1.0
2021-11-25 14:02:41.356402     - is_ordered: true
2021-11-25 14:02:41.356429       name: lookup_embedder.regularize
2021-11-25 14:02:41.356438       type: choice
2021-11-25 14:02:41.356447       values:
2021-11-25 14:02:41.356456       - ''
2021-11-25 14:02:41.356468       - l3
2021-11-25 14:02:41.356481       - l2
2021-11-25 14:02:41.356493       - l1
2021-11-25 14:02:41.356506     - name: lookup_embedder.regularize_args.weighted
2021-11-25 14:02:41.356520       type: choice
2021-11-25 14:02:41.356535       values:
2021-11-25 14:02:41.356548       - true
2021-11-25 14:02:41.356561       - false
2021-11-25 14:02:41.356575     - bounds:
2021-11-25 14:02:41.356591       - 1.0e-20
2021-11-25 14:02:41.356604       - 0.1
2021-11-25 14:02:41.356618       log_scale: true
2021-11-25 14:02:41.356631       name: rotate.entity_embedder.regularize_weight
2021-11-25 14:02:41.356645       type: range
2021-11-25 14:02:41.356659     - bounds:
2021-11-25 14:02:41.356673       - 1.0e-20
2021-11-25 14:02:41.356689       - 0.1
2021-11-25 14:02:41.356704       log_scale: true
2021-11-25 14:02:41.356719       name: rotate.relation_embedder.regularize_weight
2021-11-25 14:02:41.356732       type: range
2021-11-25 14:02:41.356747     - bounds:
2021-11-25 14:02:41.356761       - -0.5
2021-11-25 14:02:41.356775       - 0.5
2021-11-25 14:02:41.356790       name: rotate.entity_embedder.dropout
2021-11-25 14:02:41.356805       type: range
2021-11-25 14:02:41.356820     - bounds:
2021-11-25 14:02:41.356837       - -0.5
2021-11-25 14:02:41.356853       - 0.5
2021-11-25 14:02:41.356868       name: rotate.relation_embedder.dropout
2021-11-25 14:02:41.356884       type: range
2021-11-25 14:02:41.356900     - bounds:
2021-11-25 14:02:41.356914       - 1
2021-11-25 14:02:41.356927       - 1000
2021-11-25 14:02:41.356942       log_scale: true
2021-11-25 14:02:41.356956       name: negative_sampling.num_negatives_s
2021-11-25 14:02:41.356971       type: range
2021-11-25 14:02:41.356986     - bounds:
2021-11-25 14:02:41.357001       - 1
2021-11-25 14:02:41.357016       - 1000
2021-11-25 14:02:41.357031       log_scale: true
2021-11-25 14:02:41.357046       name: negative_sampling.num_negatives_o
2021-11-25 14:02:41.357061       type: range
2021-11-25 14:02:41.357076     - is_ordered: true
2021-11-25 14:02:41.357090       name: rotate.l_norm
2021-11-25 14:02:41.357104       type: choice
2021-11-25 14:02:41.357120       values:
2021-11-25 14:02:41.357135       - 1.0
2021-11-25 14:02:41.357149       - 2.0
2021-11-25 14:02:41.357164     - name: rotate.entity_embedder.normalize.p
2021-11-25 14:02:41.357179       type: choice
2021-11-25 14:02:41.357193       values:
2021-11-25 14:02:41.357208       - -1.0
2021-11-25 14:02:41.357223       - 2.0
2021-11-25 14:02:41.357237     - name: rotate.relation_embedder.normalize.p
2021-11-25 14:02:41.357252       type: choice
2021-11-25 14:02:41.357266       values:
2021-11-25 14:02:41.357280       - -1.0
2021-11-25 14:02:41.357295       - 2.0
2021-11-25 14:02:41.357310     - name: negative_sampling.implementation
2021-11-25 14:02:41.357324       type: fixed
2021-11-25 14:02:41.357339       value: spo
2021-11-25 14:02:41.357353     sobol_seed: 0
2021-11-25 14:02:41.357367   console:
2021-11-25 14:02:41.357382     format: {}
2021-11-25 14:02:41.357396     quiet: false
2021-11-25 14:02:41.357411   conve:
2021-11-25 14:02:41.357426     2D_aspect_ratio: 2
2021-11-25 14:02:41.357440     class_name: ConvE
2021-11-25 14:02:41.357455     convolution_bias: true
2021-11-25 14:02:41.357506     entity_embedder:
2021-11-25 14:02:41.357522       +++: +++
2021-11-25 14:02:41.357537       dropout: 0.2
2021-11-25 14:02:41.357551       type: lookup_embedder
2021-11-25 14:02:41.357566     feature_map_dropout: 0.2
2021-11-25 14:02:41.357580     filter_size: 3
2021-11-25 14:02:41.357595     padding: 0
2021-11-25 14:02:41.357610     projection_dropout: 0.3
2021-11-25 14:02:41.357625     relation_embedder:
2021-11-25 14:02:41.357639       +++: +++
2021-11-25 14:02:41.357655       dropout: 0.2
2021-11-25 14:02:41.357669       type: lookup_embedder
2021-11-25 14:02:41.357684     round_dim: false
2021-11-25 14:02:41.357699     stride: 1
2021-11-25 14:02:41.357714   dataset:
2021-11-25 14:02:41.357727     +++: +++
2021-11-25 14:02:41.357741     files:
2021-11-25 14:02:41.357755       +++: +++
2021-11-25 14:02:41.357770       entity_ids:
2021-11-25 14:02:41.357784         filename: entity_ids.del
2021-11-25 14:02:41.357805         type: map
2021-11-25 14:02:41.357819       entity_strings:
2021-11-25 14:02:41.357834         filename: entity_ids.del
2021-11-25 14:02:41.357849         type: map
2021-11-25 14:02:41.357864       relation_ids:
2021-11-25 14:02:41.357878         filename: relation_ids.del
2021-11-25 14:02:41.357893         type: map
2021-11-25 14:02:41.357908       relation_strings:
2021-11-25 14:02:41.357923         filename: relation_ids.del
2021-11-25 14:02:41.357949         type: map
2021-11-25 14:02:41.357963       test:
2021-11-25 14:02:41.357977         filename: test.del
2021-11-25 14:02:41.357990         type: triples
2021-11-25 14:02:41.358004       train:
2021-11-25 14:02:41.358018         filename: train.del
2021-11-25 14:02:41.358031         type: triples
2021-11-25 14:02:41.358045       valid:
2021-11-25 14:02:41.358059         filename: valid.del
2021-11-25 14:02:41.358073         type: triples
2021-11-25 14:02:41.358087     name: wnrr
2021-11-25 14:02:41.358101     num_entities: -1
2021-11-25 14:02:41.358114     num_relations: -1
2021-11-25 14:02:41.358127     pickle: true
2021-11-25 14:02:41.358139   entity_ranking:
2021-11-25 14:02:41.358153     chunk_size: -1
2021-11-25 14:02:41.358166     class_name: EntityRankingJob
2021-11-25 14:02:41.358179     filter_splits:
2021-11-25 14:02:41.358191     - train
2021-11-25 14:02:41.358204     - valid
2021-11-25 14:02:41.358234     filter_with_test: true
2021-11-25 14:02:41.358248     hits_at_k_s:
2021-11-25 14:02:41.358263     - 1
2021-11-25 14:02:41.358277     - 3
2021-11-25 14:02:41.358291     - 10
2021-11-25 14:02:41.358307     - 50
2021-11-25 14:02:41.358321     - 100
2021-11-25 14:02:41.358337     - 200
2021-11-25 14:02:41.358352     - 300
2021-11-25 14:02:41.358367     - 400
2021-11-25 14:02:41.358382     - 500
2021-11-25 14:02:41.358397     - 1000
2021-11-25 14:02:41.358412     metrics_per:
2021-11-25 14:02:41.358427       argument_frequency: false
2021-11-25 14:02:41.358442       head_and_tail: false
2021-11-25 14:02:41.358457       relation_type: true
2021-11-25 14:02:41.358472     tie_handling: rounded_mean_rank
2021-11-25 14:02:41.358487   eval:
2021-11-25 14:02:41.358502     batch_size: 256
2021-11-25 14:02:41.358518     num_workers: 0
2021-11-25 14:02:41.358534     pin_memory: false
2021-11-25 14:02:41.358549     split: valid
2021-11-25 14:02:41.358564     trace_level: epoch
2021-11-25 14:02:41.358578     type: entity_ranking
2021-11-25 14:02:41.358593   grid_search:
2021-11-25 14:02:41.358608     class_name: GridSearchJob
2021-11-25 14:02:41.358623     parameters:
2021-11-25 14:02:41.358637       +++: +++
2021-11-25 14:02:41.358652     run: true
2021-11-25 14:02:41.358666   import:
2021-11-25 14:02:41.358682   - rotate
2021-11-25 14:02:41.358696   - reciprocal_relations_model
2021-11-25 14:02:41.358711   job:
2021-11-25 14:02:41.358726     device: cuda
2021-11-25 14:02:41.358740     type: search
2021-11-25 14:02:41.358754   lookup_embedder:
2021-11-25 14:02:41.358769     class_name: LookupEmbedder
2021-11-25 14:02:41.358784     dim: 100
2021-11-25 14:02:41.358798     dropout: 0.0
2021-11-25 14:02:41.358813     initialize: normal_
2021-11-25 14:02:41.358827     initialize_args:
2021-11-25 14:02:41.358842       +++: +++
2021-11-25 14:02:41.358857     normalize:
2021-11-25 14:02:41.358872       p: -1.0
2021-11-25 14:02:41.358886     pretrain:
2021-11-25 14:02:41.358901       ensure_all: false
2021-11-25 14:02:41.358915       model_filename: ''
2021-11-25 14:02:41.358929     regularize: lp
2021-11-25 14:02:41.358944     regularize_args:
2021-11-25 14:02:41.358958       +++: +++
2021-11-25 14:02:41.358973       p: 2
2021-11-25 14:02:41.358987       weighted: false
2021-11-25 14:02:41.359002     regularize_weight: 0.0
2021-11-25 14:02:41.359016     round_dim_to: []
2021-11-25 14:02:41.359030     sparse: false
2021-11-25 14:02:41.359045   manual_search:
2021-11-25 14:02:41.359059     class_name: ManualSearchJob
2021-11-25 14:02:41.359073     configurations: []
2021-11-25 14:02:41.359088     run: true
2021-11-25 14:02:41.359103   model: ''
2021-11-25 14:02:41.359118   modules:
2021-11-25 14:02:41.359132   - kge.job
2021-11-25 14:02:41.359147   - kge.model
2021-11-25 14:02:41.359163   - kge.model.embedder
2021-11-25 14:02:41.359177   negative_sampling:
2021-11-25 14:02:41.359191     class_name: TrainingJobNegativeSampling
2021-11-25 14:02:41.359206     filtering:
2021-11-25 14:02:41.359220       implementation: fast_if_available
2021-11-25 14:02:41.359235       o: false
2021-11-25 14:02:41.359249       p: false
2021-11-25 14:02:41.359263       s: false
2021-11-25 14:02:41.359278       split: ''
2021-11-25 14:02:41.359292     frequency:
2021-11-25 14:02:41.359307       smoothing: 1
2021-11-25 14:02:41.359321     implementation: batch
2021-11-25 14:02:41.359336     num_samples:
2021-11-25 14:02:41.359350       o: -1
2021-11-25 14:02:41.359364       p: 0
2021-11-25 14:02:41.359378       s: 3
2021-11-25 14:02:41.359393     sampling_type: uniform
2021-11-25 14:02:41.359408     shared: false
2021-11-25 14:02:41.359421     shared_type: default
2021-11-25 14:02:41.359436     with_replacement: true
2021-11-25 14:02:41.359451   random_seed:
2021-11-25 14:02:41.359465     default: -1
2021-11-25 14:02:41.359479     numba: -1
2021-11-25 14:02:41.359493     numpy: -1
2021-11-25 14:02:41.359508     python: -1
2021-11-25 14:02:41.359522     torch: -1
2021-11-25 14:02:41.359536   reciprocal_relations_model:
2021-11-25 14:02:41.359551     base_model:
2021-11-25 14:02:41.359565       +++: +++
2021-11-25 14:02:41.359579       type: rotate
2021-11-25 14:02:41.359594     class_name: ReciprocalRelationsModel
2021-11-25 14:02:41.359608   rotate:
2021-11-25 14:02:41.359622     class_name: RotatE
2021-11-25 14:02:41.359644     entity_embedder:
2021-11-25 14:02:41.359658       +++: +++
2021-11-25 14:02:41.359673       type: lookup_embedder
2021-11-25 14:02:41.359687     l_norm: 1.0
2021-11-25 14:02:41.359702     normalize_phases: true
2021-11-25 14:02:41.359716     relation_embedder:
2021-11-25 14:02:41.359730       +++: +++
2021-11-25 14:02:41.359746       dim: -1
2021-11-25 14:02:41.359760       initialize: uniform_
2021-11-25 14:02:41.359774       initialize_args:
2021-11-25 14:02:41.359788         uniform_:
2021-11-25 14:02:41.359802           a: -3.14159265359
2021-11-25 14:02:41.359818           b: 3.14159265359
2021-11-25 14:02:41.359832       type: lookup_embedder
2021-11-25 14:02:41.359846   search:
2021-11-25 14:02:41.359856     device_pool: []
2021-11-25 14:02:41.359865     num_workers: 1
2021-11-25 14:02:41.359874     on_error: abort
2021-11-25 14:02:41.359882     type: ax_search
2021-11-25 14:02:41.359891   train:
2021-11-25 14:02:41.359901     abort_on_nan: true
2021-11-25 14:02:41.359909     auto_correct: true
2021-11-25 14:02:41.359918     batch_size: 100
2021-11-25 14:02:41.359927     checkpoint:
2021-11-25 14:02:41.359935       every: 5
2021-11-25 14:02:41.359945       keep: 3
2021-11-25 14:02:41.359954       keep_init: true
2021-11-25 14:02:41.359963     loss: kl
2021-11-25 14:02:41.359972     loss_arg: .nan
2021-11-25 14:02:41.359981     lr_scheduler: ''
2021-11-25 14:02:41.359989     lr_scheduler_args:
2021-11-25 14:02:41.360023       +++: +++
2021-11-25 14:02:41.360032     lr_warmup: 0
2021-11-25 14:02:41.360041     max_epochs: 400
2021-11-25 14:02:41.360051     num_workers: 0
2021-11-25 14:02:41.360060     optimizer:
2021-11-25 14:02:41.360070       +++: +++
2021-11-25 14:02:41.360078       default:
2021-11-25 14:02:41.360087         args:
2021-11-25 14:02:41.360096           +++: +++
2021-11-25 14:02:41.360104         type: Adagrad
2021-11-25 14:02:41.360113     pin_memory: false
2021-11-25 14:02:41.360122     split: train
2021-11-25 14:02:41.360130     subbatch_auto_tune: false
2021-11-25 14:02:41.360139     subbatch_size: -1
2021-11-25 14:02:41.360148     trace_level: epoch
2021-11-25 14:02:41.360156     type: KvsAll <!--- HERE!-->
2021-11-25 14:02:41.360166     visualize_graph: false
2021-11-25 14:02:41.360174   training_loss:
2021-11-25 14:02:41.360183     class_name: TrainingLossEvaluationJob
2021-11-25 14:02:41.360192   user:
2021-11-25 14:02:41.360201     +++: +++
2021-11-25 14:02:41.360211   valid:
2021-11-25 14:02:41.360220     early_stopping:
2021-11-25 14:02:41.360228       patience: 10
2021-11-25 14:02:41.360237       threshold:
2021-11-25 14:02:41.360246         epochs: 50
2021-11-25 14:02:41.360255         metric_value: 0.05
2021-11-25 14:02:41.360263     every: 5
2021-11-25 14:02:41.360278     metric: mean_reciprocal_rank_filtered_with_test
2021-11-25 14:02:41.360287     metric_expr: float("nan")
2021-11-25 14:02:41.360296     metric_max: true
2021-11-25 14:02:41.360306     split: valid
2021-11-25 14:02:41.360315     trace_level: epoch
2021-11-25 14:02:41.373541   git commit: 79a857f
2021-11-25 14:02:41.373793 Loading configuration of dataset wnrr from /home/filco306/lib-kge-fork/data/wnrr ...
2021-11-25 14:02:41.383877 Loaded 41105 keys from map entity_ids
2021-11-25 14:02:41.384033 Loaded 11 keys from map relation_ids
2021-11-25 14:02:41.385474 Loaded 86835 train triples
2021-11-25 14:02:41.385765 Loaded 3034 valid triples
2021-11-25 14:02:41.386038 Loaded 3134 test triples
2021-11-25 14:02:41.386328 [6361cf82] Using device pool: ['cuda']
2021-11-25 14:02:41.420615 [6361cf82] ax search initialized with GenerationStrategy(name='Sobol+GPEI', steps=[Sobol for 30 trials, GPEI for subsequent trials])
2021-11-25 14:02:41.576644 [6361cf82] Registering trial 0/29...
2021-11-25 14:02:41.599978 [6361cf82] Created trial 00000 with parameters: {'train.batch_size': 512, 'train.optimizer_args.lr': 0.0038263341389451326, 'train.lr_scheduler_args.patience': 10, 'lookup_embedder.dim': 128, 'lookup_embedder.initialize_args.normal_.std': 0.2594381804418265, 'lookup_embedder.initialize_args.uniform_.a': -0.11983964847803108, 'lookup_embedder.regularize': 'l1', 'rotate.entity_embedder.regularize_weight': 8.497092712317215e-14, 'rotate.relation_embedder.regularize_weight': 0.012227634287835265, 'rotate.entity_embedder.dropout': -0.27799247205257416, 'rotate.relation_embedder.dropout': 0.19008886814117432, 'negative_sampling.num_negatives_s': 980, 'negative_sampling.num_negatives_o': 59, 'rotate.l_norm': 1.0, 'model': 'reciprocal_relations_model', 'train.optimizer': 'Adagrad', 'lookup_embedder.initialize': 'xavier_normal_', 'lookup_embedder.regularize_args.weighted': True, 'rotate.entity_embedder.normalize.p': 2.0, 'rotate.relation_embedder.normalize.p': 2.0, 'train.type': 'negative_sampling', 'train.loss': 'kl', 'train.lr_scheduler': 'ReduceLROnPlateau', 'train.lr_scheduler_args.mode': 'max', 'train.lr_scheduler_args.factor': 0.95, 'train.lr_scheduler_args.threshold': 0.0001, 'lookup_embedder.initialize_args.normal_.mean': 0.0, 'lookup_embedder.initialize_args.xavier_uniform_.gain': 1.0, 'lookup_embedder.initialize_args.xavier_normal_.gain': 1.0, 'negative_sampling.implementation': 'spo'}
2021-11-25 14:02:41.621957 [6361cf82] Saving checkpoint to /home/filco306/lib-kge-fork/local/experiments/20211125-140241-ROTATE/checkpoint_00001.pt...
2021-11-25 14:02:41.622286 [6361cf82] Starting training job /home/filco306/lib-kge-fork/local/experiments/20211125-140241-ROTATE/00000 (1/30) on device cuda...
2021-11-25 14:03:45.624664 [6361cf82]   62633 distinct sp pairs in train
2021-11-25 14:03:45.659191 [6361cf82]   40996 distinct po pairs in train
2021-11-25 14:03:45.662284 [6361cf82]   2916 distinct sp pairs in valid
2021-11-25 14:03:45.664440 [6361cf82]   2646 distinct po pairs in valid
2021-11-25 14:03:45.666692 [6361cf82]   3022 distinct sp pairs in test
2021-11-25 14:03:45.668840 [6361cf82]   2694 distinct po pairs in test
2021-11-25 14:03:46.469058 [6361cf82] Trial 00000 failed: RuntimeError('The following operation failed in the TorchScript interpreter.\nTraceback of TorchScript (most recent call last):\n  File "/home/filco306/lib-kge-fork/kge/model/rotate.py", line 201, in abs_complex\n    "Compute magnitude of given complex numbers"\n    x_re_im = torch.stack((x_re, x_im), dim=0)  # dim0: real, imaginary\n    return torch.norm(x_re_im, dim=0)  # sqrt(real^2+imaginary^2)\n           ~~~~~~~~~~ <--- HERE\n  File "/home/filco306/.envs/libkge-env/lib/python3.8/site-packages/torch/functional.py", line 1333, in norm\n                _dim = list(range(ndim))\n            if out is None:\n                return _VF.frobenius_norm(input, _dim, keepdim=keepdim)\n                       ~~~~~~~~~~~~~~~~~~ <--- HERE\n            else:\n                return _VF.frobenius_norm(input, _dim, keepdim=keepdim, out=out)\nRuntimeError: CUDA out of memory. Tried to allocate 2.51 GiB (GPU 0; 15.72 GiB total capacity; 12.63 GiB already allocated; 1.95 GiB free; 12.71 GiB reserved in total by PyTorch)\n')
2021-11-25 14:03:46.469158 [6361cf82] Aborting search due to failure of trial 00000
2021-11-25 14:03:46.470588 [6361cf82] Traceback (most recent call last):
2021-11-25 14:03:46.470597 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/cli.py", line 285, in main
2021-11-25 14:03:46.470600 [6361cf82]     job.run()
2021-11-25 14:03:46.470603 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/job.py", line 159, in run
2021-11-25 14:03:46.470605 [6361cf82]     result = self._run()
2021-11-25 14:03:46.470608 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/search_auto.py", line 160, in _run
2021-11-25 14:03:46.470610 [6361cf82]     self.submit_task(
2021-11-25 14:03:46.470613 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/search.py", line 70, in submit_task
2021-11-25 14:03:46.470616 [6361cf82]     self.ready_task_results.append(task(task_arg, device=self.free_devices[0]))
2021-11-25 14:03:46.470618 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/search.py", line 232, in _run_train_job
2021-11-25 14:03:46.470621 [6361cf82]     raise e
2021-11-25 14:03:46.470624 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/search.py", line 186, in _run_train_job
2021-11-25 14:03:46.470626 [6361cf82]     job.run()
2021-11-25 14:03:46.470629 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/job.py", line 159, in run
2021-11-25 14:03:46.470631 [6361cf82]     result = self._run()
2021-11-25 14:03:46.470633 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/train.py", line 224, in _run
2021-11-25 14:03:46.470636 [6361cf82]     trace_entry = self.valid_job.run()
2021-11-25 14:03:46.470639 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/job.py", line 159, in run
2021-11-25 14:03:46.470641 [6361cf82]     result = self._run()
2021-11-25 14:03:46.470643 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/eval.py", line 67, in _run
2021-11-25 14:03:46.470646 [6361cf82]     self._evaluate()
2021-11-25 14:03:46.470649 [6361cf82]   File "/home/filco306/.envs/libkge-env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
2021-11-25 14:03:46.470651 [6361cf82]     return func(*args, **kwargs)
2021-11-25 14:03:46.470654 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/job/eval_entity_ranking.py", line 207, in _evaluate
2021-11-25 14:03:46.470656 [6361cf82]     scores = self.model.score_sp_po(
2021-11-25 14:03:46.470659 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/model/reciprocal_relations_model.py", line 98, in score_sp_po
2021-11-25 14:03:46.470661 [6361cf82]     sp_scores = self._scorer.score_emb(s, p, all_entities, combine="sp_")
2021-11-25 14:03:46.470664 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/model/rotate.py", line 50, in score_emb
2021-11-25 14:03:46.470666 [6361cf82]     diff_abs = abs_complex(diff_re, diff_im)  # sp x o x dim
2021-11-25 14:03:46.470669 [6361cf82] RuntimeError: The following operation failed in the TorchScript interpreter.
2021-11-25 14:03:46.470671 [6361cf82] Traceback of TorchScript (most recent call last):
2021-11-25 14:03:46.470674 [6361cf82]   File "/home/filco306/lib-kge-fork/kge/model/rotate.py", line 201, in abs_complex
2021-11-25 14:03:46.470676 [6361cf82]     "Compute magnitude of given complex numbers"
2021-11-25 14:03:46.470678 [6361cf82]     x_re_im = torch.stack((x_re, x_im), dim=0)  # dim0: real, imaginary
2021-11-25 14:03:46.470681 [6361cf82]     return torch.norm(x_re_im, dim=0)  # sqrt(real^2+imaginary^2)
2021-11-25 14:03:46.470684 [6361cf82]            ~~~~~~~~~~ <--- HERE
2021-11-25 14:03:46.470686 [6361cf82]   File "/home/filco306/.envs/libkge-env/lib/python3.8/site-packages/torch/functional.py", line 1333, in norm
2021-11-25 14:03:46.470688 [6361cf82]                 _dim = list(range(ndim))
2021-11-25 14:03:46.470691 [6361cf82]             if out is None:
2021-11-25 14:03:46.470693 [6361cf82]                 return _VF.frobenius_norm(input, _dim, keepdim=keepdim)
2021-11-25 14:03:46.470696 [6361cf82]                        ~~~~~~~~~~~~~~~~~~ <--- HERE
2021-11-25 14:03:46.470698 [6361cf82]             else:
2021-11-25 14:03:46.470701 [6361cf82]                 return _VF.frobenius_norm(input, _dim, keepdim=keepdim, out=out)
2021-11-25 14:03:46.470703 [6361cf82] RuntimeError: CUDA out of memory. Tried to allocate 2.51 GiB (GPU 0; 15.72 GiB total capacity; 12.63 GiB already allocated; 1.95 GiB free; 12.71 GiB reserved in total by PyTorch)
2021-11-25 14:03:46.470706 [6361cf82] 
AdrianKs commented 2 years ago

Hi, I think the problem is not that you use KvsAll instead of negative sampling, but rather that the evaluation runs out of memory. By default, we score against all entities during evaluation. Especially for RotatE, this leads to high memory consumption. You can reduce the memory overhead during evaluation by scoring against only one chunk of the entities at a time. You can do so with the option:

entity_ranking.chunk_size: 5000

The smaller the chunk size, the lower the memory consumption. You can find the relevant documentation here: https://github.com/uma-pi1/kge/blob/ed53b69aff350de33b236736c86e1ac4e33e3421/kge/config-default.yaml#L529
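
To make the effect of chunking concrete, here is a rough back-of-the-envelope estimate in Python. It is only a sketch under stated assumptions: values are stored as float32 (4 bytes), and the 128-dimensional entity embedding of trial 00000 is split into 64 real and 64 imaginary components, so the `sp x o x dim` tensor from the traceback has a last dimension of 64. The helper `score_tensor_gib` is hypothetical and not part of LibKGE.

```python
def score_tensor_gib(batch_size: int, num_candidates: int, dim: int,
                     bytes_per_value: int = 4) -> float:
    """Size in GiB of one dense (batch_size x num_candidates x dim) tensor."""
    return batch_size * num_candidates * dim * bytes_per_value / 2**30


# Values taken from the log above: eval.batch_size = 256, 41105 entities.
# Assumption: dim = 64, i.e. half of lookup_embedder.dim = 128 (real/imaginary split).
full = score_tensor_gib(batch_size=256, num_candidates=41105, dim=64)

# Same tensor when only 5000 entities are scored at a time
# (entity_ranking.chunk_size: 5000).
chunked = score_tensor_gib(batch_size=256, num_candidates=5000, dim=64)

print(f"all entities:  {full:.2f} GiB")     # prints "all entities:  2.51 GiB"
print(f"chunk of 5000: {chunked:.2f} GiB")  # prints "chunk of 5000: 0.31 GiB"
```

Under these assumptions, the first figure matches the 2.51 GiB allocation that failed in the traceback. This is only one of several intermediate tensors, so actual peak memory is higher, but each of them shrinks in proportion to the chunk size.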

rgemulla commented 2 years ago

I agree. Note that the KvsAll line that you marked is the one from the default config, which is later modified by the search. You can see the actual config being used in the folder of the trial; all changes are also printed in the console after "Created trial ... with parameters".
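
As an illustration of why the dumped configuration still shows `KvsAll`, the following Python sketch mimics how dotted trial parameters could be layered on top of the defaults. It is not LibKGE's actual implementation; `apply_overrides` is a hypothetical helper, and the values are copied from the log above.

```python
import copy


def apply_overrides(defaults: dict, overrides: dict) -> dict:
    """Return a copy of `defaults` with dotted keys such as 'train.type' applied."""
    config = copy.deepcopy(defaults)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = config
        for part in parents:
            # Walk down the nested dicts, creating missing levels as needed.
            node = node.setdefault(part, {})
        node[leaf] = value
    return config


# What the search job prints at startup: the defaults, before any trial exists.
defaults = {"train": {"type": "KvsAll", "batch_size": 100}}

# A subset of the parameters listed after "Created trial 00000 with parameters".
trial_parameters = {"train.type": "negative_sampling", "train.batch_size": 512}

trial_config = apply_overrides(defaults, trial_parameters)
print(defaults["train"]["type"])      # prints "KvsAll"
print(trial_config["train"]["type"])  # prints "negative_sampling"
```

The configuration printed at startup corresponds to `defaults` here, while each training job runs with something like `trial_config`.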

Filco306 commented 2 years ago

I feel stupid now. Thank you, you are entirely correct. Still some things to learn about this super nice package :)