uma-pi1 / kgt5-context


probability tensor error occurred in the training process #3

Closed LebrontoJ closed 4 months ago

LebrontoJ commented 4 months ago

When I tried to run your model on the FB15k-237 dataset, errors about the probability tensor occurred every time after several epochs (the exact number of epochs varied each time). Do you have any idea about this? (Three screenshots of the error attached, dated 2024-05-13.)
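For context, assuming the screenshots show PyTorch's usual "probability tensor contains either `inf`, `nan` or element < 0" message, this error is raised by `torch.multinomial` when the sampling distribution contains NaN or Inf values, which an fp16 overflow in the logits can produce. A minimal sketch of how it triggers:

```python
import torch

# Sampling from a distribution containing NaN (e.g. after an fp16
# overflow in the model's logits) raises the "probability tensor"
# RuntimeError during generation.
probs = torch.tensor([0.5, float("nan"), 0.5])
try:
    torch.multinomial(probs, num_samples=1)
except RuntimeError as e:
    print(f"RuntimeError: {e}")
```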

AdrianKs commented 4 months ago

Hi, could you please provide the configurations you used?

LebrontoJ commented 4 months ago

```yaml
dataset:
  name: FB15k-237 # wikidata5m_v3
  # use original KGT5 without context
  v1: False
  # use deprecated input format. Necessary to properly evaluate old models.
  is_legacy: False
model:
  name: t5-small
  tokenizer_type: t5
  max_input_length: 512
  max_output_length: 40
context:
  use: True
  max_size: 100
  shuffle: True
descriptions:
  use: False
train:
  batch_size: 4
  max_epochs: 100
  drop_subject: 0.0
  num_workers: 4
  precision: 16
  accelerator: auto
  devices: auto
  strategy: ddp_find_unused_parameters_false
  # strategy: ddp
eval:
  num_predictions: 100
  max_length: 40
  batch_size: 1
valid:
  every: 1
  tiny: False # True
checkpoint:
  keep_top_k: 3
resume_from: ""
wandb:
  use: False
  project_name: kgt5context${dataset.name}
  run_name: v1=${dataset.v1}_desc=${descriptions.use}_bs=${train.batch_size}
hydra:
  job:
    chdir: False
  run:
    dir: ./outputs/${dataset.name}/v1=${dataset.v1}/descriptions=${descriptions.use}/${now:%Y-%m-%d-%H-%M}
    # dir: ./tmp
```

AdrianKs commented 4 months ago

Hi, I think the main problem is the small batch size. Yesterday I tried the same setting on a single GPU but with batch size 96 and it worked out fine. With even larger total batch sizes it should be more stable. But to be sure, I uploaded the dataset here, so we work on the same one.
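For reference, the "total batch size" here is the product of the per-device batch size, the number of devices under DDP, and any gradient-accumulation steps; a quick sketch (function and parameter names are illustrative, not the repo's config keys):

```python
def effective_batch_size(per_device: int, num_devices: int, accumulation: int = 1) -> int:
    """Total examples contributing to one optimizer step under DDP."""
    return per_device * num_devices * accumulation

# batch_size: 4 on one GPU vs. the suggested 96
print(effective_batch_size(4, 1))   # 4
print(effective_batch_size(96, 1))  # 96
```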

If neither the batch size nor the dataset helps, I suggest additionally using a small weight decay. You can do so by changing this line: https://github.com/uma-pi1/kgt5-context/blob/f9b9272e19a6855746871385b1113fdeb14c18aa/kgt5_model.py#L44

to

```python
optimizer = Adafactor(
    self.parameters(),
    scale_parameter=True,
    relative_step=True,
    warmup_init=True,
    lr=None,
    weight_decay=0.00001,
)
```

LebrontoJ commented 4 months ago

Thank you for your help! I've increased the batch size and hope it will work well.

AdrianKs commented 4 months ago

Closing the issue for now. Feel free to reopen if the issue persists.