lxucs / coref-hoi

PyTorch implementation of the end-to-end coreference resolution model with different higher-order inference methods.
Apache License 2.0

Training issue with bert_base #9

Closed: sushantakpani closed this issue 3 years ago

sushantakpani commented 3 years ago

Hi @lxucs,

I want to train a bert_base model with no HOI, like the spanbert_large_ml0_d1 model:

python run.py bert_base 0

I got this error:


Traceback (most recent call last):
  File "run.py", line 289, in <module>
    model = runner.initialize_model()
  File "run.py", line 51, in initialize_model
    model = CorefModel(self.config, self.device)
  File "/VL/space/sushantakp/research_work/coref-hoi/model.py", line 33, in __init__
    self.bert = BertModel.from_pretrained(config['bert_pretrained_name_or_path'])
  File "/VL/space/sushantakp/.conda/envs/skp_env376/lib/python3.7/site-packages/transformers/modeling_utils.py", line 935, in from_pretrained
    raise EnvironmentError(msg)
OSError: Can't load weights for 'bert-base-cased'. Make sure that:

Do I need to change any parameter in experiments.conf?
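For reference, the parameters in the bert_base block of experiments.conf that control which pretrained weights are loaded look roughly like this (a sketch only; the hub name is the one shown in the error message, and the local-directory alternative is just a guess at a workaround if the machine cannot download from the Hugging Face hub):

bert_base = ${best}{
  ...
  bert_tokenizer_name = bert-base-cased
  # the Hugging Face model name, downloaded on first use ...
  bert_pretrained_name_or_path = bert-base-cased
  # ... or, hypothetically, a local directory containing the downloaded weights
}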

sushantakpani commented 3 years ago

python run.py train_bert_base_ml0_d1 0 also gave the same result, whereas python run.py train_spanbert_base_ml0_d1 0 progressed to training but halted with a CUDA out-of-memory error:

RuntimeError: CUDA out of memory. Tried to allocate 630.00 MiB (GPU 5; 10.76 GiB total capacity; 8.44 GiB already allocated; 315.12 MiB free; 9.60 GiB reserved in total by PyTorch)

AradAshrafi commented 3 years ago

> python run.py train_bert_base_ml0_d1 0 also gave the same result, whereas python run.py train_spanbert_base_ml0_d1 0 progressed to training but halted with a CUDA out-of-memory error: RuntimeError: CUDA out of memory. Tried to allocate 630.00 MiB (GPU 5; 10.76 GiB total capacity; 8.44 GiB already allocated; 315.12 MiB free; 9.60 GiB reserved in total by PyTorch)

Hi,

I solve these kinds of issues by changing some parameters in the experiments.conf file to decrease the size of the model. For example, you can decrease ffnn_size and max_segment_len.
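Something along these lines for spanbert_base (just a sketch; the exact values are a trade-off between GPU memory and model capacity, and 128 / 1000 are not the only reasonable choices):

spanbert_base = ${best}{
  ...
  max_segment_len = 128     # down from 384
  ffnn_size = 1000          # down from 3000
  cluster_ffnn_size = 1000  # down from 3000
  ...
}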

sushantakpani commented 3 years ago

> Hi, I solve these kinds of issues by changing some parameters in the experiments.conf file to decrease the size of the model. For example, you can decrease ffnn_size and max_segment_len.

Hi @AradAshrafi, thanks for your response and the tips for solving the CUDA memory issue. I will try that. Were you able to train bert_base like spanbert_base?

Sushanta

sushantakpani commented 3 years ago


It seems to be working now. I tried python run.py train_spanbert_base_ml0_d1 0 with the following values in experiments.conf for spanbert_base:

spanbert_base = ${best}{
  num_docs = 2802
  bert_learning_rate = 2e-05
  task_learning_rate = 0.0001
  max_segment_len = 128       # 384
  ffnn_size = 1000            # 3000
  cluster_ffnn_size = 1000    # 3000
  max_training_sentences = 3
  bert_tokenizer_name = bert-base-cased
  bert_pretrained_name_or_path = ${best.data_dir}/spanbert_base
}

sushantakpani commented 3 years ago

@lxucs Please let me know how I can train bert_base.

lxucs commented 3 years ago

Hi @sushantakpani, you can have a config like this (similar to training spanbert_base):


train_bert_base_ml0_d1 = ${train_bert_base}{
  mention_loss_coef = 0
  coref_depth = 1
}
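
Then training is launched the same way as the other configs, with the config name followed by the GPU id (as in the commands above):

python run.py train_bert_base_ml0_d1 0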
sushantakpani commented 3 years ago

Hi @lxucs

It seems this error was due to a GPU memory issue. I have shifted to a GPU server with more memory and am now able to run the training: python run.py train_bert_base_ml0_d1 0

My configuration is as follows:

bert_base = ${best}{
  num_docs = 2802
  bert_learning_rate = 1e-05
  task_learning_rate = 2e-4
  max_segment_len = 128
  ffnn_size = 1000          # 3000
  cluster_ffnn_size = 1000  # 3000
  max_training_sentences = 11
  bert_tokenizer_name = bert-base-cased
  bert_pretrained_name_or_path = bert-base-cased
}

train_bert_base = ${bert_base}{ }

train_bert_base_ml0_d1 = ${train_bert_base}{
  mention_loss_coef = 0
  coref_depth = 1
}

sushantakpani commented 3 years ago

In the Experiments section of the paper:

> Note that BERT and SpanBERT completely rely on only local decisions without any HOI. Particularly, +AA is equivalent to Joshi et al. (2020).

Please let me know what the configuration should be to replicate the Joshi et al. (2020) work.

Is this configuration fine: higher_order = attended_antecedent, together with

train_spanbert_base_ml0_d1 = ${train_spanbert_base}{ mention_loss_coef = 0 coref_depth = 2 }
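(Putting the two pieces together, and assuming higher_order is overridden inside the same training block, it would look something like the sketch below; I am not sure whether higher_order belongs here or in the base config instead.)

train_spanbert_base_ml0_d1 = ${train_spanbert_base}{
  mention_loss_coef = 0
  coref_depth = 2
  higher_order = attended_antecedent
}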

L-hongbin commented 3 years ago

> Please let me know what the configuration should be to replicate the Joshi et al. (2020) work.

Hey, how about your training results for bert_base? I have trained the model on bert_base with c2f, but only get about 67 F1, while the TensorFlow version gets about 73 F1.

yangjingyi commented 2 years ago

> Please let me know what the configuration should be to replicate the Joshi et al. (2020) work.

Hi, have you replicated the Joshi et al. SpanBERT-large results?

sushantakpani commented 2 years ago

> Hey, how about your training results for bert_base? I have trained the model on bert_base with c2f, but only get about 67 F1, while the TensorFlow version gets about 73 F1.

For bert_base I could achieve 73.3 F1.