mandarjoshi90 / coref

BERT for Coreference Resolution
Apache License 2.0

scripts don't work. #1

Closed fairy-of-9 closed 5 years ago

fairy-of-9 commented 5 years ago

Issue #1

command: ./download_pretrained.sh bert-base
error: HTTP request sent, awaiting response... 404 Not Found


Issue #2

command: GPU=0 python train.py best
error:

Traceback (most recent call last):
  File "train.py", line 22, in <module>
    model = util.get_model(config)
  File "/data/BERT-coref/coref/util.py", line 21, in get_model
    return independent.CorefModel(config)
  File "/data/BERT-coref/coref/independent.py", line 32, in __init__
    self.bert_config = modeling.BertConfig.from_json_file(config["bert_config_file"])
  File "/home/fairy_of_9/anaconda3/envs/bert/lib/python3.6/site-packages/pyhocon/config_tree.py", line 366, in __getitem__
    val = self.get(item)
  File "/home/fairy_of_9/anaconda3/envs/bert/lib/python3.6/site-packages/pyhocon/config_tree.py", line 209, in get
    return self._get(ConfigTree.parse_key(key), 0, default)
  File "/home/fairy_of_9/anaconda3/envs/bert/lib/python3.6/site-packages/pyhocon/config_tree.py", line 151, in _get
    raise ConfigMissingException(u"No configuration setting found for key {key}".format(key='.'.join(key_path[:key_index + 1])))
pyhocon.exceptions.ConfigMissingException: 'No configuration setting found for key bert_config_file'


experiments.conf

best {
  # Edit this
  data_dir = data_set
  model_type = independent
  # Computation limits.
  max_top_antecedents = 50
  max_training_sentences = 5
  top_span_ratio = 0.4
  max_num_speakers = 20
  max_segment_len = 64 #256

  # Learning
  bert_learning_rate = 1e-5
  task_learning_rate = 2e-4
  num_docs = 2802

  # Model hyperparameters.
  dropout_rate = 0.3
  ffnn_size = 500 #1000
  ffnn_depth = 1
  num_epochs = 20
  feature_size = 20
  max_span_width = 30
  use_metadata = true
  use_features = true
  use_segment_distance = true
  model_heads = false #true
  coref_depth = 2
  coarse_to_fine = true
  fine_grained = true
  use_prior = true

  # Other.
  train_path = data_set/train.english.jsonlines
  eval_path = data_set/dev.english.jsonlines
  conll_eval_path = data_set/dev.english.v4_gold_conll
  single_example = true
  genres = ["bc", "bn", "mz", "nw", "pt", "tc", "wb"]
  eval_frequency = 1000
  report_frequency = 100
  log_root = logs
  adam_eps = 1e-6
  task_optimizer = adam
}

bert_base = ${best}{
  num_docs = 2802
  bert_learning_rate = 1e-05
  task_learning_rate = 0.0002
  max_segment_len = 128
  ffnn_size = 3000
  train_path = data_set/train.english.128.jsonlines
  eval_path = data_set/dev.english.128.jsonlines
  conll_eval_path = data_set/dev.english.v4_gold_conll
  max_training_sentences = 11
  bert_config_file = ${best.log_root}/bert_base/bert_config.json
  vocab_file = ${best.log_root}/bert_base/vocab.txt
  tf_checkpoint = ${best.log_root}/bert_base/model.max.ckpt
  init_checkpoint = ${best.log_root}/bert_base/model.max.ckpt
}
...

Is there anything I missed?

mandarjoshi90 commented 5 years ago

Sorry, I had some typos. Please do a git pull. ./download_pretrained.sh downloads the pretrained coreference models.

Do you want to finetune BERT/SpanBERT on OntoNotes? If so, you don't need to use ./download_pretrained.sh. Please see my response to Issue 2 before proceeding to train.py.

If you want to use the already finetuned models, you should run evaluate.py or predict.py after ./download_pretrained.sh bert_base.

Issue 1: Use the command ./download_pretrained.sh bert_base (I replaced the hyphen with an underscore). This will download the BERT-base model finetuned on OntoNotes, i.e., this is not the original BERT model.

Issue 2: Use the command GPU=0 python train.py train_bert_base

The argument to train.py is the name of the config. I added train_bert_base to experiments.conf which will finetune BERT-base on coref.
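Roughly, this is what happens with that argument (a minimal pyhocon sketch of my own, not the repo's exact code): the name selects a section of experiments.conf, and the best section alone never defines bert_config_file, which is why train.py best fails.

# Sketch only: the argument to train.py names a section of experiments.conf.
import pyhocon

conf = pyhocon.ConfigFactory.parse_file("experiments.conf")

# bert_base inherits everything from ${best} and adds the BERT paths.
print(conf["bert_base"].get("bert_config_file", None))  # e.g. logs/bert_base/bert_config.json

# best has no bert_config_file, so looking it up without a default raises
# the ConfigMissingException shown in the traceback above.
print(conf["best"].get("bert_config_file", None))        # None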

train.py assumes you have the original BERT models (not finetuned on OntoNotes) in $data_dir. Please rerun setup_training.sh, or better still, just execute this:

download_bert(){
  model=$1
  # download the original BERT checkpoint zip into $data_dir, unpack it,
  # and move the extracted folder into $data_dir (assumes $data_dir is set,
  # as in setup_training.sh)
  wget -P $data_dir https://storage.googleapis.com/bert_models/2018_10_18/$model.zip
  unzip $data_dir/$model.zip
  rm $data_dir/$model.zip
  mv $model $data_dir/
}

download_bert cased_L-12_H-768_A-12
download_bert cased_L-24_H-1024_A-16

If you want to run a model already finetuned on coref, please use GPU=0 python evaluate.py bert_base. I'll add a link to the original SpanBERT models soon. Please let me know if it works. Thanks!

fairy-of-9 commented 5 years ago

./download_pretrained.sh works!

But I ran ./setup_training.sh (data_dir=data, ontonotes_path=ontonote), and 'ontonote' is an empty directory. What should ontonotes_path point to? I also ran ./download_pretrained.sh bert_base (data_dir=logs).

So this is my directory structure (see the attached screenshots),

and my experiments.conf:

data_dir=data

best {
  # Edit this
  data_dir = data   << edited
  model_type = independent
  # Computation limits.
  max_top_antecedents = 50
  max_training_sentences = 5
  top_span_ratio = 0.4
  max_num_speakers = 20
  max_segment_len = 256

  # Learning
  bert_learning_rate = 1e-5
  task_learning_rate = 2e-4
  num_docs = 2802

  # Model hyperparameters.
  dropout_rate = 0.3
  ffnn_size = 1000
  ffnn_depth = 1
  num_epochs = 20
  feature_size = 20
  max_span_width = 30
  use_metadata = true
  use_features = true
  use_segment_distance = true
  model_heads = true
  #model_heads = false
  coref_depth = 2
  coarse_to_fine = true
  fine_grained = true
  use_prior = true

  # Other.
  train_path = train.english.jsonlines
  eval_path = dev.english.jsonlines
  conll_eval_path = dev.english.v4_gold_conll
  single_example = true
  genres = ["bc", "bn", "mz", "nw", "pt", "tc", "wb"]
  eval_frequency = 1000
  report_frequency = 100
  #log_root = ${data_dir}
  log_root = logs << edited
  adam_eps = 1e-6
  task_optimizer = adam
}

bert_base = ${best}{
  num_docs = 2802
  bert_learning_rate = 1e-05
  task_learning_rate = 0.0002
  max_segment_len = 128
  ffnn_size = 3000
  train_path = ${data_dir}/train.english.128.jsonlines
  eval_path = ${data_dir}/dev.english.128.jsonlines
  conll_eval_path = ${data_dir}/dev.english.v4_gold_conll
  max_training_sentences = 11
  bert_config_file = ${best.log_root}/bert_base/bert_config.json
  vocab_file = ${best.log_root}/bert_base/vocab.txt
  tf_checkpoint = ${best.log_root}/bert_base/model.max.ckpt
  init_checkpoint = ${best.log_root}/bert_base/model.max.ckpt
}

And I used the command GPU=0 python evaluate.py bert_base.

Result:

Use standard file APIs to check for files with this prefix.
Loaded 0 eval examples.
Predicted conll file: /tmp/tmpi6yrp0jo
Official result for muc
version: 8.01 /data/BERT-coref/coref/conll-2012/scorer/v8.01/lib/CorScorer.pm

====== TOTALS =======
Identification of Mentions: Recall: (0 / 0) 0%  Precision: (0 / 0) 0%   F1: 0%
--------------------------------------------------------------------------
Coreference: Recall: (0 / 0) 0% Precision: (0 / 0) 0%   F1: 0%
--------------------------------------------------------------------------

Official result for bcub
version: 8.01 /data/BERT-coref/coref/conll-2012/scorer/v8.01/lib/CorScorer.pm

====== TOTALS =======
Identification of Mentions: Recall: (0 / 0) 0%  Precision: (0 / 0) 0%   F1: 0%
--------------------------------------------------------------------------
Coreference: Recall: (0 / 0) 0% Precision: (0 / 0) 0%   F1: 0%
--------------------------------------------------------------------------

Official result for ceafe
version: 8.01 /data/BERT-coref/coref/conll-2012/scorer/v8.01/lib/CorScorer.pm

====== TOTALS =======
Identification of Mentions: Recall: (0 / 0) 0%  Precision: (0 / 0) 0%   F1: 0%
--------------------------------------------------------------------------
Coreference: Recall: (0 / 0) 0% Precision: (0 / 0) 0%   F1: 0%
--------------------------------------------------------------------------

Average F1 (conll): 0.00%
Average F1 (py): 0.00% on 0 docs
Average precision (py): 0.00%
Average recall (py): 0.00%

I think the eval examples were not loaded.

mandarjoshi90 commented 5 years ago

Weird. But it looks like you have the right model. Could you please check ${data_dir}/train.english.128.jsonlines? Does that seem reasonable, or do you see a lot of UNKs? I suspect minimize.py did not run correctly, in which case it's probably a good idea to re-run it (see the last line of setup_training.sh).

Another thing to check would be ${data_dir}/dev.english.v4_gold_conll, which is the value of conll_eval_path in experiments.conf.
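A quick way to sanity-check both files (just a sketch; the paths below assume data_dir=data and may need adjusting):

# Sanity-check sketch: count eval documents and gold conll lines.
import json

jsonlines_path = "data/dev.english.128.jsonlines"  # your eval_path
conll_path = "data/dev.english.v4_gold_conll"      # your conll_eval_path

with open(jsonlines_path) as f:
    docs = [json.loads(line) for line in f if line.strip()]
print(len(docs), "eval documents")  # 0 would explain "Loaded 0 eval examples"

with open(conll_path) as f:
    print(sum(1 for _ in f), "lines in the gold conll file")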

Thanks!

fairy-of-9 commented 5 years ago

Sorry, all of the conll and jsonlines files are empty...

I ran ./setup_training.sh (data_dir=data, ontonotes_path=ontonote), but 'ontonote' is an empty directory. What should ontonotes_path point to?

I think the ontonotes path is the cause of the error.


I edited my comment. Could you read it again?

mandarjoshi90 commented 5 years ago

Ah, I see. ontonotes_path points to the OntoNotes corpus. Unfortunately, I cannot provide it due to legal issues. More information is available here: https://catalog.ldc.upenn.edu/LDC2013T19

If you want to play with the model though, you can create your own examples and use predict.py. The general format is

{
  "clusters": [],
  "doc_key": "nw",
  "sentences": [["[CLS]", "This", "is", "the", "first", "sentence", ".", "This", "is", "the", "second", ".", "[SEP]"]],
  "speakers": [["spk1", "spk1", "spk1", "spk1", "spk1", "spk1", "spk1", "spk2", "spk2", "spk2", "spk2", "spk2", "spk2"]]
}

This is a dummy example. The values under the sentences key should be BERT-tokenized. Please let me know if you have more questions. If not, I'll close the issue. Thanks!
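If it helps, here is a small sketch (not part of the repo) showing how you could write examples in this format to a .jsonlines file, one JSON object per line; the output file name is just a placeholder:

# Sketch: write one predict.py-style example per line (jsonlines format).
import json

example = {
    "clusters": [],
    "doc_key": "nw",
    "sentences": [["[CLS]", "This", "is", "the", "first", "sentence", ".",
                   "This", "is", "the", "second", ".", "[SEP]"]],
    "speakers": [["spk1"] * 7 + ["spk2"] * 6],  # one speaker label per token
}

with open("my_examples.jsonlines", "w") as f:
    f.write(json.dumps(example) + "\n")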

henlein commented 5 years ago

Thank you for publishing the code. Is the correct general input format for predict.py like the input above, or like the one in the README? You also need the JSON key sentence_map. How is this input structured? Many thanks in advance.

fairy-of-9 commented 5 years ago

Thank you for your comment.

${data_dir}/*.jsonlines and ${data_dir}/*.v4_gold_conll were generated successfully.

mandarjoshi90 commented 5 years ago

I've added cased_config_vocab/trial.jsonlines as a full example. The README has been updated to include a brief explanation of keys. I'm closing this issue since the OP has opened a different one. Thanks!

abhinandansrivastava commented 5 years ago

Hey @fairy-of-9 @mandarjoshi90 @henlein ,

Can you please share a file so I can run a demo? After that, I will be able to produce the format that this repo requires. I am not able to find the trial.jsonlines file in the cased_config_vocab folder.

Thanks in advance.

mandarjoshi90 commented 5 years ago

Might be a better idea to look at the example in the README. I've also restored the trial.jsonlines file.

abhinandansrivastava commented 5 years ago

Hey @mandarjoshi90 ,

Thanks for reply.

I tried making a file in that format, but predict.py is still giving me an error: ValueError: setting an array element with a sequence.

I ran the command below:

python predict.py bert_base cased_config_vocab/TestingFile.jsonlines cased_config_vocab/15.jsonlines

{"doc_key": "wb", "sentences": ["[CLS]", "who", "was", "jim", "henson", "?", "jim", "henson", "was", "a", "puppet", "##eer", ".", "[SEP]"], "speakers": ["[SPL]", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "-", "[SPL]"], "clusters": [[]], "sentence_map": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2], "subtoken_map": [0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 10, 10]} {"doc_key": "wb", "sentences": ["[CLS]", "i", "am", "going", "to", "goa", "and", "love", "biology", ".", "[SEP]"], "speakers": ["[SPL]", "-", "-", "-", "-", "-", "-", "-", "-", "-", "[SPL]"], "clusters": [[]], "sentence_map": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], "subtoken_map": [0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 8]} TestingFile.zip