john-hewitt / structural-probes

Codebase for testing whether hidden states of neural networks encode discrete structures.

Error during new experiment #9

Closed: nlpirate closed this issue 4 years ago

nlpirate commented 4 years ago

I tried to replicate the experiment using different training data, but run_experiment.py reports the following error:

Constructing new results directory at /content/drive/My Drive/structural-probes/example/results_it/BERT-disk-parse-depth-2020-2-10-9-22-7-778090/
Loading BERT Pretrained Embeddings from /content/drive/My Drive/structural-probes/scripts/rawbert_12layers_train.hdf5; using layer 12
The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
100% 213450/213450 [00:00<00:00, 1089299.05B/s]
Using BERT-base-cased tokenizer to align embeddings with PTB tokens
[aligning embeddings]:   0% 0/13121 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_experiment.py", line 242, in <module>
    execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
  File "run_experiment.py", line 170, in execute_experiment
    expt_dataset = dataset_class(args, task)
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 34, in __init__
    self.train_obs, self.dev_obs, self.test_obs = self.read_from_disk()
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 65, in read_from_disk
    train_observations = self.optionally_add_embeddings(train_observations, train_embeddings_path)
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 407, in optionally_add_embeddings
    embeddings = self.generate_subword_embeddings_from_hdf5(observations, pretrained_embeddings_path, layer_index)
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 393, in generate_subword_embeddings_from_hdf5
    single_layer_features = feature_stack[elmo_layer]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/dataset.py", line 476, in __getitem__
    selection = sel.select(self.shape, args, dsid=self.id)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/selections.py", line 94, in select
    sel[args]
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/selections.py", line 261, in __getitem__
    start, count, step, scalar = _handle_simple(self.shape,args)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/selections.py", line 451, in _handle_simple
    x,y,z = _translate_int(int(arg), length)
  File "/usr/local/lib/python3.6/dist-packages/h5py/_hl/selections.py", line 471, in _translate_int
    raise ValueError("Index (%s) out of range (0-%s)" % (exp, length-1))
ValueError: Index (12) out of range (0-11)

Is this an error related to the training data?

john-hewitt commented 4 years ago

Hiya,

This is a problem with the example configs; they shouldn't include layer 12 for the 12-layer BERT model, because the BERT vector-loading code in this codebase 0-indexes layers (so the 12th layer is index 11). Using `model_layer: 11` should work.
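For reference, the setting lives under the `model` block of the example config (sketching from memory, so double-check the key layout in your own file):

```yaml
model:
  model_layer: 11  # 0-indexed; layer 11 is the top layer of BERT-base
```

If you want to sanity-check how many layers your HDF5 file actually holds, something like this works (assuming the one-dataset-per-sentence layout the conversion scripts write, keyed by sentence index):

```python
import h5py

# Each sentence's dataset is shaped (num_layers, num_subwords, hidden_dim),
# so valid values for model_layer run from 0 to num_layers - 1.
with h5py.File('rawbert_12layers_train.hdf5', 'r') as fin:
    print(fin['0'].shape)
```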

nlpirate commented 4 years ago

Thanks a lot! I changed `model_layer` to 11 in the .yaml config file, but now I get the following error:

Constructing new results directory at /content/drive/My Drive/structural-probes/example/results_it/BERT-disk-parse-depth-2020-2-11-9-22-57-151641/
Loading BERT Pretrained Embeddings from /content/drive/My Drive/structural-probes/scripts/rawbert_12layers_train.hdf5; using layer 11
The pre-trained model you are loading is a cased model but you have not set `do_lower_case` to False. We are setting `do_lower_case=False` for you but you may want to check this behavior.
Using BERT-base-cased tokenizer to align embeddings with PTB tokens
[aligning embeddings]:   0% 0/13121 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_experiment.py", line 242, in <module>
    execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
  File "run_experiment.py", line 170, in execute_experiment
    expt_dataset = dataset_class(args, task)
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 34, in __init__
    self.train_obs, self.dev_obs, self.test_obs = self.read_from_disk()
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 65, in read_from_disk
    train_observations = self.optionally_add_embeddings(train_observations, train_embeddings_path)
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 407, in optionally_add_embeddings
    embeddings = self.generate_subword_embeddings_from_hdf5(observations, pretrained_embeddings_path, layer_index)
  File "/content/drive/My Drive/structural-probes/structural-probes/data.py", line 397, in generate_subword_embeddings_from_hdf5
    assert single_layer_features.shape[0] == len(tokenized_sent)
AssertionError

john-hewitt commented 4 years ago

It's possible that the sequence length (after subword tokenization) exceeds BERT's 512-token maximum, so the vectors were truncated when they were generated, and now the stored length doesn't match the real tokenized length. Can you print `single_layer_features.shape[0]` and `len(tokenized_sent)`?
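If that's the cause, a quick way to find the offending sentences is to re-run the same tokenizer over the raw text (untested sketch, using the pytorch-pretrained-bert tokenizer the conversion script relies on; `train.txt` is a placeholder for your whitespace-tokenized, one-sentence-per-line input):

```python
from pytorch_pretrained_bert import BertTokenizer

# Flag sentences whose subword count exceeds BERT's 512-token window.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
with open('train.txt') as fin:  # placeholder path
    for idx, line in enumerate(fin):
        subwords = tokenizer.tokenize(line.strip())
        if len(subwords) + 2 > 512:  # +2 for [CLS] and [SEP]
            print(idx, len(subwords))
```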

john-hewitt commented 4 years ago

Closing; feel free to re-open if issues persist