Awkwafina opened this issue 4 years ago
Hi! I think the issue is with this line:

type: token #{token,subword}

which should be

type: subword #{token,subword}

since BERT uses subword tokenization (and hence the subwords need to be mapped back to corpus tokens). This is an annoying aspect of the config structure as it stands: you might expect that the BERT-disk flag alone would specify this, but alas.
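For context on why that flag matters: with type: subword, the loader collapses BERT's wordpiece vectors into one vector per corpus token. A minimal sketch of that alignment, assuming simple averaging over each token's wordpieces (the helper name and shapes here are illustrative, not the repo's exact API):

```python
import numpy as np

def average_subwords_to_tokens(subword_embeddings, subword_counts):
    """Collapse subword vectors into one vector per corpus token by averaging.

    subword_embeddings: (num_subwords, hidden_dim) array for one sentence.
    subword_counts: how many wordpieces each corpus token was split into,
        e.g. "unaffable" -> ["un", "##aff", "##able"] contributes a count of 3.
    """
    token_vectors = []
    offset = 0
    for count in subword_counts:
        token_vectors.append(subword_embeddings[offset:offset + count].mean(axis=0))
        offset += count
    return np.stack(token_vectors)
```

With type: token, the loader presumably expects one vector per corpus token already, which is why the flag has to agree with how the vectors on disk were produced.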
Hi,
I'm still facing this issue. Maybe there is something wrong with the hdf5 file?
Here the assertion fails with single_layer_features.shape[0] == 68
and len(tokenized_sent) == 74:
```
68 74
[aligning embeddings]:  25% 3173/12543 [00:09<00:27, 339.59it/s]
Traceback (most recent call last):
  File "/content/structural-probes/structural-probes/run_experiment.py", line 242, in <module>
    execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
  File "/content/structural-probes/structural-probes/run_experiment.py", line 170, in execute_experiment
    expt_dataset = dataset_class(args, task)
  File "/content/structural-probes/structural-probes/data.py", line 34, in __init__
    self.train_obs, self.dev_obs, self.test_obs = self.read_from_disk()
  File "/content/structural-probes/structural-probes/data.py", line 65, in read_from_disk
    train_observations = self.optionally_add_embeddings(train_observations, train_embeddings_path)
  File "/content/structural-probes/structural-probes/data.py", line 408, in optionally_add_embeddings
    embeddings = self.generate_subword_embeddings_from_hdf5(observations, pretrained_embeddings_path, layer_index)
  File "/content/structural-probes/structural-probes/data.py", line 398, in generate_subword_embeddings_from_hdf5
    assert single_layer_features.shape[0] == len(tokenized_sent)
AssertionError
```
Or maybe I need to disable assertions to proceed?
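For anyone else hitting this, here is a minimal diagnostic sketch to locate the offending sentences, assuming the repo's HDF5 convention of one dataset per sentence index with shape (num_layers, num_subwords, hidden_dim); the file paths and model name are placeholders you'd substitute:

```python
import h5py
from transformers import BertTokenizer

# Must match the model/tokenizer that was used when the hdf5 was written.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# sentences.txt: one whitespace-tokenized sentence per line, in hdf5 key order.
with h5py.File('embeddings.hdf5', 'r') as hf, open('sentences.txt') as corpus:
    for index, line in enumerate(corpus):
        stored_subwords = hf[str(index)].shape[1]  # (layers, subwords, dim)
        tokens = tokenizer.wordpiece_tokenizer.tokenize(
            '[CLS] ' + line.strip() + ' [SEP]')
        if stored_subwords != len(tokens):
            print(index, stored_subwords, len(tokens), line.strip())
```

Disabling the assertion is not a fix: the subword vectors would then be silently misaligned with the tokens, so it's better to find where the counts diverge.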
Was this ever resolved? I'm experiencing the same issue.
Not sure, but it seems like the process by which vectors are written to disk (which may happen independently of this codebase) and the tokenization performed when loading text from disk are producing different numbers of tokens for the same sequence. This could be because different tokenizers are used, because the data isn't ordered the same way, or because of some preprocessing step; I'm not 100% sure.
Also, added uncertainty: I'm not sure how the huggingface transformers tokenizers API has changed since I wrote this code, back when the library was still pytorch-pretrained-BERT, not transformers.
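One way to rule out a tokenizer mismatch is to regenerate the HDF5 with the very tokenizer the probe uses at load time. A rough, untested sketch with the modern transformers API, writing the layout the loader indexes into (one dataset per line index, shaped (num_layers, num_subwords, hidden_dim)); file paths are placeholders, and this is an adaptation, not the repo's official conversion script:

```python
import h5py
import torch
from transformers import BertModel, BertTokenizer

model_name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

with open('raw_sentences.txt') as fin, h5py.File('embeddings.hdf5', 'w') as fout:
    for index, line in enumerate(fin):
        # Tokenize the same way the loader re-tokenizes, so the counts agree.
        tokens = tokenizer.wordpiece_tokenizer.tokenize(
            '[CLS] ' + line.strip() + ' [SEP]')
        ids = tokenizer.convert_tokens_to_ids(tokens)
        with torch.no_grad():
            hidden_states = model(torch.tensor([ids])).hidden_states
        # Stack all layers, embedding layer included: (num_layers, num_subwords, hidden_dim).
        fout.create_dataset(str(index), data=torch.cat(hidden_states, dim=0).numpy())
```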
In case someone finds it useful, I wrote a version compatible with the new transformers library. I posted the main changes in issue #13.
I am trying to run an experiment on a new dataset. I followed your instructions, but every time I run this code, an 'AssertionError' appears. I don't know whether it is an issue or not; nevertheless, could you check my yaml file? I am working in Google Colab. bert_exam (my yaml file) is located here: https://github.com/Awkwafina/files/blob/master/bert_exam