Awkwafina opened this issue 4 years ago
Hi! I think the issue is with this line:

type: token #{token,subword}

which should be

type: subword #{token,subword}

since BERT uses subword tokenization (and hence the subwords need to be mapped back to corpus tokens). This is an annoying aspect of the config structure as it stands: you might expect that the BERT-disk flag alone would specify this, but alas.
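For context on why that flag matters: with type: subword, the loader collapses BERT's wordpiece vectors into one vector per corpus token. A minimal sketch of that alignment, assuming simple averaging over each token's wordpieces (the helper name and shapes here are illustrative, not the repo's exact API):

```python
import numpy as np

def average_subwords_to_tokens(subword_embeddings, subword_counts):
    """Collapse subword vectors into one vector per corpus token by averaging.

    subword_embeddings: (num_subwords, hidden_dim) array for one sentence.
    subword_counts: how many wordpieces each corpus token was split into,
        e.g. "unaffable" -> ["un", "##aff", "##able"] contributes a count of 3.
    """
    token_vectors = []
    offset = 0
    for count in subword_counts:
        token_vectors.append(subword_embeddings[offset:offset + count].mean(axis=0))
        offset += count
    return np.stack(token_vectors)
```

With type: token, the loader presumably expects one vector per corpus token already, which is why the flag has to agree with how the vectors on disk were produced.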
Hi,
I'm still facing this issue. Maybe there is something wrong with the hdf5 file?
Here the assertion fails with single_layer_features.shape[0] == 68
and len(tokenized_sent) == 74:
```
68 74
[aligning embeddings]:  25% 3173/12543 [00:09<00:27, 339.59it/s]
Traceback (most recent call last):
  File "/content/structural-probes/structural-probes/run_experiment.py", line 242, in <module>
    execute_experiment(yaml_args, train_probe=cli_args.train_probe, report_results=cli_args.report_results)
  File "/content/structural-probes/structural-probes/run_experiment.py", line 170, in execute_experiment
    expt_dataset = dataset_class(args, task)
  File "/content/structural-probes/structural-probes/data.py", line 34, in __init__
    self.train_obs, self.dev_obs, self.test_obs = self.read_from_disk()
  File "/content/structural-probes/structural-probes/data.py", line 65, in read_from_disk
    train_observations = self.optionally_add_embeddings(train_observations, train_embeddings_path)
  File "/content/structural-probes/structural-probes/data.py", line 408, in optionally_add_embeddings
    embeddings = self.generate_subword_embeddings_from_hdf5(observations, pretrained_embeddings_path, layer_index)
  File "/content/structural-probes/structural-probes/data.py", line 398, in generate_subword_embeddings_from_hdf5
    assert single_layer_features.shape[0] == len(tokenized_sent)
AssertionError
```
Or maybe I need to disable assertions to proceed?
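For anyone else hitting this, here is a minimal diagnostic sketch to locate the offending sentences, assuming the repo's HDF5 convention of one dataset per sentence index with shape (num_layers, num_subwords, hidden_dim); the file paths and model name are placeholders you'd substitute:

```python
import h5py
from transformers import BertTokenizer

# Must match the model/tokenizer that was used when the hdf5 was written.
tokenizer = BertTokenizer.from_pretrained('bert-base-cased')

# sentences.txt: one whitespace-tokenized sentence per line, in hdf5 key order.
with h5py.File('embeddings.hdf5', 'r') as hf, open('sentences.txt') as corpus:
    for index, line in enumerate(corpus):
        stored_subwords = hf[str(index)].shape[1]  # (layers, subwords, dim)
        tokens = tokenizer.wordpiece_tokenizer.tokenize(
            '[CLS] ' + line.strip() + ' [SEP]')
        if stored_subwords != len(tokens):
            print(index, stored_subwords, len(tokens), line.strip())
```

Disabling the assertion is not a fix: the subword vectors would then be silently misaligned with the tokens, so it's better to find where the counts diverge.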
Was this ever resolved? I'm experiencing the same issue.
Not sure, but it seems like the process by which vectors are written to disk (which may happen independently of this codebase) and the tokenization performed when loading text from disk are producing different numbers of tokens for the same sequence. This could be because different tokenizers are used, because the data isn't ordered the same way, or because of some preprocessing step; I'm not 100% sure.
Also, added uncertainty: I'm not sure how the huggingface transformers tokenizers API has changed since I wrote this code, back when the library was still pytorch-pretrained-BERT, not transformers.
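One way to rule out a tokenizer mismatch is to regenerate the HDF5 with the very tokenizer the probe uses at load time. A rough, untested sketch with the modern transformers API, writing the layout the loader indexes into (one dataset per line index, shaped (num_layers, num_subwords, hidden_dim)); file paths are placeholders, and this is an adaptation, not the repo's official conversion script:

```python
import h5py
import torch
from transformers import BertModel, BertTokenizer

model_name = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name, output_hidden_states=True)
model.eval()

with open('raw_sentences.txt') as fin, h5py.File('embeddings.hdf5', 'w') as fout:
    for index, line in enumerate(fin):
        # Tokenize the same way the loader re-tokenizes, so the counts agree.
        tokens = tokenizer.wordpiece_tokenizer.tokenize(
            '[CLS] ' + line.strip() + ' [SEP]')
        ids = tokenizer.convert_tokens_to_ids(tokens)
        with torch.no_grad():
            hidden_states = model(torch.tensor([ids])).hidden_states
        # Stack all layers, embedding layer included: (num_layers, num_subwords, hidden_dim).
        fout.create_dataset(str(index), data=torch.cat(hidden_states, dim=0).numpy())
```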
In case someone finds it useful, I wrote a version compatible with the new transformers library. I posted the main changes in issue #13.
I am trying to run an experiment on a new dataset. I followed your instructions, but every time I run this code, an 'AssertionError' appears. I don't know whether it is an issue or not; nevertheless, could you check my yaml file? I am working in Google Colab. bert_exam (my yaml file) is located here: https://github.com/Awkwafina/files/blob/master/bert_exam