do_bert.sh error in bert_embed.py

shamgane commented 2 years ago

I have been trying to run the mBERT extraction script on the ca/head_first dataset with bert-base-multilingual-cased, and I get the following error trace:

Using bert-base-multilingual-cased for data/sent_graphs/ca/head_first/train.conllu
Saving to data/sent_graphs/ca/head_first/train_bert.hdf5
Embedding...
0%| | 0/1173 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "bert_embed.py", line 135, in <module>
    reps, sids = ee(model, args.indata)
  File "bert_embed.py", line 108, in ee
    assert len(sent.split()) == len(ave_reps)
AssertionError
Using bert-base-multilingual-cased for data/sent_graphs/ca/head_first/dev.conllu
Saving to data/sent_graphs/ca/head_first/dev_bert.hdf5
Embedding...
0%| | 0/168 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "bert_embed.py", line 135, in <module>
    reps, sids = ee(model, args.indata)
  File "bert_embed.py", line 108, in ee
    assert len(sent.split()) == len(ave_reps)
AssertionError
Using bert-base-multilingual-cased for data/sent_graphs/ca/head_first/test.conllu
Saving to data/sent_graphs/ca/head_first/test_bert.hdf5
Embedding...
0%| | 0/336 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "bert_embed.py", line 135, in <module>
    reps, sids = ee(model, args.indata)
  File "bert_embed.py", line 108, in ee
    assert len(sent.split()) == len(ave_reps)
AssertionError

It seems that the average_reps function in bert_embed.py returns an empty list [] for the sentence 'Bona ubicació .'. When execution reaches the assert statement, the length of this output (0) does not match the number of whitespace-separated tokens in the sentence (3), so the assertion fails. This sentence is just one example that illustrates the problem. I would really appreciate it if you could guide me on how to fix this.
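For reference, here is a minimal sketch of the invariant that the assert appears to enforce: one averaged vector per whitespace-separated token. It is not the code from bert_embed.py. The names average_subword_reps and check_alignment are hypothetical, the "##" continuation convention is an assumption, and the wordpiece split and vectors below are made up for illustration.

```python
from typing import List, Sequence


def average_subword_reps(pieces: Sequence[str],
                         vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Average wordpiece vectors into one vector per word.

    Hypothetical stand-in for average_reps; assumes BERT-style pieces
    where a leading '##' marks the continuation of the previous word.
    """
    if len(pieces) != len(vectors):
        raise ValueError("pieces and vectors must have the same length")
    groups: List[List[Sequence[float]]] = []
    for piece, vec in zip(pieces, vectors):
        if piece.startswith("##") and groups:
            groups[-1].append(vec)   # continuation of the current word
        else:
            groups.append([vec])     # first piece of a new word
    # Column-wise mean over the pieces that belong to each word.
    return [[sum(col) / len(group) for col in zip(*group)] for group in groups]


def check_alignment(sent: str, ave_reps: Sequence[Sequence[float]]) -> None:
    """Same comparison as the failing assert, but with a readable message."""
    n_tokens = len(sent.split())
    if n_tokens != len(ave_reps):
        raise ValueError(
            f"{n_tokens} tokens but {len(ave_reps)} averaged vectors "
            f"for sentence {sent!r}"
        )


sent = "Bona ubicació ."
# Made-up wordpiece segmentation and toy 2-d vectors, for illustration only.
pieces = ["Bona", "ub", "##ica", "##ció", "."]
vectors = [[1.0, 1.0], [2.0, 0.0], [4.0, 2.0], [6.0, 4.0], [0.0, 0.0]]

reps = average_subword_reps(pieces, vectors)
check_alignment(sent, reps)          # passes: 3 tokens, 3 vectors

try:
    check_alignment(sent, [])        # the situation described above
except ValueError as err:
    message = str(err)
```

Replacing the bare assert with a check like this would at least report which sentence fails and how many vectors came back, which should make it easier to see whether the tokenizer or the averaging step produces the empty output.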

Drdajie commented 2 years ago

Where is the do_bert.sh? I couldn't find it.

jerbarnes commented 2 years ago

It has been renamed and is now called do_embedding.sh.

jerbarnes / sentiment_graphs, issue #4