ner_seq.py", line 146, in convert_examples_to_features assert len(label_ids) == max_seq_length AssertionError

tianke0711 commented 3 years ago

Why do I get this error when I use my own data?

File "BERT-NER-Pytorch-master/processors/ner_seq.py", line 146, in convert_examples_to_features
    assert len(label_ids) == max_seq_length
AssertionError

tianke0711 commented 3 years ago

Some characters, such as ‘’ (for example x  x x 0 0 x), cannot be tokenized. How should this be handled?

lonePatient commented 3 years ago

Either replace them with [unused], or just use [unk] directly.
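
To illustrate the suggestion, here is a minimal sketch of a per-character mapping in which every character that is not in the vocabulary becomes an unknown token, so the token list keeps exactly one entry per label. The function name `tokenize_per_char` and the toy vocabulary are assumptions made for this example; they are not taken from the repository.

```python
def tokenize_per_char(chars, vocab, unk_token="[UNK]"):
    """Return exactly one token per input character.

    Characters missing from `vocab` are mapped to `unk_token` instead of
    being dropped, so the output length always equals the input length.
    """
    return [ch if ch in vocab else unk_token for ch in chars]


# Toy vocabulary (an assumption; a real run would use the model's vocab).
toy_vocab = {"x", "0"}
chars = ["x", "\u200b", "x", "0"]   # second entry is an invisible character
labels = ["O", "O", "O", "O"]

tokens = tokenize_per_char(chars, toy_vocab)
print(tokens, len(tokens) == len(labels))
```

Because nothing is dropped, the labels stay aligned with the tokens, although the model no longer sees what the original character was.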

tianke0711 commented 3 years ago

@lonePatient Thanks.

jacksonjack001 commented 3 years ago

Just replace the invisible characters with visible ones. I ran into this problem only yesterday.
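
A rough sketch of that replacement, assuming the data is already split into one character per label. The helper name `replace_invisible` and the `-` placeholder are hypothetical choices; any visible character that exists in the vocabulary would do.

```python
import unicodedata

# Unicode categories treated as "invisible" here: control, format, separators.
INVISIBLE_CATEGORIES = {"Cc", "Cf", "Zs", "Zl", "Zp"}


def replace_invisible(chars, placeholder="-"):
    """Swap empty, whitespace, control and zero-width characters for a placeholder."""
    cleaned = []
    for ch in chars:
        if len(ch) != 1:
            # Empty strings become the placeholder; longer strings are kept as-is.
            cleaned.append(placeholder if ch.strip() == "" else ch)
        elif ch.isspace() or unicodedata.category(ch) in INVISIBLE_CATEGORIES:
            cleaned.append(placeholder)
        else:
            cleaned.append(ch)
    return cleaned


print(replace_invisible(["a", " ", "\u200b", "b"]))
```

The list keeps its length, so the labels do not need to change.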

lvjiujin commented 3 years ago

I added the following code to solve the problem:

        if len(tokens) != len(label_ids):
            # When example.text_a contains special characters, tokenizer.tokenize can
            # return a token list whose length differs from len(label_ids).
            # Skip such examples instead of failing the assertion later.
            jump_count += 1
            logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            continue
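
For context, the snippet only works inside the loop over examples, after both `tokens` and `label_ids` have been built, since it relies on `continue` and on a counter initialized before the loop. The following self-contained sketch shows that shape; the function `filter_aligned` and the `(tokens, label_ids)` pair format are assumptions for illustration, not the repository's code.

```python
import logging

logger = logging.getLogger(__name__)


def filter_aligned(examples):
    """Keep only examples whose tokens and label_ids have the same length.

    `examples` is a hypothetical list of (tokens, label_ids) pairs.
    Returns the kept examples and the number of skipped ones.
    """
    kept = []
    jump_count = 0  # must be initialized before the loop
    for ex_index, (tokens, label_ids) in enumerate(examples):
        if len(tokens) != len(label_ids):
            jump_count += 1
            logger.info(
                "skipping ex_index = %d: %d tokens vs %d labels",
                ex_index, len(tokens), len(label_ids),
            )
            continue
        kept.append((tokens, label_ids))
    return kept, jump_count


kept, skipped = filter_aligned([(["a", "b"], [0, 1]), (["a"], [0, 1])])
print(len(kept), skipped)
```

Skipping silently discards training data, so it is worth checking the skip count after preprocessing.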

lvjiujin commented 3 years ago

> Just replace the invisible characters with visible ones. I ran into this problem only yesterday.

Replacing one character with another is not a good approach.

lvjiujin commented 3 years ago

> Either replace them with [unused], or just use [unk] directly.

You'd better not use '[UNK]', because you cannot know the exact position of each '[UNK]'. If you do this anyway, the labels may end up shifted to the wrong positions. You can adopt my approach to solve the problem instead.

lj976264709 commented 2 years ago

> I added the following code to solve the problem:
>
>         if len(tokens) != len(label_ids):
>             # When example.text_a contains special characters, tokenizer.tokenize can
>             # return a token list whose length differs from len(label_ids).
>             # Skip such examples instead of failing the assertion later.
>             jump_count += 1
>             logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
>             logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
>             logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
>             continue

May I ask where this code should be inserted?

Kissingbymodi commented 2 years ago

Still, nobody has definitively solved this problem.

zmz125 commented 1 year ago

Found the cause: the data contains rows where the character is a space, with label 0. tokens does not include the space, so label_ids has one extra label and the lengths differ. Solution:

        if len(tokens) != len(label_ids):
            logger.info("-> *** len(tokens) != len(label_ids) *** <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(len(tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            logger.info(len(label_ids))

Find the faulty data and fix it.

zmz125 commented 1 year ago

Add this at line 141 (just above the four len lines):

        if len(tokens) != len(label_ids):
            logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
            logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
            logger.info(len(tokens))
            logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
            logger.info(len(label_ids))
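
The faulty rows can also be located in the data file before training. The sketch below assumes a one-character-per-line file of the form `character<sep>label`, with blank lines between sentences; the function name `find_blank_token_rows` and the sample rows are hypothetical.

```python
def find_blank_token_rows(lines, sep=" "):
    """Return 1-based line numbers whose token field is empty or whitespace."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if not row.strip():
            continue  # a fully blank line is treated as a sentence boundary
        token = row.split(sep)[0]
        if token.strip() == "":
            bad.append(lineno)
    return bad


sample = ["我 O\n", "  O\n", "\n", "爱 B-LOC\n"]
print(find_blank_token_rows(sample))
```

In the sample, the second row has a space as its character, so splitting on the separator leaves an empty token field while the label is still present.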

gsq47 commented 1 year ago

> Add this at line 141 (just above the four len lines):
>
>         if len(tokens) != len(label_ids):
>             logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
>             logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
>             logger.info(len(tokens))
>             logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
>             logger.info(len(label_ids))

In this output, the length of tokens is smaller than the length of label_ids. What causes that?

zmz125 commented 1 year ago

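>> Add this at line 141 (just above the four len lines):
>>
>>         if len(tokens) != len(label_ids):
>>             logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
>>             logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
>>             logger.info(len(tokens))
>>             logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
>>             logger.info(len(label_ids))
>
> In this output, the length of tokens is smaller than the length of label_ids. What causes that?

Compare each character with its mapped label, or go straight to that sample and check whether it is annotated correctly. The 31 in your output is probably the label O, matching the index in the list returned by get_labels.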
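
One way to do that comparison is to map each label id back to its name and pair it with the token at the same position, so the point where the two lists diverge becomes visible. This is only a sketch: `pair_tokens_with_labels` is a hypothetical helper, and the short label list below is a stand-in for whatever get_labels returns in your setup.

```python
from itertools import zip_longest


def pair_tokens_with_labels(tokens, label_ids, label_list):
    """Pair each token with its label name; missing entries show up as None."""
    id2label = dict(enumerate(label_list))
    pairs = []
    for tok, lid in zip_longest(tokens, label_ids):
        label = None if lid is None else id2label.get(lid, lid)
        pairs.append((tok, label))
    return pairs


# Stand-in label list (an assumption, not the repository's real one).
label_list = ["O", "B-LOC"]
for tok, label in pair_tokens_with_labels(["a", "b"], [1, 0, 0], label_list):
    print(tok, label)
```

A trailing pair with `None` as the token means label_ids has more entries than tokens.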

gsq47 commented 1 year ago

May I ask why the annotation labels themselves are being picked up? Every line is in the format text\tlabel.

zmz125 commented 1 year ago

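> May I ask why the annotation labels themselves are being picked up? Every line is in the format text\tlabel.

The format needs to be consistent: either everything is separated by spaces or everything by \t. Change the separator character in the code to match.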
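
As an alternative to changing the code, the data file can be rewritten with a single separator. A minimal sketch, assuming each non-blank row holds exactly one token and one label separated by spaces or tabs; `normalize_row` is a hypothetical helper.

```python
def normalize_row(line, sep=" "):
    """Rewrite a 'token<whitespace>label' row with one fixed separator."""
    row = line.rstrip("\n")
    if not row.strip():
        return ""  # keep blank lines as sentence boundaries
    parts = row.rsplit(maxsplit=1)  # split on any whitespace, from the right
    if len(parts) != 2:
        raise ValueError("expected 'token label', got {!r}".format(row))
    token, label = parts
    return "{}{}{}".format(token, sep, label)


print(normalize_row("我\tO\n"))
print(normalize_row("爱 B-LOC"))
```

Rows that raise `ValueError` are the ones with a missing or whitespace-only token and need to be fixed by hand.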

gsq47 commented 1 year ago

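>> May I ask why the annotation labels themselves are being picked up? Every line is in the format text\tlabel.
>
> The format needs to be consistent: either everything is separated by spaces or everything by \t. Change the separator character in the code to match.

Thank you very much!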

h83671979 commented 1 year ago

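>>> Add this at line 141 (just above the four len lines):
>>>
>>>         if len(tokens) != len(label_ids):
>>>             logger.info("-> *** len(tokens) != len(label_ids)  ***  <-")
>>>             logger.info(" ex_index = {}, tokens = {} ".format(ex_index, tokens))
>>>             logger.info(len(tokens))
>>>             logger.info(" ex_index = {}, label_ids = {} ".format(ex_index, label_ids))
>>>             logger.info(len(label_ids))
>>
>> In this output, the length of tokens is smaller than the length of label_ids. What causes that?
>
> Compare each character with its mapped label, or go straight to that sample and check whether it is annotated correctly. The 31 in your output is probably the label O, matching the index in the list returned by get_labels.

I haven't quite figured out how to handle this. I ran into the same situation: the annotation is correct, but every label after it is 0, that is, everything is replaced with X. My understanding is that it pads every sentence to the same length to make training easier, but I don't know why it raises this error.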
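
The trailing zeros are consistent with padding, and padding also explains the assertion. The sketch below assumes the padding length is computed from the number of input ids and then applied to both lists, which is a common pattern; `pad_features` is a simplified stand-in, not the repository's function.

```python
def pad_features(input_ids, label_ids, max_seq_length, pad_id=0):
    """Pad both lists by the amount needed to fill up input_ids."""
    padding_length = max_seq_length - len(input_ids)
    input_ids = input_ids + [pad_id] * padding_length
    label_ids = label_ids + [pad_id] * padding_length
    return input_ids, label_ids


# Aligned example: both lists end up with max_seq_length entries.
ids, labels = pad_features([101, 5, 6, 102], [0, 1, 1, 0], max_seq_length=8)
print(len(ids), len(labels))

# One extra label before padding: label_ids ends up one entry too long.
ids, labels = pad_features([101, 5, 6, 102], [0, 1, 1, 1, 0], max_seq_length=8)
print(len(ids), len(labels))
```

Under that assumption, any mismatch between tokens and labels before padding survives the padding step, and a check like `assert len(label_ids) == max_seq_length` fails even though the annotation looks correct.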

h83671979 commented 1 year ago

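> Some characters, such as ‘’ (for example x  x x 0 0 x), cannot be tokenized. How should this be handled?

That case is manageable; comments containing things like emoji can simply be deleted. But in my case every English character becomes [unk]. What should I do?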

Violettttee commented 8 months ago

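>> Some characters, such as ‘’ (for example x  x x 0 0 x), cannot be tokenized. How should this be handled?
>
> That case is manageable; comments containing things like emoji can simply be deleted. But in my case every English character becomes [unk]. What should I do?

Hello, have you solved this problem yet? It may be because this model targets Chinese, but I'm not sure where to fix training on English characters.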

lonePatient commented 8 months ago

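>>> Some characters, such as ‘’ (for example x  x x 0 0 x), cannot be tokenized. How should this be handled?
>>
>> That case is manageable; comments containing things like emoji can simply be deleted. But in my case every English character becomes [unk]. What should I do?
>
> Hello, have you solved this problem yet? It may be because this model targets Chinese, but I'm not sure where to fix training on English characters.

Hello! This is mainly intended for Chinese. For English, you can modify the tokenizer yourself, or take a look at the code in another repository, torchblocks. As I recall, it should support English.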
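
One common way to adapt word-level English data, shown here only as a sketch and not as the repository's or torchblocks' method, is to split each word into subword pieces and give the word's label to the first piece and a filler label to the rest. The filler label `X`, the helper `align_labels`, and the `toy_wordpiece` splitter are all assumptions; a real setup would pass in an actual subword tokenizer.

```python
def toy_wordpiece(word, size=3):
    """Hypothetical stand-in for a subword tokenizer: fixed-size chunks."""
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return pieces[:1] + ["##" + p for p in pieces[1:]]


def align_labels(words, labels, tokenize_word, filler_label="X"):
    """Expand word-level labels so there is one label per subword token."""
    tokens, aligned = [], []
    for word, label in zip(words, labels):
        pieces = tokenize_word(word) or ["[UNK]"]  # never drop a word
        tokens.extend(pieces)
        aligned.extend([label] + [filler_label] * (len(pieces) - 1))
    return tokens, aligned


tokens, aligned = align_labels(["Paris", "is"], ["B-LOC", "O"], toy_wordpiece)
print(tokens, aligned)
```

Since every word yields at least one token, the token and label lists always have the same length.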

lonePatient / BERT-NER-Pytorch

ner_seq.py", line 146, in convert_examples_to_features assert len(label_ids) == max_seq_length AssertionError #48