kyzhouhzau / BERT-NER

Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).
MIT License

--max_seq_length=128 -> 150 #1

Closed · dsindex closed this issue 5 years ago

dsindex commented 5 years ago

hi kyzhouhzau~

Thank you for this project :) There is a minor error I'd like to report.

def convert_single_example(ex_index, example, label_list, max_seq_length, tokenizer):
...
    input_ids = tokenizer.convert_tokens_to_ids(ntokens)
    input_mask = [1] * len(input_ids)
    # pad up to max_seq_length; this loop can only lengthen the lists,
    # so nothing catches an input_ids that is already too long
    while len(input_ids) < max_seq_length:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
        label_ids.append(0)
    print('length check', len(input_ids), max_seq_length)
    assert len(input_ids) == max_seq_length  # <-- fails here
    assert len(input_mask) == max_seq_length
    assert len(segment_ids) == max_seq_length
    assert len(label_ids) == max_seq_length
...

tokenizer.convert_tokens_to_ids(ntokens) can produce a list longer than max_seq_length when running with --max_seq_length=128, so the first assertion fails.
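For a concrete picture of the overflow, here is a self-contained sketch (the words and WordPiece splits below are invented for illustration, not taken from CoNLL-2003 or the repo): each word can expand into several sub-tokens, and [CLS]/[SEP] add two more, so an untruncated sentence easily exceeds max_seq_length.

    # Hypothetical illustration of the length overflow; the word pieces
    # are made up, but the arithmetic matches what happens when a
    # sentence is not truncated before [CLS] and [SEP] are added.
    max_seq_length = 8  # tiny on purpose, to make the overflow visible

    words = ["Washington", "renounced", "interoperability"]
    wordpieces = [["Wash", "##ing", "##ton"],
                  ["ren", "##ounce", "##d"],
                  ["inter", "##oper", "##ability"]]

    tokens = [piece for pieces in wordpieces for piece in pieces]
    ntokens = ["[CLS]"] + tokens + ["[SEP]"]

    print(len(words), "words ->", len(ntokens), "tokens")  # 3 words -> 11 tokens
    assert len(ntokens) > max_seq_length  # mirrors the failing assert above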

So I ran with --max_seq_length=150 instead, and it worked fine.

kyzhouhzau commented 5 years ago

@dsindex Thanks for your suggestion. Indeed, --max_seq_length=128 gives a better result, so this is worth fixing properly.

    if len(tokens) >= max_seq_length - 1:
        tokens = tokens[0:(max_seq_length - 2)]
        labels = labels[0:(max_seq_length - 2)]

I found the problem: in this part, I did not trim the length of the labels. Also, >= is necessary so that room is left for [CLS] and [SEP]. I have updated the code and added recall and F-score evaluation. Thanks for your help.
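To see why both parts of the fix matter, here is a minimal, self-contained sketch (truncate is a hypothetical helper, not the repo's exact code). With > instead of >=, a sentence of exactly max_seq_length - 1 tokens would escape truncation and overflow to max_seq_length + 1 once [CLS] and [SEP] are added; and if labels is not trimmed in lockstep, the label_ids assertion still fails.

    # Sketch of the corrected truncation under the assumptions above.
    def truncate(tokens, labels, max_seq_length):
        # >= is needed: len(tokens) == max_seq_length - 1 would already
        # overflow once [CLS] and [SEP] are appended
        if len(tokens) >= max_seq_length - 1:
            tokens = tokens[0:(max_seq_length - 2)]  # reserve 2 slots
            labels = labels[0:(max_seq_length - 2)]  # trim labels in lockstep
        return tokens, labels

    tokens, labels = truncate(["tok"] * 200, ["O"] * 200, max_seq_length=128)
    ntokens = ["[CLS]"] + tokens + ["[SEP]"]
    assert len(ntokens) == 128 and len(tokens) == len(labels)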