here "X" used to represent "##eer","##soo" and so on!

kyzhouhzau / BERT-NER

Use Google's BERT for named entity recognition (CoNLL-2003 as the dataset).

MIT License

1.24k stars, 335 forks

here "X" used to represent "##eer","##soo" and so on! #62

Open fendouai opened 5 years ago

fendouai commented 5 years ago

What are "##eer" and "##soo"?

FallakAsad commented 5 years ago

If a word does not exist in the vocabulary, the tokenizer breaks it down into several smaller pieces that do exist in BERT's vocabulary. For example, say the data contains the word 'Cats', and BERT's vocabulary contains 'Cat' and '##s' but not 'Cats'. The tokenizer will then break 'Cats' into ['Cat', '##s']. This is how BERT handles out-of-vocabulary words. In this implementation of BERT-NER, all sub-word pieces (e.g. '##s') are assigned the label 'X'.
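To make the splitting and the 'X' labelling concrete, here is a minimal sketch in plain Python. It is not the repository's code: `wordpiece`, `align_labels`, and `TOY_VOCAB` are hypothetical names, the vocabulary is a toy one, and the greedy longest-match rule is my assumption of how the split is done.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def align_labels(words, labels, vocab):
    """Keep the word's label on its first piece and give every later piece 'X'."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        for i, piece in enumerate(wordpiece(word, vocab)):
            tokens.append(piece)
            token_labels.append(label if i == 0 else "X")
    return tokens, token_labels


# Toy vocabulary and labels, chosen only for illustration.
TOY_VOCAB = {"Cat", "##s", "sleep"}
tokens, token_labels = align_labels(["Cats", "sleep"], ["B-MISC", "O"], TOY_VOCAB)
print(tokens)        # ['Cat', '##s', 'sleep']
print(token_labels)  # ['B-MISC', 'X', 'O']
```

The first piece keeps the original label, so the number of non-'X' labels still equals the number of original words.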

zwd13122889 commented 4 years ago

Hey, I want to understand the dataset. The first column is the word and the fourth column is the label. What do the second and third columns mean? Another question: in the output label_test.txt, the second and third columns are the same. Do they have some other meaning?

FallakAsad commented 4 years ago

train.txt, dev.txt, and test.txt contain rows of the following form:

AT NNP B-NP O
TOP NNP I-NP O

In these files, the second column holds part-of-speech tags (e.g. 'JJ', 'NNP') and the third column holds chunk labels. Both columns are ignored when training the model, so you can put anything in them.
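As a rough illustration of that layout, the sketch below reads such rows and keeps only the first and last columns. `read_conll` is a hypothetical helper, not the repository's reader, and it assumes whitespace-separated columns with a blank line (or a `-DOCSTART-` line) between sentences.

```python
def read_conll(lines):
    """Group rows into sentences, keeping the word (first column) and label (last column)."""
    sentences, words, labels = [], [], []
    for line in lines:
        line = line.strip()
        # Assumed sentence boundary: a blank line or a -DOCSTART- marker.
        if not line or line.startswith("-DOCSTART-"):
            if words:
                sentences.append((words, labels))
                words, labels = [], []
            continue
        columns = line.split()
        words.append(columns[0])    # column 1: the word
        labels.append(columns[-1])  # column 4: the NER label; columns 2 and 3 are skipped
    if words:
        sentences.append((words, labels))
    return sentences


sample = """AT NNP B-NP O
TOP NNP I-NP O

This - - O
abc - - person
""".splitlines()

parsed = read_conll(sample)
print(parsed)
# [(['AT', 'TOP'], ['O', 'O']), (['This', 'abc'], ['O', 'person'])]
```

Because the middle columns are never read, placeholder values such as '-' work just as well as real tags.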

In labels_test.txt, the second column is the expected label and the third column is the predicted label.
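Given that layout, a quick way to see how often the two columns agree is sketched below. This assumes each row is "token expected predicted" separated by whitespace. `token_accuracy` is a hypothetical helper, and the sample rows are made up for illustration.

```python
def token_accuracy(lines):
    """Fraction of rows whose expected label (column 2) equals the predicted label (column 3)."""
    total = correct = 0
    for line in lines:
        columns = line.split()
        if len(columns) < 3:
            continue  # skip blank or malformed rows
        expected, predicted = columns[1], columns[2]
        total += 1
        correct += expected == predicted
    return correct / total if total else 0.0


# Made-up rows in the assumed "token expected predicted" layout.
rows = [
    "AT O O",
    "TOP B-ORG B-ORG",
    "abc person O",
    ". O O",
]
accuracy = token_accuracy(rows)
print(accuracy)  # 0.75
```

If the two columns are identical on every row, the model predicted every token correctly on that file, so this would return 1.0.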

zwd13122889 commented 4 years ago

OK, I got it. How long will this script take to run on a GPU?

FallakAsad commented 4 years ago

I haven't tried it on the dataset that is included in the repository so I can't tell.

zwd13122889 commented 4 years ago

OK. I ran it on my own data, but I have a problem, shown in the picture: [image: 微信截图_20191030151556]

The left is the author's data; the right is mine.

FallakAsad commented 4 years ago

It seems the code is unable to read your data. Does your train.txt file contain samples? Can you paste some example data here? My training files look like:

I - - O
am - - O
with - - O
: - - O
exy - - person
. - - O

This - - O
: - - O
abc - - person
. - - O

Also, if you have entity labels other than those specified at https://github.com/kyzhouhzau/BERT-NER/blob/master/BERT_NER.py#L227, you need to update that function.
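The function at that link is not reproduced here. As a hedged sketch of the idea only, a label list for custom entities could be built as below. `build_label_list` is a hypothetical name, and the choice of special labels ('[PAD]', 'O', 'X', '[CLS]', '[SEP]') and their order are assumptions, so check them against the actual function before copying anything.

```python
def build_label_list(entity_labels, special_labels=("X", "[CLS]", "[SEP]")):
    """Return a label list containing custom entity labels plus assumed special labels."""
    labels = ["[PAD]", "O"]  # assumed: padding label first, then the 'outside' label
    for label in list(entity_labels) + list(special_labels):
        if label not in labels:  # skip duplicates so every id stays unique
            labels.append(label)
    return labels


# 'person' is the custom entity label used in the sample rows above.
labels = build_label_list(["person"])
label_to_id = {label: i for i, label in enumerate(labels)}
print(labels)  # ['[PAD]', 'O', 'person', 'X', '[CLS]', '[SEP]']
```

Every label that appears in the fourth column of the training files needs an entry in this list. Otherwise the lookup from label to id fails.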

zwd13122889 commented 4 years ago

@FallakAsad Thank you. I have solved this problem.

zwd13122889 commented 4 years ago

@FallakAsad Now I want to change the CRF layer to an LSTM-CRF layer. I don't know how to modify the code. Can you give me some advice?