You can find the details in our paper: https://arxiv.org/pdf/1805.02023.pdf, Section 4.1 / Segmentation.
I did read the paper in detail, but it only describes how the words are obtained, not how the word-level labels are obtained.
So did you use some method to assign a label to each word in the Weibo and Resume datasets?
From the paper, Section 4.1 / Segmentation:
For the OntoNotes and MSRA datasets, gold-standard segmentation is available in the training sections. For OntoNotes, gold segmentation is also available for the development and test sections. On the other hand, no segmentation is available for the MSRA test sections, nor the Weibo / resume datasets. As a result, OntoNotes is leveraged for studying oracle situations where gold segmentation is given. We use the neural word segmentor of Yang et al. (2017a) to automatically segment the development and test sets for word-based NER. In particular, for the OntoNotes and MSRA datasets, we train the segmentor using gold segmentation on their respective training sets. For Weibo and resume, we take the best model of Yang et al. (2017a) off the shelf, which is trained using CTB 6.0 (Xue et al., 2005).
I think this should be the answer.
Hi Jie, I am sorry to bother you.
But I still cannot understand how the ground-truth word labels are generated. I mean the gold labels of the words obtained by auto-segmentation.
Section 4.1 does say how to get the words, but it does not mention the ground truth for Weibo and Resume.
There is no ground-truth segmentation for Weibo and Resume because they are not annotated with segmentation.
Yeah, I know. I mean the ground truth of the NER labels.
Without labels assigned to the auto-segmented words, how can we train a word-based NER model for Weibo and Resume?
Oh, you mean how to assign NER labels to the auto-segmented word sequence?
You can set a simple rule to do that. For example:
char: 我明天去清华大学参观 (I am going to visit Tsinghua University tomorrow)
auto-seg: 我 明天 去 清华 大学 参观
ner: 清华大学 (Tsinghua University, an ORG entity)
You can assign the NER tags to the auto-seg words as:
word-ner: O O O B-ORG E-ORG O
In very rare cases, the auto-segmentation result has a boundary that conflicts with the gold NER entity. For example:
auto-seg: 我 明天 去 清华 大 学参 观
Then you can just split the auto-segmented words along the NER boundaries:
new auto-seg: 我 明天 去 清华 大 学 参 观
word-ner: O O O B-ORG I-ORG E-ORG O O
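For concreteness, here is a minimal Python sketch of this rule (my own illustration, not the code used in the paper): project character-level BIOES NER tags onto an auto-segmented word sequence, splitting any word whose boundary conflicts with an entity boundary. The function names and the exact tag scheme (B/I/E/S + O) are assumptions for illustration.

```python
def char_tags_to_entity_spans(char_tags):
    """Extract (start, end, type) entity spans (end exclusive) from BIOES char tags."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(char_tags):
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        elif tag.startswith("S-"):
            spans.append((i, i + 1, tag[2:]))
        elif tag.startswith("E-") and start is not None:
            spans.append((start, i + 1, ent_type))
            start, ent_type = None, None
    return spans

def project_tags(chars, char_tags, words):
    """Return (new_words, word_tags): words re-split at entity boundaries, with word-level tags."""
    spans = char_tags_to_entity_spans(char_tags)
    # Split points: every word boundary plus every entity boundary,
    # so no resulting segment can straddle an entity edge.
    cuts = {0, len(chars)}
    pos = 0
    for w in words:
        pos += len(w)
        cuts.add(pos)
    for s, e, _ in spans:
        cuts.update((s, e))
    cuts = sorted(cuts)

    new_words, word_tags = [], []
    for s, e in zip(cuts, cuts[1:]):
        new_words.append(chars[s:e])
        tag = "O"
        # Find the entity (if any) that fully covers this segment.
        for es, ee, etype in spans:
            if s >= es and e <= ee:
                if s == es and e == ee:
                    tag = "S-" + etype
                elif s == es:
                    tag = "B-" + etype
                elif e == ee:
                    tag = "E-" + etype
                else:
                    tag = "I-" + etype
                break
        word_tags.append(tag)
    return new_words, word_tags

chars = "我明天去清华大学参观"
char_tags = ["O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "E-ORG", "O", "O"]
print(project_tags(chars, char_tags, ["我", "明天", "去", "清华", "大", "学参", "观"]))
# -> (['我', '明天', '去', '清华', '大', '学', '参', '观'],
#     ['O', 'O', 'O', 'B-ORG', 'I-ORG', 'E-ORG', 'O', 'O'])
```

With the non-conflicting segmentation ["我", "明天", "去", "清华", "大学", "参观"], the same function gives O O O B-ORG E-ORG O, matching the first example above.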
Yeah, that is exactly what I meant.
Sorry for my terrible English.
Thank you very much, I was just confused about the second case.
Good night!
Hi Jie,
I am curious about how you conducted the word-based NER experiments. As you know, neither Weibo nor Resume has a word-based labeled training dataset, so how can we train a word-based model?
I'd like to know whether you used some method to transform the character-based datasets into word-based ones.
I am looking forward to hearing from you soon!
Regards,
Wei