You can find the details in our paper: https://arxiv.org/pdf/1805.02023.pdf, Section 4.1 / Segmentation.
I did read the paper in detail, but it only describes how the words are obtained, not how the word-level labels are obtained.
So did you use some method to assign a label to each word in the Weibo and Resume datasets?
From the paper, Section 4.1 / Segmentation:
For the OntoNotes and MSRA datasets, gold-standard segmentation is available in the training sections. For OntoNotes, gold segmentation is also available for the development and test sections. On the other hand, no segmentation is available for the MSRA test sections, nor the Weibo / resume datasets. As a result, OntoNotes is leveraged for studying oracle situations where gold segmentation is given. We use the neural word segmentor of Yang et al. (2017a) to automatically segment the development and test sets for word-based NER. In particular, for the OntoNotes and MSRA datasets, we train the segmentor using gold segmentation on their respective training sets. For Weibo and resume, we take the best model of Yang et al. (2017a) off the shelf, which is trained using CTB 6.0 (Xue et al., 2005).
I think this should be the answer.
Hi Jie, I am sorry to bother you.
But I still cannot understand how the ground-truth word labels are generated. I mean the gold labels of the words obtained by auto-segmentation.
Section 4.1 does say how to get the words, but it does not mention the ground truth for Weibo and Resume.
There is no ground-truth segmentation for Weibo and Resume because they are not annotated with segmentation.
Yeah, I know. I mean the ground truth of the NER labels.
Without labels assigned to the auto-segmented words, how can we train a word-based NER model for Weibo and Resume?
Oh, you mean how to assign NER labels to the auto-segmented word sequence?
You can set a simple rule to do that. For example:
char: 我明天去清华大学参观 (I am going to visit Tsinghua University tomorrow)
auto-seg: 我 明天 去 清华 大学 参观
ner: 清华大学 (Tsinghua University, an ORG entity)
You can assign the NER tags to the auto-seg words as:
word-ner: O O O B-ORG E-ORG O
In very rare cases, the auto-segmentation result has a boundary that conflicts with the gold NER entity. For example:
auto-seg: 我 明天 去 清华 大 学参 观
Then you can just split the auto-segmented words along the NER boundaries:
new auto-seg: 我 明天 去 清华 大 学 参 观
word-ner: O O O B-ORG I-ORG E-ORG O O
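For concreteness, here is a minimal Python sketch of this rule (my own illustration, not the code used in the paper): project character-level BIOES NER tags onto an auto-segmented word sequence, splitting any word whose boundary conflicts with an entity boundary. The function names and the exact tag scheme (B/I/E/S + O) are assumptions for illustration.

```python
def char_tags_to_entity_spans(char_tags):
    """Extract (start, end, type) entity spans (end exclusive) from BIOES char tags."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(char_tags):
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
        elif tag.startswith("S-"):
            spans.append((i, i + 1, tag[2:]))
        elif tag.startswith("E-") and start is not None:
            spans.append((start, i + 1, ent_type))
            start, ent_type = None, None
    return spans

def project_tags(chars, char_tags, words):
    """Return (new_words, word_tags): words re-split at entity boundaries, with word-level tags."""
    spans = char_tags_to_entity_spans(char_tags)
    # Split points: every word boundary plus every entity boundary,
    # so no resulting segment can straddle an entity edge.
    cuts = {0, len(chars)}
    pos = 0
    for w in words:
        pos += len(w)
        cuts.add(pos)
    for s, e, _ in spans:
        cuts.update((s, e))
    cuts = sorted(cuts)

    new_words, word_tags = [], []
    for s, e in zip(cuts, cuts[1:]):
        new_words.append(chars[s:e])
        tag = "O"
        # Find the entity (if any) that fully covers this segment.
        for es, ee, etype in spans:
            if s >= es and e <= ee:
                if s == es and e == ee:
                    tag = "S-" + etype
                elif s == es:
                    tag = "B-" + etype
                elif e == ee:
                    tag = "E-" + etype
                else:
                    tag = "I-" + etype
                break
        word_tags.append(tag)
    return new_words, word_tags

chars = "我明天去清华大学参观"
char_tags = ["O", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "E-ORG", "O", "O"]
print(project_tags(chars, char_tags, ["我", "明天", "去", "清华", "大", "学参", "观"]))
# -> (['我', '明天', '去', '清华', '大', '学', '参', '观'],
#     ['O', 'O', 'O', 'B-ORG', 'I-ORG', 'E-ORG', 'O', 'O'])
```

With the non-conflicting segmentation ["我", "明天", "去", "清华", "大学", "参观"], the same function gives O O O B-ORG E-ORG O, matching the first example above.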
Yeah, that is exactly what I meant.
Sorry for my terrible English.
Thank you very much, I was just confused about the second case.
Good night!
Hi Jie,
I am curious about how you conducted the word-based NER experiments. As you know, neither Weibo nor Resume has a word-based labeled training dataset, so how can we train a word-based model?
I'd like to know whether you used some method to transform the character-based datasets into word-based ones.
I am looking forward to hearing from you soon!
Regards,
Wei