nlplab / nersuite

http://nersuite.nlplab.org/

Explanation of a set of features used in NERSuite. #30

Open priancho opened 8 years ago

priancho commented 8 years ago

Although the feature set used in NERSuite is not documented, you can check the following two source files:
nersuite/src/nersuite/FExtor.h
nersuite/src/nersuite/FExtor.cpp

With the default window size [-2, 2], NERSuite uses:

1) word features:
1-1) character n-grams (n=2-4) of the current word.
1-2) raw word n-grams (n=1-2) within the window.
1-3) number-normalized word n-grams (n=1-2) within the window. Each run of consecutive digits within a string is normalized into a single 0 (e.g., NF1234 -> NF0).

2) lemma features - same as 1-3), but using the lemma instead of the word.

3) orthographic features - boolean features such as:
3-1) whether the current word contains: a beginning capital letter, digits, only digits, alphanumeric characters, only capital letters and digits, no lowercase letters, all capital letters, capital letter(s) other than the first letter, two consecutive capital letters, a Greek word as a substring, period, hyphen, slash, opening square bracket, closing square bracket, opening round bracket, closing round bracket, colon, semicolon, percentage symbol, apostrophe.
3-2) the length of the current word (boolean feature).
3-3) the length of the current word combined with whether the word is all capitalized (boolean feature).

4) POS features - POS n-grams (n=1-2) within the window.

5) lemma+POS features - Lemma+POS n-grams (n=1-2) within the window.

6) chunk features:
6-1) chunk type of the current word.
6-2) the last raw word of the chunk that the current word belongs to.
6-3) the last lemma of the chunk that the current word belongs to.
6-4) whether the word "the" exists in the leftmost position of the current chunk (boolean feature).

7) dictionary features:
7-1) unlexicalized n-grams of the dictionary matching result (n=1-2) within the window.
7-2) lexicalized n-grams of the dictionary matching result (n=1-2) within the window.
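
The word features in 1) can be sketched in a few lines of Python. This is a minimal illustration, not the code in FExtor.cpp: the function names, the feature-string format (W[...], NW[...]), and the choice to skip offsets that fall outside the sentence are all assumptions.

```python
import re


def normalize_numbers(word):
    """Collapse each run of consecutive digits into a single 0 (NF1234 -> NF0)."""
    return re.sub(r"\d+", "0", word)


def char_ngrams(word, n_min=2, n_max=4):
    """Return all character n-grams of the word for n in [n_min, n_max]."""
    return [word[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(word) - n + 1)]


def window_ngrams(values, i, name, window=(-2, 2)):
    """Unigram and bigram features over the window around position i.

    Offsets outside the sentence are skipped (an assumption; the real
    extractor may pad with boundary symbols instead).
    """
    lo, hi = window
    feats = []
    # Unigrams at offsets lo..hi.
    for off in range(lo, hi + 1):
        j = i + off
        if 0 <= j < len(values):
            feats.append(f"{name}[{off}]={values[j]}")
    # Bigrams over adjacent offsets (lo, lo+1) .. (hi-1, hi).
    for off in range(lo, hi):
        j = i + off
        if 0 <= j and j + 1 < len(values):
            feats.append(f"{name}[{off},{off + 1}]={values[j]}/{values[j + 1]}")
    return feats


# Hypothetical four-token input; features are built for tokens[1].
tokens = ["identify", "NF1234", "in", "cells"]
normalized = [normalize_numbers(t) for t in tokens]
features = (char_ngrams(tokens[1])
            + window_ngrams(tokens, 1, "W")
            + window_ngrams(normalized, 1, "NW"))
```

The same window_ngrams helper would cover the lemma, POS, and lemma+POS features by passing a different list of values.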

NERSuite uses only positive features for dictionary features. For a dictionary dic1 with an entry "NF-kappa B" and the input text "As a result, we can identify NF-kappa B in ...", dictionary features are triggered for each token as follows (the feature notation is not the same as the one in the source code):

As - (empty)
a - (empty)
result - (empty)
, - (empty)
we - (empty)
can - "Dic[2]=dic1", "Dic[2]=dic1_NF", "Dic[1,2]=O/dic1", "Dic[1,2]=O_is/dic1_NF"
identify - "Dic[1]=dic1", "Dic[1]=dic1_NF", "Dic[0,1]=O/dic1", "Dic[0,1]=O_is/dic1_NF", "Dic[1,2]=dic1/dic1", "Dic[1,2]=dic1_NF/dic1_-"
...
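
The "positive features only" behaviour can be sketched as follows. This is a rough illustration under stated assumptions, not NERSuite's implementation: the function names are hypothetical, "NF-kappa B" is assumed to be tokenized as NF / - / kappa / B, matching is a plain exact match, and an unmatched (O) token is lexicalized with its own surface form, which is a guess. A feature is emitted only when at least one position in the n-gram carries a dictionary label.

```python
def match_dictionary(tokens, entry_tokens, dic_name):
    """Label tokens inside an exact match of the entry with dic_name, else 'O'."""
    labels = ["O"] * len(tokens)
    n = len(entry_tokens)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == entry_tokens:
            labels[start:start + n] = [dic_name] * n
    return labels


def dictionary_features(tokens, labels, i, window=(-2, 2)):
    """Unlexicalized and lexicalized dictionary n-grams (n=1-2) around token i."""
    lo, hi = window
    feats = []

    def lex(j):
        # Lexicalized form: dictionary label joined with the token itself.
        return f"{labels[j]}_{tokens[j]}"

    # Unigrams: emitted only for positions with a dictionary match.
    for off in range(lo, hi + 1):
        j = i + off
        if 0 <= j < len(tokens) and labels[j] != "O":
            feats.append(f"Dic[{off}]={labels[j]}")
            feats.append(f"Dic[{off}]={lex(j)}")
    # Bigrams: emitted only if at least one of the two positions matched.
    for off in range(lo, hi):
        j = i + off
        if 0 <= j and j + 1 < len(tokens):
            if labels[j] != "O" or labels[j + 1] != "O":
                feats.append(f"Dic[{off},{off + 1}]={labels[j]}/{labels[j + 1]}")
                feats.append(f"Dic[{off},{off + 1}]={lex(j)}/{lex(j + 1)}")
    return feats


tokens = ["As", "a", "result", ",", "we", "can", "identify",
          "NF", "-", "kappa", "B", "in"]
labels = match_dictionary(tokens, ["NF", "-", "kappa", "B"], "dic1")

for i, tok in enumerate(tokens[:7]):
    print(tok, "-", dictionary_features(tokens, labels, i) or "(empty)")
```

Tokens more than two positions away from the match ("As" through "we") produce no dictionary features at all, which is what keeps the feature set sparse.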

Some of these features (especially the orthographic ones) are redundant because of the default tokenization scheme.
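
For reference, a handful of the orthographic checks from 3-1) might look like the sketch below. The feature names, the short Greek word list, and the exact definition of each check are assumptions made for illustration; FExtor.cpp is the authority on what NERSuite actually tests.

```python
import re

# Hypothetical, deliberately short list of Greek words.
GREEK_WORDS = ("alpha", "beta", "gamma", "delta", "kappa")


def orthographic_features(word):
    """Return the names of the boolean orthographic checks that fire for word."""
    checks = {
        "InitCap": word[:1].isupper(),
        "HasDigit": any(c.isdigit() for c in word),
        "AllDigits": word.isdigit(),
        "NoLower": not any(c.islower() for c in word),
        "AllCaps": re.fullmatch(r"[A-Z]+", word) is not None,
        "InnerCap": any(c.isupper() for c in word[1:]),
        "TwoCaps": re.search(r"[A-Z]{2}", word) is not None,
        "Greek": any(g in word.lower() for g in GREEK_WORDS),
        "HasPeriod": "." in word,
        "HasHyphen": "-" in word,
        "HasSlash": "/" in word,
    }
    # Only checks that are true become features.
    return [name for name, fired in checks.items() if fired]


print(orthographic_features("NF-kappa"))
print(orthographic_features("1234"))
```

Whether a check such as HasHyphen can ever fire depends on whether the tokenizer leaves that character inside a token.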