hankcs / HanLP

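Chinese word segmentation, part-of-speech tagging, named entity recognition, dependency parsing, constituency parsing, semantic dependency parsing, semantic role labeling, coreference resolution, style transfer, semantic similarity, new word discovery, keyword and phrase extraction, automatic summarization, text classification and clustering, pinyin and Simplified/Traditional Chinese conversion, natural language processing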
https://hanlp.hankcs.com/
Apache License 2.0
33.82k stars, 10.12k forks

Get the index of a token or a ner for example in the input text #1623

Closed maky-hnou closed 3 years ago

maky-hnou commented 3 years ago

Describe the feature and the current behavior/state.

I've been looking through the HanLP source code and documentation for a way to get the index of a token or a named entity in the original input text. I was not able to find a solution (i.e., I only get the index of a word in the tokens list).
Here is an example:

import hanlp
model = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE, devices=0)
ner = model("My name is John Smith. I am 19 and a student in college.")
print(ner)

The output is:
{'tok': ['My', 'name', 'is', 'John', 'Smith', '.', 'I', 'am', '19', 'and', 'a', 'student', 'in', 'college', '.'], 'ner': [('John Smith', 'PERSON', 3, 5), ('19', 'DATE', 8, 9)], 'srl': [[('My name', 'ARG1', 0, 2), ('is', 'PRED', 2, 3), ('John Smith', 'ARG2', 3, 5)], [('I', 'ARG1', 6, 7), ('am', 'PRED', 7, 8), ('19 and a student in college', 'ARG2', 8, 14)]], 'sdp/dm': [[], [(1, 'poss'), (3, 'ARG1')], [(1, 'orphan')], [(1, 'orphan')], [(3, 'ARG2'), (4, 'compound')], [(1, 'orphan')], [(8, 'ARG1')], [], [(8, 'ARG2')], [(1, 'orphan')], [(1, 'orphan')], [(9, '_and_c'), (11, 'BV'), (13, 'ARG1')], [(1, 'orphan')], [(13, 'ARG2')], [(1, 'orphan')]], 'sdp/pas': [[], [(1, 'det_ARG1'), (3, 'verb_ARG1')], [(1, 'orphan')], [(1, 'orphan')], [(3, 'verb_ARG2'), (4, 'noun_ARG1')], [(1, 'orphan')], [(8, 'verb_ARG1')], [(6, 'conj_ARG2')], [(10, 'coord_ARG1')], [(8, 'verb_ARG2')], [(1, 'orphan')], [(10, 'coord_ARG2'), (11, 'det_ARG1'), (13, 'prep_ARG1')], [(1, 'orphan')], [(13, 'prep_ARG2')], [(1, 'orphan')]], 'sdp/psd': [[(2, 'APP')], [(3, 'ACT-arg')], [(6, 'CONJ.member')], [(5, 'NE')], [(3, 'PAT-arg')], [], [(8, 'ACT-arg')], [(6, 'CONJ.member'), (10, 'CONJ.member')], [(8, 'PAT-arg')], [(6, 'CONJ.member')], [(6, 'orphan')], [(8, 'PAT-arg'), (10, 'CONJ.member')], [(6, 'orphan')], [(12, 'LOC')], [(6, 'orphan')]], 'con': ['TOP', [['S', [['S', [['NP', [['PRON', ['My']], ['NOUN', ['name']]]], ['VP', [['AUX', ['is']], ['NP', [['PROPN', ['John']], ['PROPN', ['Smith']]]]]]]], ['PUNCT', ['.']], ['S', [['NP', [['PRON', ['I']]]], ['VP', [['AUX', ['am']], ['NP', [['NP', [['NUM', ['19']]]], ['CCONJ', ['and']], ['NP', [['NP', [['DET', ['a']], ['NOUN', ['student']]]], ['PP', [['ADP', ['in']], ['NP', [['NOUN', ['college']]]]]]]]]]]]]], ['PUNCT', ['.']]]]]], 'lem': ['my', 'name', 'be', 'John', 'Smith', '.', 'I', 'be', '19', 'and', 'a', 'student', 'in', 'college', '.'], 'pos': ['PRON', 'NOUN', 'AUX', 'PROPN', 'PROPN', 'PUNCT', 'PRON', 'AUX', 'NUM', 'CCONJ', 'DET', 'NOUN', 'ADP', 'NOUN', 'PUNCT'], 'fea': 
['Number=Sing|Person=1|Poss=Yes|PronType=Prs', 'Number=Sing', 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin', 'Number=Sing', 'Number=Sing', '_', 'Case=Nom|Number=Sing|Person=1|PronType=Prs', 'Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin', 'NumType=Card', '_', 'Definite=Ind|PronType=Art', 'Number=Sing', '_', 'Number=Sing', '_'], 'dep': [(2, 'nmod:poss'), (4, 'nsubj'), (5, 'cop'), (0, 'root'), (4, 'flat'), (4, 'punct'), (9, 'nsubj'), (9, 'cop'), (4, 'parataxis'), (12, 'cc'), (12, 'det'), (9, 'conj'), (14, 'case'), (12, 'nmod'), (9, 'punct')]}

For example, instead of (or in addition to) getting ('John Smith', 'PERSON', 3, 5), is it possible to get ('John Smith', 'PERSON', 11, 21), where 11 and 21 are the start and end character offsets of 'John Smith' in the original input text?

Will this change the current api? How? It is not necessary to change the current API; this could be added as an option.

Who will benefit with this feature? Everyone who uses HanLP.

Are you willing to contribute it (Yes/No): No.

System information

Any other info

hankcs commented 3 years ago

Hi, good question.

Long story short, NER operates at the token level, so it can't output character-level offsets. You need to hack the tokenizer instead. It's possible, but it takes some tricks to modify the tokenizer's outputs. The tokenizer operates on subwords, and each subword is a span at the character level. If you put a breakpoint here:

https://github.com/hankcs/HanLP/blob/68b87d44ca2cbec1fed3c701528d35568fa81d35/hanlp/components/tokenizers/transformer.py#L154

print(sub_tokens)
print(batch['token_subtoken_offsets'][0])
['My', 'name', 'is', 'John', 'Smith', '.', 'I', 'am', '19', 'and', 'a', 'student', 'in', 'college', '."']
[(0, 2), (3, 7), (8, 10), (11, 15), (16, 21), (21, 22), (23, 24), (25, 27), (28, 30), (31, 34), (35, 36), (37, 44), (45, 47), (48, 55), (55, 57)]

You will find the offset of each subword. So, you can calculate the start/end offset of each token based on this and write it to the output dict by overriding this method:

https://github.com/hankcs/HanLP/blob/7229aea94ce3aac813b6713ece0de76a62d107b1/hanlp/components/mtl/tasks/__init__.py#L289
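Once per-token character offsets are available, mapping a token-level NER tuple to character offsets is a small lookup. The sketch below is plain Python and does not call HanLP. The names `align_tokens` and `to_char_spans` are hypothetical. It assumes the end index in each NER tuple is exclusive, as in the output above where (3, 5) covers 'John' and 'Smith'. `align_tokens` is a fallback that is not from this thread: it recovers offsets by searching for each token in the input text, which only works when the tokenizer leaves the surface text unchanged. If you extract `token_subtoken_offsets` as described above, pass those to `to_char_spans` instead.

```python
from typing import List, Sequence, Tuple

Span = Tuple[int, int]              # (char_start, char_end), end exclusive
Entity = Tuple[str, str, int, int]  # (mention, label, tok_start, tok_end)


def align_tokens(text: str, tokens: Sequence[str]) -> List[Span]:
    """Hypothetical fallback: recover char offsets by a left-to-right search.

    Assumes every token appears verbatim in `text`, in order.
    """
    spans: List[Span] = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError(f"token {token!r} not found after offset {cursor}")
        cursor = start + len(token)
        spans.append((start, cursor))
    return spans


def to_char_spans(text: str, entities: Sequence[Entity],
                  token_offsets: Sequence[Span]) -> List[Entity]:
    """Convert token-level spans (end exclusive) to char-level spans."""
    result: List[Entity] = []
    for _mention, label, tok_start, tok_end in entities:
        char_start = token_offsets[tok_start][0]
        char_end = token_offsets[tok_end - 1][1]
        # Slice the original text so the mention keeps its exact surface form.
        result.append((text[char_start:char_end], label, char_start, char_end))
    return result


text = "My name is John Smith. I am 19 and a student in college."
tokens = ['My', 'name', 'is', 'John', 'Smith', '.', 'I', 'am', '19', 'and',
          'a', 'student', 'in', 'college', '.']
ner = [('John Smith', 'PERSON', 3, 5), ('19', 'DATE', 8, 9)]

offsets = align_tokens(text, tokens)
char_ner = to_char_spans(text, ner, offsets)
print(char_ner)
# [('John Smith', 'PERSON', 11, 21), ('19', 'DATE', 28, 30)]
```

The same `to_char_spans` lookup applies to the 'srl' tuples, which use the same (text, label, start, end) layout.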

maky-hnou commented 3 years ago

That answers my question.
Thank you, Dr. Han.