Closed: maky-hnou closed this issue 3 years ago
Hi, good question.
Long story short, NER works at the token level, so it can't output character-level offsets directly. You need to hack the tokenizer instead. It's possible, but it requires some tricks to modify the tokenizer's outputs. The tokenizer operates on subwords, and each subword is a span at the character level. If you put a breakpoint here:
print(sub_tokens)
print(batch['token_subtoken_offsets'][0])
['My', 'name', 'is', 'John', 'Smith', '.', 'I', 'am', '19', 'and', 'a', 'student', 'in', 'college', '."']
[(0, 2), (3, 7), (8, 10), (11, 15), (16, 21), (21, 22), (23, 24), (25, 27), (28, 30), (31, 34), (35, 36), (37, 44), (45, 47), (48, 55), (55, 57)]
You will find the offset of each subword. So you can calculate the start/end offset of each token based on this and write them to the output dict by overriding this method:
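The method referred to above is not reproduced here. Independently of where the result gets written, the offset arithmetic itself can be sketched in plain Python. This is only a minimal sketch, not HanLP's implementation: `token_spans_to_char_spans` is a hypothetical helper name, and it assumes you already have one `(start, end)` character span per token, like the list printed above, plus NER tuples whose last two fields are a token start index and an exclusive token end index.

```python
from typing import List, Tuple

# Hypothetical helper (not part of HanLP): converts token-level entity spans
# into character-level spans using one (start, end) char offset per token.
def token_spans_to_char_spans(
    entities: List[Tuple[str, str, int, int]],
    token_offsets: List[Tuple[int, int]],
) -> List[Tuple[str, str, int, int]]:
    char_spans = []
    for text, label, tok_start, tok_end in entities:
        # Char start of the first token in the entity.
        char_start = token_offsets[tok_start][0]
        # Token end is exclusive, so the last token is at tok_end - 1.
        char_end = token_offsets[tok_end - 1][1]
        char_spans.append((text, label, char_start, char_end))
    return char_spans

# Offsets copied from the printed output in this thread.
token_offsets = [(0, 2), (3, 7), (8, 10), (11, 15), (16, 21), (21, 22),
                 (23, 24), (25, 27), (28, 30), (31, 34), (35, 36),
                 (37, 44), (45, 47), (48, 55), (55, 57)]
# Token-level NER output copied from the issue.
ner = [('John Smith', 'PERSON', 3, 5), ('19', 'DATE', 8, 9)]

char_ner = token_spans_to_char_spans(ner, token_offsets)
print(char_ner)
# → [('John Smith', 'PERSON', 11, 21), ('19', 'DATE', 28, 30)]
```

With the data from this thread, 'John Smith' comes out as (11, 21), which is the span asked for in the issue.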
That answers my question.
Thank you Dr. Han
Describe the feature and the current behavior/state.
I've been looking into the hanlp source code and documentation to find a way to get the index of a token or a named entity in the original input text. I was not able to find a solution to this problem (i.e., I only get the index of a word in the tokens list).
Here is an example:
The output is:
{'tok': ['My', 'name', 'is', 'John', 'Smith', '.', 'I', 'am', '19', 'and', 'a', 'student', 'in', 'college', '.'], 'ner': [('John Smith', 'PERSON', 3, 5), ('19', 'DATE', 8, 9)], 'srl': [[('My name', 'ARG1', 0, 2), ('is', 'PRED', 2, 3), ('John Smith', 'ARG2', 3, 5)], [('I', 'ARG1', 6, 7), ('am', 'PRED', 7, 8), ('19 and a student in college', 'ARG2', 8, 14)]], 'sdp/dm': [[], [(1, 'poss'), (3, 'ARG1')], [(1, 'orphan')], [(1, 'orphan')], [(3, 'ARG2'), (4, 'compound')], [(1, 'orphan')], [(8, 'ARG1')], [], [(8, 'ARG2')], [(1, 'orphan')], [(1, 'orphan')], [(9, '_and_c'), (11, 'BV'), (13, 'ARG1')], [(1, 'orphan')], [(13, 'ARG2')], [(1, 'orphan')]], 'sdp/pas': [[], [(1, 'det_ARG1'), (3, 'verb_ARG1')], [(1, 'orphan')], [(1, 'orphan')], [(3, 'verb_ARG2'), (4, 'noun_ARG1')], [(1, 'orphan')], [(8, 'verb_ARG1')], [(6, 'conj_ARG2')], [(10, 'coord_ARG1')], [(8, 'verb_ARG2')], [(1, 'orphan')], [(10, 'coord_ARG2'), (11, 'det_ARG1'), (13, 'prep_ARG1')], [(1, 'orphan')], [(13, 'prep_ARG2')], [(1, 'orphan')]], 'sdp/psd': [[(2, 'APP')], [(3, 'ACT-arg')], [(6, 'CONJ.member')], [(5, 'NE')], [(3, 'PAT-arg')], [], [(8, 'ACT-arg')], [(6, 'CONJ.member'), (10, 'CONJ.member')], [(8, 'PAT-arg')], [(6, 'CONJ.member')], [(6, 'orphan')], [(8, 'PAT-arg'), (10, 'CONJ.member')], [(6, 'orphan')], [(12, 'LOC')], [(6, 'orphan')]], 'con': ['TOP', [['S', [['S', [['NP', [['PRON', ['My']], ['NOUN', ['name']]]], ['VP', [['AUX', ['is']], ['NP', [['PROPN', ['John']], ['PROPN', ['Smith']]]]]]]], ['PUNCT', ['.']], ['S', [['NP', [['PRON', ['I']]]], ['VP', [['AUX', ['am']], ['NP', [['NP', [['NUM', ['19']]]], ['CCONJ', ['and']], ['NP', [['NP', [['DET', ['a']], ['NOUN', ['student']]]], ['PP', [['ADP', ['in']], ['NP', [['NOUN', ['college']]]]]]]]]]]]]], ['PUNCT', ['.']]]]]], 'lem': ['my', 'name', 'be', 'John', 'Smith', '.', 'I', 'be', '19', 'and', 'a', 'student', 'in', 'college', '.'], 'pos': ['PRON', 'NOUN', 'AUX', 'PROPN', 'PROPN', 'PUNCT', 'PRON', 'AUX', 'NUM', 'CCONJ', 'DET', 'NOUN', 'ADP', 'NOUN', 'PUNCT'], 'fea': 
['Number=Sing|Person=1|Poss=Yes|PronType=Prs', 'Number=Sing', 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin', 'Number=Sing', 'Number=Sing', '_', 'Case=Nom|Number=Sing|Person=1|PronType=Prs', 'Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin', 'NumType=Card', '_', 'Definite=Ind|PronType=Art', 'Number=Sing', '_', 'Number=Sing', '_'], 'dep': [(2, 'nmod:poss'), (4, 'nsubj'), (5, 'cop'), (0, 'root'), (4, 'flat'), (4, 'punct'), (9, 'nsubj'), (9, 'cop'), (4, 'parataxis'), (12, 'cc'), (12, 'det'), (9, 'conj'), (14, 'case'), (12, 'nmod'), (9, 'punct')]}
For example, instead of (or in addition to) getting
('John Smith', 'PERSON', 3, 5)
is it possible to get ('John Smith', 'PERSON', 11, 21)
where 11 and 21 are the start and end indexes of 'John Smith' in the original input text?
Will this change the current API? How? It is not necessary to change the current API; it could be added as an option.
Who will benefit from this feature? Everyone who uses hanlp.
Are you willing to contribute it (Yes/No): No.
System information: hanlp==2.1.0a12
Any other info
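For readers who only have the `tok` list and the raw text, a fallback that does not touch the tokenizer is to search for each token in the text from left to right. This is a different technique from the tokenizer-offset approach suggested in the reply, and only a rough sketch: `align_tokens` is a hypothetical helper, the input sentence below is an assumption reconstructed from the tokens in the output (the original example code is not shown), and it only works when every token appears verbatim in the text, i.e. the tokenizer did not normalize or rewrite characters.

```python
from typing import List, Tuple

# Hypothetical helper (not part of HanLP): recovers character offsets by
# scanning the raw text for each token, always continuing after the last match.
def align_tokens(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    offsets = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start < 0:
            # The token was altered by the tokenizer, so alignment is impossible.
            raise ValueError(f"token {tok!r} not found after position {cursor}")
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets

# Assumed input text, reconstructed from the tokens in the output above.
text = "My name is John Smith. I am 19 and a student in college."
tokens = ['My', 'name', 'is', 'John', 'Smith', '.', 'I', 'am', '19',
          'and', 'a', 'student', 'in', 'college', '.']

offsets = align_tokens(text, tokens)
# Token span (3, 5) for 'John Smith' maps to characters 11..21.
start, end = offsets[3][0], offsets[5 - 1][1]
print(start, end, text[start:end])
# → 11 21 John Smith
```

Moving the cursor past each match keeps repeated tokens, such as the two '.' tokens here, from being matched to the same position.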