Get the index of a token or a ner for example in the input text

Describe the feature and the current behavior/state.

I've been looking through the HanLP source code and documentation for a way to get the index of a token or a named entity in the original input text, but I could not find one (i.e., I only get the index of a word in the tokens list).
Here is an example:

import hanlp
model = hanlp.load(hanlp.pretrained.mtl.UD_ONTONOTES_TOK_POS_LEM_FEA_NER_SRL_DEP_SDP_CON_XLMR_BASE, devices=0)
ner = model("My name is John Smith. I am 19 and a student in college.")
print(ner)

The output is:
{'tok': ['My', 'name', 'is', 'John', 'Smith', '.', 'I', 'am', '19', 'and', 'a', 'student', 'in', 'college', '.'], 'ner': [('John Smith', 'PERSON', 3, 5), ('19', 'DATE', 8, 9)], 'srl': [[('My name', 'ARG1', 0, 2), ('is', 'PRED', 2, 3), ('John Smith', 'ARG2', 3, 5)], [('I', 'ARG1', 6, 7), ('am', 'PRED', 7, 8), ('19 and a student in college', 'ARG2', 8, 14)]], 'sdp/dm': [[], [(1, 'poss'), (3, 'ARG1')], [(1, 'orphan')], [(1, 'orphan')], [(3, 'ARG2'), (4, 'compound')], [(1, 'orphan')], [(8, 'ARG1')], [], [(8, 'ARG2')], [(1, 'orphan')], [(1, 'orphan')], [(9, '_and_c'), (11, 'BV'), (13, 'ARG1')], [(1, 'orphan')], [(13, 'ARG2')], [(1, 'orphan')]], 'sdp/pas': [[], [(1, 'det_ARG1'), (3, 'verb_ARG1')], [(1, 'orphan')], [(1, 'orphan')], [(3, 'verb_ARG2'), (4, 'noun_ARG1')], [(1, 'orphan')], [(8, 'verb_ARG1')], [(6, 'conj_ARG2')], [(10, 'coord_ARG1')], [(8, 'verb_ARG2')], [(1, 'orphan')], [(10, 'coord_ARG2'), (11, 'det_ARG1'), (13, 'prep_ARG1')], [(1, 'orphan')], [(13, 'prep_ARG2')], [(1, 'orphan')]], 'sdp/psd': [[(2, 'APP')], [(3, 'ACT-arg')], [(6, 'CONJ.member')], [(5, 'NE')], [(3, 'PAT-arg')], [], [(8, 'ACT-arg')], [(6, 'CONJ.member'), (10, 'CONJ.member')], [(8, 'PAT-arg')], [(6, 'CONJ.member')], [(6, 'orphan')], [(8, 'PAT-arg'), (10, 'CONJ.member')], [(6, 'orphan')], [(12, 'LOC')], [(6, 'orphan')]], 'con': ['TOP', [['S', [['S', [['NP', [['PRON', ['My']], ['NOUN', ['name']]]], ['VP', [['AUX', ['is']], ['NP', [['PROPN', ['John']], ['PROPN', ['Smith']]]]]]]], ['PUNCT', ['.']], ['S', [['NP', [['PRON', ['I']]]], ['VP', [['AUX', ['am']], ['NP', [['NP', [['NUM', ['19']]]], ['CCONJ', ['and']], ['NP', [['NP', [['DET', ['a']], ['NOUN', ['student']]]], ['PP', [['ADP', ['in']], ['NP', [['NOUN', ['college']]]]]]]]]]]]]], ['PUNCT', ['.']]]]]], 'lem': ['my', 'name', 'be', 'John', 'Smith', '.', 'I', 'be', '19', 'and', 'a', 'student', 'in', 'college', '.'], 'pos': ['PRON', 'NOUN', 'AUX', 'PROPN', 'PROPN', 'PUNCT', 'PRON', 'AUX', 'NUM', 'CCONJ', 'DET', 'NOUN', 'ADP', 'NOUN', 'PUNCT'], 'fea': 
['Number=Sing|Person=1|Poss=Yes|PronType=Prs', 'Number=Sing', 'Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin', 'Number=Sing', 'Number=Sing', '_', 'Case=Nom|Number=Sing|Person=1|PronType=Prs', 'Mood=Ind|Number=Sing|Person=1|Tense=Pres|VerbForm=Fin', 'NumType=Card', '_', 'Definite=Ind|PronType=Art', 'Number=Sing', '_', 'Number=Sing', '_'], 'dep': [(2, 'nmod:poss'), (4, 'nsubj'), (5, 'cop'), (0, 'root'), (4, 'flat'), (4, 'punct'), (9, 'nsubj'), (9, 'cop'), (4, 'parataxis'), (12, 'cc'), (12, 'det'), (9, 'conj'), (14, 'case'), (12, 'nmod'), (9, 'punct')]}

For example, instead of (or in addition to) getting ('John Smith', 'PERSON', 3, 5), is it possible to get ('John Smith', 'PERSON', 11, 21), where 11 and 21 are the start and end character offsets of 'John Smith' in the original input text?

Will this change the current api? How?
It is not necessary to change the current API; it could be added as an option.

Who will benefit with this feature?
Everyone who uses HanLP.

Are you willing to contribute it (Yes/No): No.

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
Python version: 3.6
HanLP version: hanlp==2.1.0a12

Any other info

[x] I've carefully completed this form.

Hi, good question.

Long story short, NER operates at the token level, so by itself it can't output character-level offsets. You need to hack the tokenizer instead. It's possible, but it requires some tricks to modify the tokenizer's outputs. The tokenizer operates on subwords, and each subword is a span at the character level. If you put a breakpoint here:

https://github.com/hankcs/HanLP/blob/68b87d44ca2cbec1fed3c701528d35568fa81d35/hanlp/components/tokenizers/transformer.py#L154

print(sub_tokens)
print(batch['token_subtoken_offsets'][0])
['My', 'name', 'is', 'John', 'Smith', '.', 'I', 'am', '19', 'and', 'a', 'student', 'in', 'college', '."']
[(0, 2), (3, 7), (8, 10), (11, 15), (16, 21), (21, 22), (23, 24), (25, 27), (28, 30), (31, 34), (35, 36), (37, 44), (45, 47), (48, 55), (55, 57)]

You will find the offset of each subword. So, you can calculate the start/end offset of each token based on this and write it to the output dict by overriding this method:

https://github.com/hankcs/HanLP/blob/7229aea94ce3aac813b6713ece0de76a62d107b1/hanlp/components/mtl/tasks/__init__.py#L289
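
Once per-token character offsets are available, mapping the token-level NER spans to character-level spans is plain post-processing. The following is only a minimal sketch of that step, not HanLP API: the helper names `align_tokens` and `token_spans_to_char_spans` are hypothetical. `align_tokens` is a fallback that recovers offsets by scanning the input text left to right, which assumes every token appears verbatim in the text (this may not hold if the tokenizer normalizes characters). If you take the offsets from `batch['token_subtoken_offsets']` as described above, you can skip it and pass those offsets directly.

```python
from typing import List, Tuple

Span = Tuple[str, str, int, int]  # (text, label, start, end), end exclusive


def align_tokens(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Hypothetical fallback: find each token in `text`, scanning left to right.

    Assumes tokens occur verbatim and in order in the original text.
    """
    offsets = []
    cursor = 0
    for tok in tokens:
        start = text.find(tok, cursor)
        if start == -1:
            raise ValueError(f"token {tok!r} not found after position {cursor}")
        end = start + len(tok)
        offsets.append((start, end))
        cursor = end
    return offsets


def token_spans_to_char_spans(
    spans: List[Span], token_offsets: List[Tuple[int, int]]
) -> List[Span]:
    """Convert (text, label, tok_start, tok_end) into character offsets.

    The token-level end index is exclusive, so the last token of a span
    is at index tok_end - 1.
    """
    result = []
    for surface, label, tok_start, tok_end in spans:
        char_start = token_offsets[tok_start][0]
        char_end = token_offsets[tok_end - 1][1]
        result.append((surface, label, char_start, char_end))
    return result


# Data copied from the example in this issue.
text = "My name is John Smith. I am 19 and a student in college."
tokens = ['My', 'name', 'is', 'John', 'Smith', '.', 'I', 'am', '19',
          'and', 'a', 'student', 'in', 'college', '.']
ner = [('John Smith', 'PERSON', 3, 5), ('19', 'DATE', 8, 9)]

token_offsets = align_tokens(text, tokens)
char_ner = token_spans_to_char_spans(ner, token_offsets)
print(char_ner)
```

The same conversion works for any token-indexed output with an exclusive end index, for example the 'srl' spans.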

hankcs / HanLP

Get the index of a token or a ner for example in the input text #1623