Prerequisites

Please fill in by replacing [ ] with [x].

[x] Are you running the latest bert-as-service?
[x] Did you follow the installation and the usage instructions in README.md?
[x] Did you check the FAQ list in README.md?
[x] Did you perform a cursory search on existing issues?

Hello, I would like to maintain an original-to-tokenized alignment as in BERT, like this:

    # `tokenization` is the module from the google-research/bert repository;
    # `vocab_file` must point to the vocabulary file of the chosen BERT model.
    import tokenization

    ### Input
    orig_tokens = ["John", "Johanson", "'s", "house"]
    labels = ["NNP", "NNP", "POS", "NN"]

    ### Output
    bert_tokens = []

    # Token map will be an int -> int mapping between the `orig_tokens` index and
    # the `bert_tokens` index.
    orig_to_tok_map = []

    tokenizer = tokenization.FullTokenizer(
        vocab_file=vocab_file, do_lower_case=True)

    bert_tokens.append("[CLS]")
    for orig_token in orig_tokens:
        # Record where the first word piece of this original token will land.
        orig_to_tok_map.append(len(bert_tokens))
        bert_tokens.extend(tokenizer.tokenize(orig_token))
    bert_tokens.append("[SEP]")

    # bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
    # orig_to_tok_map == [1, 2, 4, 6]

I need to create that `orig_to_tok_map` list. Is it possible with bert-as-service?
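
To make the expected result concrete, here is a minimal, self-contained sketch of the alignment I am after, computed on the client side. It does not use bert-as-service at all: `align_tokens` and `toy_tokenize` are hypothetical names of my own, and `toy_tokenize` is only a hard-coded stand-in for a real WordPiece tokenizer so the snippet runs without a vocabulary file.

```python
from typing import Callable, List, Tuple


def align_tokens(
    orig_tokens: List[str],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Return (bert_tokens, orig_to_tok_map) for one sentence.

    orig_to_tok_map[i] is the index in bert_tokens of the first word piece
    produced for orig_tokens[i].
    """
    bert_tokens = ["[CLS]"]
    orig_to_tok_map = []
    for orig_token in orig_tokens:
        # The next piece appended will be the first piece of this token.
        orig_to_tok_map.append(len(bert_tokens))
        bert_tokens.extend(tokenize(orig_token))
    bert_tokens.append("[SEP]")
    return bert_tokens, orig_to_tok_map


# Hypothetical stand-in for a WordPiece tokenizer: the splits are hard-coded
# to match the example above and are NOT produced by a real vocabulary.
_TOY_PIECES = {
    "John": ["john"],
    "Johanson": ["johan", "##son"],
    "'s": ["'", "s"],
    "house": ["house"],
}


def toy_tokenize(word: str) -> List[str]:
    return _TOY_PIECES.get(word, [word.lower()])


orig_tokens = ["John", "Johanson", "'s", "house"]
bert_tokens, orig_to_tok_map = align_tokens(orig_tokens, toy_tokenize)
print(bert_tokens)
print(orig_to_tok_map)
```

With a real tokenizer, `tokenizer.tokenize` would be passed in place of `toy_tokenize`. The map only matches the server's tokens if the same vocabulary and lower-casing settings are used on both sides.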

jina-ai / clip-as-service

how to maintain an original-to-tokenized alignment as in BERT with bert_as_service ? #513
