jina-ai / clip-as-service

🏄 Scalable embedding, reasoning, ranking for images and sentences with CLIP
https://clip-as-service.jina.ai

How to maintain an original-to-tokenized alignment as in BERT with bert_as_service? #513

Open AsmaZbt opened 4 years ago

AsmaZbt commented 4 years ago


Hello, I would like to maintain an original-to-tokenized alignment as in BERT, as in this example:

    # Input
    orig_tokens = ["John", "Johanson", "'s", "house"]
    labels = ["NNP", "NNP", "POS", "NN"]

    # Output
    bert_tokens = []

    # Token map will be an int -> int mapping between the orig_tokens index and
    # the bert_tokens index.
    orig_to_tok_map = []

    tokenizer = tokenization.FullTokenizer(
        vocab_file=vocab_file, do_lower_case=True)

    bert_tokens.append("[CLS]")
    for orig_token in orig_tokens:
        orig_to_tok_map.append(len(bert_tokens))
        bert_tokens.extend(tokenizer.tokenize(orig_token))
    bert_tokens.append("[SEP]")

    # bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
    # orig_to_tok_map == [1, 2, 4, 6]
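For context, a map like this is typically used to carry per-word labels over to the wordpiece sequence. Here is a minimal sketch of that step. The helper name `project_labels` and the `"X"` placeholder label for continuation pieces and special tokens are assumptions made for illustration; they do not come from BERT or bert_as_service.

```python
def project_labels(labels, orig_to_tok_map, num_bert_tokens, pad_label="X"):
    """Place each original label on the first wordpiece of its token.

    Every other position (continuation pieces, [CLS], [SEP]) receives
    pad_label, which is a hypothetical placeholder chosen for this sketch.
    """
    projected = [pad_label] * num_bert_tokens
    for label, tok_index in zip(labels, orig_to_tok_map):
        projected[tok_index] = label
    return projected


labels = ["NNP", "NNP", "POS", "NN"]
orig_to_tok_map = [1, 2, 4, 6]
bert_tokens = ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]

projected = project_labels(labels, orig_to_tok_map, len(bert_tokens))
print(list(zip(bert_tokens, projected)))
```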

I need to create that `orig_to_tok_map` list. Is it possible with bert_as_service?

tamuhey commented 4 years ago

@AsmaZbt Here is a library that solves your problem: https://github.com/tamuhey/tokenizations/tree/master/python
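If you already have both token lists on the client side, the map can also be rebuilt with a few lines of plain Python. The following is only a rough sketch and not the linked library's API: the function name `build_orig_to_tok_map` is hypothetical, and it assumes the tokenizer only lowercases the text and splits it into wordpieces marked with `##` (no accent stripping, no `[UNK]` tokens). It raises an error when that assumption does not hold.

```python
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def build_orig_to_tok_map(orig_tokens, bert_tokens, lowercase=True):
    """Return, for each original token, the index of its first wordpiece."""
    orig_to_tok_map = []
    i = 0  # current position in bert_tokens
    for orig in orig_tokens:
        target = orig.lower() if lowercase else orig
        # Skip special tokens, which belong to no original token.
        while i < len(bert_tokens) and bert_tokens[i] in SPECIAL_TOKENS:
            i += 1
        orig_to_tok_map.append(i)
        # Consume wordpieces until they spell out the original token.
        consumed = ""
        while i < len(bert_tokens) and len(consumed) < len(target):
            piece = bert_tokens[i]
            consumed += piece[2:] if piece.startswith("##") else piece
            i += 1
        if consumed != target:
            raise ValueError(
                f"cannot align {orig!r}: wordpieces spell {consumed!r}"
            )
    return orig_to_tok_map


orig_tokens = ["John", "Johanson", "'s", "house"]
bert_tokens = ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
orig_to_tok_map = build_orig_to_tok_map(orig_tokens, bert_tokens)
print(orig_to_tok_map)  # [1, 2, 4, 6]
```

A dedicated alignment library is the more robust choice when the tokenizer normalizes text in other ways, since this sketch simply fails in that case.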