Open AsmaZbt opened 4 years ago
Hello, I would like to maintain an original-to-tokenized alignment as in BERT, like here:
```python
# Input
orig_tokens = ["John", "Johanson", "'s", "house"]
labels = ["NNP", "NNP", "POS", "NN"]

# Output
bert_tokens = []

# Token map will be an int -> int mapping between the
# `orig_tokens` index and the `bert_tokens` index.
orig_to_tok_map = []

tokenizer = tokenization.FullTokenizer(
    vocab_file=vocab_file, do_lower_case=True)

bert_tokens.append("[CLS]")
for orig_token in orig_tokens:
    orig_to_tok_map.append(len(bert_tokens))
    bert_tokens.extend(tokenizer.tokenize(orig_token))
bert_tokens.append("[SEP]")

# bert_tokens == ["[CLS]", "john", "johan", "##son", "'", "s", "house", "[SEP]"]
# orig_to_tok_map == [1, 2, 4, 6]
```
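For what it's worth, the loop above can be run end to end without a vocab file by substituting a stand-in for `FullTokenizer.tokenize`. Here `toy_tokenize` is a hypothetical stand-in, hard-coded with the WordPiece splits for this example's tokens:

```python
# Minimal sketch of the original-to-tokenized alignment loop, with a
# hypothetical stand-in for BERT's WordPiece tokenizer.tokenize().
def toy_tokenize(token):
    pieces = {
        "John": ["john"],
        "Johanson": ["johan", "##son"],
        "'s": ["'", "s"],
        "house": ["house"],
    }
    return pieces[token]

orig_tokens = ["John", "Johanson", "'s", "house"]

bert_tokens = ["[CLS]"]
orig_to_tok_map = []
for orig_token in orig_tokens:
    # Record the bert_tokens index where this original word starts.
    orig_to_tok_map.append(len(bert_tokens))
    bert_tokens.extend(toy_tokenize(orig_token))
bert_tokens.append("[SEP]")

print(bert_tokens)      # ['[CLS]', 'john', 'johan', '##son', "'", 's', 'house', '[SEP]']
print(orig_to_tok_map)  # [1, 2, 4, 6]
```

Each entry of `orig_to_tok_map` is recorded *before* the word's pieces are appended, so it points at the first subword of that word (the `[CLS]` at index 0 shifts everything by one).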
I need to create that `orig_to_tok_map` list. Is it possible with bert-as-service?
@AsmaZbt Here is a library to solve your problem. https://github.com/tamuhey/tokenizations/tree/master/python