ratsgo / ratsnlp

tools for Natural Language Processing
MIT License

Extract ko-BERT vocabulary #2

Open ratsgo opened 4 years ago

ratsgo commented 4 years ago

Overview

Extract the binary Ko-BERT vocabulary into a text vocabulary file that the huggingface module can read.

ratsgo commented 4 years ago

code1

First method

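# Run in a shell first: pip install mxnet gluonnlp sentencepiece
import gluonnlp as nlp

# Load the binary SentencePiece model that ships with Ko-BERT.
vocab = nlp.vocab.BERTVocab.from_sentencepiece("/Users/david/Downloads/spiece", padding_token="[PAD]")
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[MASK]", "[PAD]"]
with open("vocab.txt", "w", encoding="utf-8") as f:
    for k in vocab.token_to_idx:
        if k.startswith('▁'):
            # Word-initial piece: drop the '▁' marker.
            k = k.replace('▁', '')
        elif k in special_tokens:
            # Special tokens are written unchanged.
            pass
        else:
            # Word-internal piece: add the WordPiece '##' prefix.
            k = '##' + k
        f.write(k + "\n")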
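The conversion rule in the loop can be pulled out into a small pure-Python helper, which makes it easy to check without installing mxnet or gluonnlp. This is only a sketch: `to_wordpiece`, `SPECIAL_TOKENS`, and `WORD_START` are names made up for illustration, and unlike the snippet above it strips only the leading '▁' rather than every occurrence.

```python
# Hypothetical helper names; not part of gluonnlp or huggingface.
SPECIAL_TOKENS = {"[UNK]", "[CLS]", "[SEP]", "[MASK]", "[PAD]"}
WORD_START = "\u2581"  # the '▁' marker SentencePiece puts on word-initial pieces


def to_wordpiece(token: str) -> str:
    """Map one SentencePiece token to a WordPiece-style vocabulary entry."""
    if token in SPECIAL_TOKENS:
        # Special tokens are kept as they are.
        return token
    if token.startswith(WORD_START):
        # Word-initial piece: drop the leading marker only.
        return token[len(WORD_START):]
    # Word-internal piece: mark it as a continuation.
    return "##" + token


if __name__ == "__main__":
    for tok in ["\u2581한국", "어", "[CLS]"]:
        print(repr(tok), "->", repr(to_wordpiece(tok)))
```

Note that a token consisting of just '▁' maps to an empty string under this rule, so it may be worth handling that case separately before writing the file.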
ratsgo commented 4 years ago

code2

Second method

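from gluonnlp.data import SentencepieceTokenizer

# Load the same SentencePiece model through the tokenizer wrapper.
sp = SentencepieceTokenizer("spiece")
special_tokens = ["[UNK]", "[CLS]", "[SEP]", "[MASK]", "[PAD]"]
with open("vocab2.txt", "w", encoding="utf-8") as f:
    for el in sp.tokens:
        if el.startswith('▁'):
            # Word-initial piece: drop the '▁' marker.
            el = el.replace('▁', '')
        elif el in special_tokens:
            # Special tokens are written unchanged.
            pass
        else:
            # Word-internal piece: add the WordPiece '##' prefix.
            el = '##' + el
        # Write one token per line, as in the first method.
        f.write(el + "\n")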
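
Since a line-per-token vocabulary file assigns ids by line position, empty or duplicate lines in the output could shift or collide ids when the file is loaded. The following is a minimal sanity check, assuming the file was written one token per line as above; `check_vocab` is a hypothetical helper name, not something from gluonnlp or huggingface.

```python
from collections import Counter


def check_vocab(path):
    """Return (indices of empty lines, sorted list of duplicated tokens)."""
    with open(path, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f]
    # An empty line appears if a token was reduced to "" during conversion.
    empty_lines = [i for i, t in enumerate(tokens) if t == ""]
    # Duplicates appear if two different source tokens map to the same entry.
    counts = Counter(tokens)
    duplicates = sorted(t for t, n in counts.items() if n > 1 and t != "")
    return empty_lines, duplicates


if __name__ == "__main__":
    empty, dups = check_vocab("vocab.txt")
    print("empty lines:", empty)
    print("duplicates:", dups)
```

If both lists come back empty, the line numbers in the file can be used directly as token ids.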