monologg / KoELECTRA-Pipeline

Transformers Pipeline with KoELECTRA
Apache License 2.0

Finetune POS #2

Open sachaarbonel opened 4 years ago

sachaarbonel commented 4 years ago

Hi @monologg, thanks for your great work! I was trying to play around with your model on Hugging Face, but I got this error:

```
Can't load config for 'monologg/koelectra-base-finetuned-naver-ner'. Make sure that:
- 'monologg/koelectra-base-finetuned-naver-ner' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'monologg/koelectra-base-finetuned-naver-ner' is the correct path to a directory containing a config.json file.
```

Also, I wanted to know if you would be willing to collaborate on a finetuned POS model? My understanding is that we need a CoNLL-U dataset such as UD_Korean-GSD and to clean it up to fit the format used by mrm8488 in his notebook. I started working on a tool to clean up such datasets, but I'm not sure the project is going in the right direction (I'm open to suggestions).

monologg commented 4 years ago

> Hi @monologg, thanks for your great work! I was trying to play around with your model on Hugging Face, but I got this error: `Can't load config for 'monologg/koelectra-base-finetuned-naver-ner'. Make sure that: - 'monologg/koelectra-base-finetuned-naver-ner' is a correct model identifier listed on 'https://huggingface.co/models' - or 'monologg/koelectra-base-finetuned-naver-ner' is the correct path to a directory containing a config.json file.`

Hi :)

Could you show the code that you ran? (Also, what version of the transformers library are you using?)

I've tried to reproduce the issue, but I can't :(

Below are the code and the console log I got.

I tried with both transformers==2.9.0 and transformers==3.0.2.

```python
from transformers import ElectraTokenizer, ElectraForTokenClassification
from ner_pipeline import NerPipeline
from pprint import pprint

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-finetuned-naver-ner")
model = ElectraForTokenClassification.from_pretrained("monologg/koelectra-base-finetuned-naver-ner")

ner = NerPipeline(model=model,
                  tokenizer=tokenizer,
                  ignore_labels=[],
                  ignore_special_tokens=True)

texts = [
    "문재인 대통령은 28일 서울 코엑스에서 열린 ‘데뷰 (Deview) 2019’ 행사에 참석해 젊은 개발자들을 격려하면서 우리 정부의 인공지능 기본구상을 내놓았다. 출처 : 미디어오늘 (http://www.mediatoday.co.kr)",
    "2017년 장점마을 문제가 본격적으로 이슈가 될 무렵 임 의원은 장점마을 민관협의회 위원들과 여러 차례 마을과 금강농산을 찾아갔다.",
    "2009년 7월 FC서울을 떠나 잉글랜드 프리미어리그 볼턴 원더러스로 이적한 이청용은 크리스탈 팰리스와 독일 분데스리가2 VfL 보훔을 거쳐 지난 3월 K리그로 컴백했다. 행선지는 서울이 아닌 울산이었다"
]

pprint(ner(texts))
```

```text
2020-07-31 16:58:05.421736: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-07-31 16:58:05.421798: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-07-31 16:58:05.421803: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Downloading: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 279k/279k [00:00<00:00, 334kB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51.0/51.0 [00:00<00:00, 37.9kB/s]
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.32k/2.32k [00:00<00:00, 1.19MB/s]
Downloading: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 441M/441M [00:33<00:00, 13.1MB/s]
[[{'entity': 'PER-B', 'score': 0.9999827742576599, 'word': '문재인'},
  {'entity': 'CVL-B', 'score': 0.999975323677063, 'word': '대통령은'},
  {'entity': 'DAT-B', 'score': 0.9999777674674988, 'word': '28일'},
  {'entity': 'LOC-B', 'score': 0.9998270273208618, 'word': '서울'},
  {'entity': 'AFW-B', 'score': 0.9999445676803589, 'word': '코엑스에서'},
  {'entity': 'O', 'score': 0.9999942183494568, 'word': '열린'},
  {'entity': 'EVT-B', 'score': 0.999899685382843, 'word': '‘데뷰'},
  {'entity': 'EVT-I', 'score': 0.9832423329353333, 'word': '(Deview)'},
  {'entity': 'EVT-I', 'score': 0.9992077946662903, 'word': '2019’'},
  {'entity': 'O', 'score': 0.9999943375587463, 'word': '행사에'},
  {'entity': 'O', 'score': 0.9999971985816956, 'word': '참석해'},
  {'entity': 'O', 'score': 0.9999986886978149, 'word': '젊은'},
  {'entity': 'CVL-B', 'score': 0.999980092048645, 'word': '개발자들을'},
  {'entity': 'O', 'score': 0.9999975562095642, 'word': '격려하면서'},
  {'entity': 'O', 'score': 0.9902618527412415, 'word': '우리'},
  {'entity': 'O', 'score': 0.9999471306800842, 'word': '정부의'},
  {'entity': 'TRM-B', 'score': 0.999969482421875, 'word': '인공지능'},
  {'entity': 'O', 'score': 0.999995768070221, 'word': '기본구상을'},
  {'entity': 'O', 'score': 0.9999942779541016, 'word': '내놓았다.'},
  {'entity': 'O', 'score': 0.9999710321426392, 'word': '출처'},
  {'entity': 'O', 'score': 0.9999939203262329, 'word': ':'},
  {'entity': 'ORG-B', 'score': 0.9999638199806213, 'word': '미디어오늘'},
  {'entity': 'TRM-B',
   'score': 0.8900758624076843,
   'word': '(http://www.mediatoday.co.kr)'}],
 [{'entity': 'DAT-B', 'score': 0.9966297745704651, 'word': '2017년'},
  {'entity': 'LOC-B', 'score': 0.8474875092506409, 'word': '장점마을'},
  {'entity': 'O', 'score': 0.9999969601631165, 'word': '문제가'},
  {'entity': 'O', 'score': 0.9999985098838806, 'word': '본격적으로'},
  {'entity': 'O', 'score': 0.9999987483024597, 'word': '이슈가'},
  {'entity': 'O', 'score': 0.9999971985816956, 'word': '될'},
  {'entity': 'O', 'score': 0.9999962449073792, 'word': '무렵'},
  {'entity': 'PER-B', 'score': 0.9948878288269043, 'word': '임'},
  {'entity': 'CVL-B', 'score': 0.9999330043792725, 'word': '의원은'},
  {'entity': 'O', 'score': 0.9958304166793823, 'word': '장점마을'},
  {'entity': 'ORG-B', 'score': 0.9879381656646729, 'word': '민관협의회'},
  {'entity': 'CVL-B', 'score': 0.9972733855247498, 'word': '위원들과'},
  {'entity': 'O', 'score': 0.9999987483024597, 'word': '여러'},
  {'entity': 'O', 'score': 0.9999983906745911, 'word': '차례'},
  {'entity': 'O', 'score': 0.9999876618385315, 'word': '마을과'},
  {'entity': 'LOC-B', 'score': 0.9730962514877319, 'word': '금강농산을'},
  {'entity': 'O', 'score': 0.9999989867210388, 'word': '찾아갔다.'}],
 [{'entity': 'DAT-B', 'score': 0.9999848008155823, 'word': '2009년'},
  {'entity': 'DAT-I', 'score': 0.9999673366546631, 'word': '7월'},
  {'entity': 'ORG-B', 'score': 0.9999908804893494, 'word': 'FC서울을'},
  {'entity': 'O', 'score': 0.9999977946281433, 'word': '떠나'},
  {'entity': 'LOC-B', 'score': 0.9999850392341614, 'word': '잉글랜드'},
  {'entity': 'ORG-B', 'score': 0.9999889135360718, 'word': '프리미어리그'},
  {'entity': 'ORG-B', 'score': 0.9999840259552002, 'word': '볼턴'},
  {'entity': 'ORG-I', 'score': 0.9999574422836304, 'word': '원더러스로'},
  {'entity': 'O', 'score': 0.999998927116394, 'word': '이적한'},
  {'entity': 'PER-B', 'score': 0.9999901056289673, 'word': '이청용은'},
  {'entity': 'ORG-B', 'score': 0.9999908208847046, 'word': '크리스탈'},
  {'entity': 'ORG-I', 'score': 0.9999381899833679, 'word': '팰리스와'},
  {'entity': 'LOC-B', 'score': 0.9999315142631531, 'word': '독일'},
  {'entity': 'ORG-B', 'score': 0.9999808073043823, 'word': '분데스리가2'},
  {'entity': 'ORG-B', 'score': 0.998707115650177, 'word': 'VfL'},
  {'entity': 'ORG-I', 'score': 0.9998491406440735, 'word': '보훔을'},
  {'entity': 'O', 'score': 0.9999962449073792, 'word': '거쳐'},
  {'entity': 'DAT-B', 'score': 0.9999901652336121, 'word': '지난'},
  {'entity': 'DAT-I', 'score': 0.9999746084213257, 'word': '3월'},
  {'entity': 'ORG-B', 'score': 0.99996018409729, 'word': 'K리그로'},
  {'entity': 'O', 'score': 0.9999971985816956, 'word': '컴백했다.'},
  {'entity': 'O', 'score': 0.9999988079071045, 'word': '행선지는'},
  {'entity': 'ORG-B', 'score': 0.9999780654907227, 'word': '서울이'},
  {'entity': 'O', 'score': 0.9999967813491821, 'word': '아닌'},
  {'entity': 'ORG-B', 'score': 0.998350203037262, 'word': '울산이었다'}]]
```
monologg commented 4 years ago

> Also, I wanted to know if you would be willing to collaborate on a finetuned POS model? My understanding is that we need a CoNLL-U dataset such as UD_Korean-GSD and to clean it up to fit the format used by mrm8488 in his notebook. I started working on a tool to clean up such datasets, but I'm not sure the project is going in the right direction (I'm open to suggestions).

It would be great if we could release a finetuned model for POS tagging. If the dataset is well prepared, it won't take long to make the finetuned model.
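To make the target format concrete, here is a minimal sketch of the cleanup step, assuming the standard 10-column CoNLL-U layout (FORM in column 2, UPOS in column 4) and the token-per-line format commonly used for token-classification finetuning. The file names are just examples:

```python
# Minimal CoNLL-U -> "token<TAB>tag" converter (sketch; file names are examples).
# CoNLL-U: comment lines start with '#', sentences are separated by blank lines,
# and each token line has 10 tab-separated fields (FORM is field 2, UPOS is field 4).
def convert_conllu(src_path, dst_path):
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            line = line.rstrip("\n")
            if not line:              # blank line = sentence boundary
                dst.write("\n")
                continue
            if line.startswith("#"):  # sent_id / text metadata
                continue
            fields = line.split("\t")
            token_id, form, upos = fields[0], fields[1], fields[3]
            # Skip multiword-token ranges ("1-2") and empty nodes ("1.1")
            if "-" in token_id or "." in token_id:
                continue
            dst.write(f"{form}\t{upos}\n")

convert_conllu("ko_gsd-ud-train.conllu", "train.txt")
```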

By the way, what do you mean by "I'm not sure the project is going in the right direction"? Can you elaborate a little bit more on that?

monologg commented 4 years ago

@sachaarbonel

I'll look at the dataset you shared and let you know how to use it to make a finetuned model :)

sachaarbonel commented 4 years ago

@monologg sorry for the delay. I didn't use your package locally, but through the Hugging Face hosted API.

About "the directions of the project", I principally meant if I was choosing the right dataset or if you knew any others / if it can be useful or not etc. I'm trying to create a general-purpose cleaning tool for the conllu format for generating datasets compatible with POS downstream tasks.

monologg commented 4 years ago

> @monologg sorry for the delay. I didn't use your package locally, but through the Hugging Face hosted API.

Sadly, this pipeline is only for local usage.
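If you just want something runnable on your own machine without this repo, the stock transformers pipeline should be able to load the checkpoint as well. A rough sketch (untested here; its subword grouping differs from this repo's NerPipeline):

```python
from transformers import pipeline

# Stock transformers NER pipeline as a local fallback; the output format
# (especially how subwords are grouped) differs from this repo's NerPipeline.
ner = pipeline(
    "ner",
    model="monologg/koelectra-base-finetuned-naver-ner",
    tokenizer="monologg/koelectra-base-finetuned-naver-ner",
)
print(ner("문재인 대통령은 28일 서울 코엑스에서 열린 행사에 참석했다."))
```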

About "the directions of the project", I principally meant if I was choosing the right dataset or if you knew any others / if it can be useful or not etc. I'm trying to create a general-purpose cleaning tool for the conllu format for generating datasets compatible with POS downstream tasks.

Also, I'm not familiar with the POS task, so I need some time to check. I'll look into it and let you know ASAP :)

sachaarbonel commented 3 years ago

Hi @monologg! I've been playing with your model "koelectra-base-v2-discriminator", but I get some tokens prefixed with "##" in the output (I don't know if that is normal behavior). Do you know how to get rid of them?

[Screenshot (2020-09-20): model output containing "##"-prefixed tokens]

Do I have to add some preprocessing code after this line?

https://github.com/huggingface/transformers/blob/6f289dc97aaa1ade5f658ecdd16cc7a842505444/examples/token-classification/utils_ner.py#L111

Something like this:

```python
...
tokens = tokenizer.tokenize(token)
clean_tokens = [remove_hashtag(token) for token in tokens]
```

with remove_hashtag being this function:

```python
def remove_hashtag(token):
    if "##" in token:
        return token.replace("##", "")
    else:
        return token
```
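Or maybe, instead of deleting the markers, the subword pieces could be merged back into words. A rough sketch (the merging logic is just an illustration, not code from this repo):

```python
from transformers import ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v2-discriminator")

tokens = tokenizer.tokenize("문재인 대통령은 28일")
# WordPiece marks subword continuations with "##", so stripping the marker
# in place loses the word boundaries; merging the pieces instead:
words = []
for tok in tokens:
    if tok.startswith("##") and words:
        words[-1] += tok[2:]  # glue the continuation onto the previous piece
    else:
        words.append(tok)
print(words)
```

If I read the transformers source right, `tokenizer.convert_tokens_to_string(tokens)` already does essentially this join for WordPiece vocabularies.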