ko-nlp / moducorpus-sanitizer

Provides tools for converting the Modu Corpus (모두의 말뭉치) data into a form convenient for analysis.
MIT License
11 stars 0 forks

[Corpus statistics] #7

Open lovit opened 4 years ago

lovit commented 4 years ago

News corpus

(code)

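from tqdm import tqdm

n_chars, n_eojeols = 0, 0

# total is the expected number of lines; it only drives the progress bar
with open('paragraph.txt', encoding='utf-8') as f:
    for line in tqdm(f, desc='scan', total=40807613):
        # characters, excluding spaces and the trailing newline
        n_chars += len(line.replace(' ', '').strip())
        # eojeols, i.e. whitespace-delimited units
        n_eojeols += len(line.strip().split())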

(Issue 1) Sentences are not segmented: within a paragraph, consecutive sentences run together with no space after the sentence-final punctuation.
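One possible workaround for Issue 1 is a regex heuristic, sketched below. The function name `split_glued_sentences` and the boundary rule are assumptions for illustration, not part of this repository: a boundary is taken to be a `.`, `?` or `!` that directly follows a Hangul syllable and is followed, with or without whitespace, by another Hangul syllable. Requiring Hangul on both sides leaves decimals such as `3.5` and Latin abbreviations alone, but sentences that start with a quote, digit, or Latin letter will be missed.

```python
import re

# Hypothetical boundary rule: Hangul syllable + [.?!], optional whitespace,
# then another Hangul syllable. The lookarounds keep the punctuation attached
# to the preceding sentence; only the whitespace (if any) is consumed.
_BOUNDARY = re.compile(r'(?<=[가-힣][.?!])\s*(?=[가-힣])')


def split_glued_sentences(paragraph):
    """Split a paragraph into sentences, including ones glued without a space."""
    # Splitting on a possibly empty match requires Python 3.7 or later.
    return [s.strip() for s in _BOUNDARY.split(paragraph) if s.strip()]


if __name__ == '__main__':
    for sent in split_glued_sentences('오늘은 날씨가 좋다.내일은 비가 온다.'):
        print(sent)
```

This is deliberately conservative; a dedicated Korean sentence splitter would likely handle quotes and ellipses better.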

lovit commented 4 years ago

Morphologically analyzed corpus

Tagset and tag counts
Tag Count (percentage)
NNG 1562168 (24.06%)
VV 403501 (6.214%)
EC 386549 (5.953%)
ETM 298970 (4.604%)
EF 258542 (3.982%)
JKB 256604 (3.952%)
JX 251864 (3.879%)
NNB 244696 (3.769%)
SF 228243 (3.515%)
NNP 224329 (3.455%)
SS 209491 (3.226%)
JKO 207376 (3.194%)
MAG 182313 (2.808%)
XSV 178325 (2.746%)
JKS 167515 (2.58%)
EP 157874 (2.431%)
VA 146731 (2.26%)
SN 140913 (2.17%)
XSN 107643 (1.658%)
IC 100695 (1.551%)
VX 100233 (1.544%)
VCP 95837 (1.476%)
JKG 86943 (1.339%)
NP 76386 (1.176%)
SP 68049 (1.048%)
MMD 53649 (0.8263%)
JC 32886 (0.5065%)
NR 32627 (0.5025%)
MAJ 31093 (0.4789%)
XSA 24422 (0.3761%)
ETN 23355 (0.3597%)
JKQ 20562 (0.3167%)
SL 20508 (0.3158%)
MMN 20204 (0.3112%)
SW 18773 (0.2891%)
XPN 16159 (0.2489%)
JKC 11436 (0.1761%)
VCN 10049 (0.1548%)
NA 8141 (0.1254%)
SH 7722 (0.1189%)
SO 7529 (0.116%)
MMA 5437 (0.08374%)
SE 4806 (0.07402%)
XR 560 (0.008625%)
NAP 555 (0.008548%)
NF 517 (0.007962%)
JKV 181 (0.002788%)
NV 89 (0.001371%)
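
A table in this shape can be produced with `collections.Counter`. The sketch below is not the script used for the numbers above; the helper name `tag_table` is hypothetical, and it assumes the tags have already been extracted from the corpus into a flat iterable of strings. The percentages above look like they were formatted to four significant digits, so the sketch uses the `.4g` format, which is also an assumption.

```python
from collections import Counter


def tag_table(tags):
    """Return 'TAG COUNT (PCT%)' rows, sorted by descending count."""
    counter = Counter(tags)
    total = sum(counter.values())
    rows = []
    for tag, count in counter.most_common():
        # .4g keeps four significant digits and drops trailing zeros
        rows.append(f'{tag} {count} ({100 * count / total:.4g}%)')
    return rows


if __name__ == '__main__':
    # Toy input for illustration only, not corpus data
    toy_tags = ['NNG', 'NNG', 'VV', 'NNG', 'EC', 'VV', 'NNG', 'JX']
    print('\n'.join(tag_table(toy_tags)))
```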