heehehe / melon-playlist-continuation

Melon playlist recommendation system (Melon Playlist Continuation)

[heehehe] Improve tag extraction using khaiii #5

Closed heehehe closed 3 years ago

heehehe commented 4 years ago

Extract tags from plylst_title

Extract additional tags from the existing tags


https://colab.research.google.com/drive/1NJI692y8ZgcQbETWyjBHJH1vUrwv9rRw?usp=sharing

Seong-Han commented 4 years ago
heehehe commented 4 years ago
heehehe commented 4 years ago

Morphological analysis of the train tags without preprocessing

722,860 morphemes in total. POS tags and their counts:

[('NNG', 441528), ('NNP', 55476), ('ETM', 37410), ('SL', 31299), ('VV', 24006), ('XSA', 23398), ('XR', 19608), ('MAG', 12875), ('JX', 10237), ('ETN', 8496), ('EC', 8372), ('VA', 8109), ('SN', 7700), ('XSN', 5605), ('JKB', 4754), ('NNB', 3702), ('SS', 3369), ('VCP', 3337), ('JKS', 2194), ('VX', 2095), ('XSV', 1921), ('NP', 1568), ('JKO', 989), ('JKG', 945), ('MM', 904), ('IC', 853), ('XPN', 664), ('EP', 530), ('JC', 401), ('NR', 193), ('SH', 99), ('SW', 66), ('MAJ', 41), ('VCN', 23), ('JKV', 21), ('ZZ', 16), ('SO', 14), ('EF', 12), ('SE', 11), ('JKC', 10), ('SP', 6), ('ZV', 3)]
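For reference, a minimal sketch of how a tally like the one above could be produced. It assumes the analyzer output has already been flattened into (surface form, POS tag) pairs; with khaiii these would presumably come from `morph.lex` and `morph.tag` on the results of `KhaiiiApi().analyze(text)`, but that step is left out here. The sample pairs and the `count_pos` helper are made up for illustration and are not from the Melon data.

```python
from collections import Counter

# Hypothetical sample: (surface form, POS tag) pairs as a morphological
# analyzer might emit them. These are NOT taken from the Melon dataset.
morphs = [
    ("여름", "NNG"),
    ("밤", "NNG"),
    ("신나", "VV"),
    ("는", "ETM"),
    ("pop", "SL"),
]


def count_pos(pairs):
    """Return (tag, count) pairs sorted by descending count."""
    return Counter(tag for _, tag in pairs).most_common()


pos_counts = count_pos(morphs)
total = sum(n for _, n in pos_counts)
print(f"{total} morphemes in total")
print(pos_counts)
```

`Counter.most_common()` returns a list of `(tag, count)` tuples, which is the same shape as the list pasted in this comment.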

Morphological analysis of the train tags after preprocessing (length-1 morphemes whose POS tag is 'ETM', 'ETN', 'XSA', 'JX', or 'VV' were removed)

623,244 morphemes in total. POS tags and their counts:

[('NNG', 441528), ('NNP', 55476), ('SL', 31299), ('XR', 19608), ('MAG', 12875), ('EC', 8372), ('VA', 8109), ('SN', 7700), ('XSN', 5605), ('JKB', 4754), ('NNB', 3702), ('VV', 3687), ('SS', 3369), ('VCP', 3337), ('JKS', 2194), ('VX', 2095), ('XSV', 1921), ('NP', 1568), ('JKO', 989), ('JKG', 945), ('MM', 904), ('IC', 853), ('XPN', 664), ('EP', 530), ('JC', 401), ('NR', 193), ('JX', 138), ('SH', 99), ('XSA', 89), ('SW', 66), ('MAJ', 41), ('VCN', 23), ('JKV', 21), ('ETM', 17), ('ZZ', 16), ('SO', 14), ('EF', 12), ('SE', 11), ('JKC', 10), ('SP', 6), ('ZV', 3)]
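The filtering rule described above (drop a morpheme only when it is a single character and its tag is one of the five listed) could be sketched roughly as follows. This is an illustration under assumptions, not the notebook's actual code: the input is assumed to be (surface form, POS tag) pairs, and `REMOVED_TAGS`, `drop_short_morphs`, and the sample data are hypothetical names and values.

```python
# POS tags named in the comment above; a morpheme is dropped only when
# its tag is in this set AND its surface form is a single character.
REMOVED_TAGS = {"ETM", "ETN", "XSA", "JX", "VV"}


def drop_short_morphs(pairs, tags=REMOVED_TAGS):
    """Keep every (lex, tag) pair except length-1 morphemes with a listed tag."""
    return [
        (lex, tag)
        for lex, tag in pairs
        if not (tag in tags and len(lex) == 1)
    ]


# Hypothetical sample, NOT taken from the Melon dataset.
sample = [
    ("신나", "VV"),   # listed tag, but two characters: kept
    ("는", "ETM"),    # listed tag, one character: dropped
    ("하", "VV"),     # listed tag, one character: dropped
    ("기", "ETN"),    # listed tag, one character: dropped
    ("여름", "NNG"),  # tag not listed: kept
    ("밤", "NNG"),    # one character, but tag not listed: kept
]

filtered = drop_short_morphs(sample)
print(filtered)
```

Because the rule checks both the tag and the length, multi-character morphemes with a listed tag survive, which matches the small residual counts for 'VV', 'JX', 'XSA', and 'ETM' in the list above.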

heehehe commented 4 years ago

Decide whether to analyze the most frequent morphemes further --> make that call while running morphological analysis on the title field in train

How tags are extracted from title