twitter / twitter-korean-text

Korean tokenizer
Apache License 2.0

OOM Issue #109

Open gbadiali opened 7 years ago

gbadiali commented 7 years ago

The following strings take more than 500ms to be tokenized:

1) "@Bam_cos0118 실물을 봤으닊 이러죠!!!!!!!!!꺄윽ㅇㅇ썈꺅!!!!!!!!!!!!!밤님 우주최강지구최강중국최강한국최강그리스최강호주최강미국최강북한최강ㅇ일본최강홍콩최강대만최강마카오최강아프리카최강우즈베키스탄최강!!!!존예!!!존귀!!!시라구!!!"
2) "한국일보 6월3일자 만평 https://t.co/nnZCJovw0w"

We also run into OOM errors when tokenizing many of these in a row:

java.lang.OutOfMemoryError: GC overhead limit exceeded
VM error: GC overhead limit exceeded
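For anyone trying to reproduce the slow inputs, a minimal timing sketch might look like the following. `TokenizeTimer` and `slowInputs` are hypothetical names, not part of twitter-korean-text, and the tokenizer in `main` is a plain whitespace split used only as a stand-in; the library's own tokenize call would need to be swapped in to observe the behavior reported here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration; not part of twitter-korean-text.
class TokenizeTimer {
    // Returns the inputs whose tokenization took longer than thresholdMillis.
    static List<String> slowInputs(List<String> inputs,
                                   Function<String, ?> tokenizer,
                                   long thresholdMillis) {
        List<String> slow = new ArrayList<>();
        for (String input : inputs) {
            long start = System.nanoTime();
            tokenizer.apply(input); // result is discarded, only the time matters
            long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
            if (elapsedMillis > thresholdMillis) {
                slow.add(input);
            }
        }
        return slow;
    }

    public static void main(String[] args) {
        // Stand-in tokenizer (assumption): replace with the real library call.
        Function<String, List<String>> standIn = s -> Arrays.asList(s.split("\\s+"));
        List<String> inputs = Arrays.asList(
            "한국일보 6월3일자 만평 https://t.co/nnZCJovw0w");
        // Prints the inputs that exceeded the 500 ms threshold from the report.
        System.out.println(slowInputs(inputs, standIn, 500));
    }
}
```

Timing each input separately makes it easier to tell a single pathological string from memory pressure that builds up over many calls, which is closer to the OOM case.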

hohyon-ryu commented 7 years ago

Hi @gbadiali, thanks for reporting the issue. I have lost access to this repo, so I cannot merge PRs or publish new changes. It is also quite complicated to apply updates from this repo to Penguin, as Penguin needs to support two separate versions. I can help if you want to fix it on your side, or if anyone at Twitter wants to own the fix.