segment-any-text / wtpsplit

Toolkit to segment text into sentences or other semantic units in a robust, efficient and adaptable way.
MIT License

Korean text is not split well #133

Closed: seungduk-yanolja closed this issue 2 weeks ago

seungduk-yanolja commented 1 month ago

Hello,

First of all, thank you for the great work! I was excited to try out this powerful text segmentation model, so I tested it with both an English text and a translated Korean text. However, I encountered an issue where a large chunk of the Korean text was considered a single sentence. I tried another sample, but once again, the entire text was returned as a single sentence. Could you please help me figure out what I might be doing wrong?

Thank you in advance.

Code

from wtpsplit import SaT

# onnxruntime GPU
model_ort = SaT("sat-12l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
%timeit list(model_ort.split("This is a test This is another test."))

# `text` holds the English document and `korean` its Korean translation (defined elsewhere)
english_split = model_ort.split(text)
korean_split = model_ort.split(korean)

print(len(english_split))
print(len(korean_split))

for en, ko in zip(english_split, korean_split):
    print("English", en)
    print("Korean", ko)
    print("==============")

Result

English wtpsplit🪓
Segment any Text - Robustly, Efficiently, Adaptably⚡

Korean wtpsplit🪓
Text Segmentation - Robustly, Efficiently, and Adaptably⚡

==============
English This repository allows you to segment text into sentences or other semantic units. 
Korean Using this repository, you can segment text into sentences or other semantic units. 
==============
English It implements the models from:

Korean It implements the following models.

==============
English SaT — Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić and Markus Schedl (state-of-the-art, encouraged).

Korean SaT — Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation, by Markus Frohmann, Igor Sterner, Benjamin Minixhofer, Ivan Vulić, and Markus Schedl (state-of-the-art, recommended).

==============
English WtP — Where's the Point? 
Korean WtP — Where's the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation, by Benjamin Minixhofer, Jonas Pfeiffer, and Ivan Vulić (previous version, maintained for reproducibility).

==============
English Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation by Benjamin Minixhofer, Jonas Pfeiffer and Ivan Vulić (previous version, maintained for reproducibility).

Korean The name WtP is kept for consistency. 
==============
English The namesake WtP is maintained for consistency. 
Korean Our new follow-up, SaT, provides robust, efficient, and adaptable sentence segmentation across 85 languages at higher performance and lower compute cost. Check out the state-of-the-art results on 8 distinct corpora and 85 languages demonstrated in the Segment any Text paper.

==============
English Our new followup SaT provides robust, efficient and adaptable sentence segmentation across 85 languages at higher performance and less compute cost. 
Korean System Figure

Installation
pip install wtpsplit
Usage
from wtpsplit import SaT

sat = SaT("sat-3l")
# optionally run on GPU for better performance
# also supports TPUs via e.g. sat.to("xla:0"); in that case, pass `pad_last_batch=True` to sat.split
sat.half().to("cuda")

sat.split("This is a test This is another test.")
# returns ["This is a test ", "This is another test."]

# do this instead of calling sat.split on every text individually for much better performance
sat.split(["This is a test This is another test.", "And some more texts..."])
# returns an iterator yielding a list of sentences for every text

# use the '-sm' models for general sentence segmentation tasks
sat_sm = SaT("sat-3l-sm")
sat_sm.half().to("cuda") # optional, see above
sat_sm.split("this is a test this is another test")
# returns ["this is a test ", "this is another test"]

# use trained lora modules for strong adaptation to language & domain/style
sat_adapted = SaT("sat-3l", style_or_domain="ud", language="en")
sat_adapted.half().to("cuda") # optional, see above
sat_adapted.split("This is a test This is another test.")
# returns ['This is a test ', 'This is another test']
ONNX Support
🚀 You can now enable even faster ONNX inference for sat and sat-sm models! 🚀

sat = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> from wtpsplit import SaT
>>> texts = ["This is a sentence. This is another sentence."] * 1000

# PyTorch GPU
>>> model_pytorch = SaT("sat-3l-sm")
>>> model_pytorch.half().to("cuda");
>>> %timeit list(model_pytorch.split(texts))
# 144 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# quite fast already, but...

# onnxruntime GPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(texts))
# 94.9 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 
==============
English Check out the state-of-the-art results in 8 distinct corpora and 85 languages demonstrated in our Segment any Text paper.

Korean ...this should be ~50% faster! (tested on an RTX 3090)

==============
English System Figure

Korean To use LoRA in combination with an ONNX model:

Run scripts/export_to_onnx_sat.py with use_lora: True and an appropriate output_dir: <OUTPUT_DIR>.

==============
English Installation

Korean If you have a local LoRA module, use lora_path.

==============
English pip install wtpsplit

Korean To load a LoRA module from the HuggingFace hub, use style_or_domain and language.

==============
English Usage
from wtpsplit import SaT

Korean Load the ONNX model with merged LoRA weights: sat = SaT(<OUTPUT_DIR>, onnx_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

==============
English sat = SaT("sat-3l")
# optionally run on GPU for better performance
# also supports TPUs via e.g. sat.to("xla:0"), in that case pass `pad_last_batch=True` to sat.split
sat.half().to("cuda")

Korean Available Models

==============
English sat.split("This is a test This is another test.")
# returns ["This is a test ", "This is another test."]

# do this instead of calling sat.split on every text individually for much better performance

Korean If you need a general sentence segmentation model, use the -sm models (e.g., sat-3l-sm). 
==============
English sat.split(["This is a test This is another test.", "And some more texts..."])
# returns an iterator yielding lists of sentences for every text

# use our '-sm' models for general sentence segmentation tasks

Korean For speed-sensitive applications, we recommend the 3-layer models (sat-3l and sat-3l-sm). 
==============
English sat_sm = SaT("sat-3l-sm")
sat_sm.half().to("cuda") # optional, see above

Korean They provide a great tradeoff between speed and performance. 
==============
English sat_sm.split("this is a test this is another test")
# returns ["this is a test ", "this is another test"]

# use trained lora modules for strong adaptation to language & domain/style

Korean The best models are the 12-layer models: sat-12l and sat-12l-sm.

==============
English sat_adapted = SaT("sat-3l", style_or_domain="ud", language="en")

Korean Model   English Score   Multilingual Score
sat-1l  88.5    84.3
sat-1l-sm   88.2    87.9
sat-3l  93.7    89.2
sat-3l-lora 96.7    94.8
sat-3l-sm   96.5    93.5
sat-6l  94.1    89.7
sat-6l-sm   96.9    95.1
sat-9l  94.3    90.3
sat-12l 94.0    90.4
sat-12l-lora    97.3    95.9
sat-12l-sm  97.4    96.0
The scores are macro-average F1 across all available datasets for "English", and macro-average F1 across all datasets and languages for "Multilingual". 
==============
English sat_adapted.half().to("cuda") # optional, see above
sat_adapted.split("This is a test This is another test.")
# returns ['This is a test ', 'This is another test']

Korean "adapted" means adaptation via LoRA. 
==============
English ONNX Support

Korean See the paper for details.

==============
English 🚀 You can now enable even faster ONNX inference for sat and sat-sm models! 🚀

Korean For comparison, here are the English scores of some other tools:

==============
English sat = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> from wtpsplit import SaT
>>> texts = ["This is a sentence. This is another sentence."] * 1000

# PyTorch GPU
>>> model_pytorch = SaT("sat-3l-sm")
>>> model_pytorch.half().to("cuda");
>>> %timeit list(model_pytorch.split(texts))
# 
Korean Model   English Score
PySBD   69.6
SpaCy (sentencizer; monolingual)    92.9
SpaCy (sentencizer; multilingual)   91.5
Ersatz  91.4
Punkt (nltk.sent_tokenize)  92.2

==============
English 144 ms ± 252 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# quite fast already, but...

# onnxruntime GPU
>>> model_ort = SaT("sat-3l-sm", ort_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
>>> %timeit list(model_ort.split(texts))
# 
Korean WtP (3l) 93.9

==============
English 94.9 ms ± 165 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# 
Korean Note that this library also supports the previous WtP models. 
==============
English ...this should be ~50% faster! (tested on RTX 3090)

Korean You can use them in essentially the same way as the SaT models:

==============
English If you wish to use LoRA in combination with an ONNX model:

Korean from wtpsplit import WtP

wtp = WtP("wtp-bert-mini")
# similar functionality as for the SaT models
wtp.split("This is a test This is another test.")
For more details on WtP and reproduction details, see the WtP doc.

==============
English Run scripts/export_to_onnx_sat.py with use_lora: True and an appropriate output_dir: <OUTPUT_DIR>.

Korean Paragraph Segmentation
Since SaT models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.

==============
English If you have a local LoRA module, use lora_path.

Korean # returns a list of paragraphs, each containing a list of sentences

==============
English If you wish to load a LoRA module from the HuggingFace hub, use style_or_domain and language.

Korean # adjust the paragraph threshold via the `paragraph_threshold` argument

==============
English Load the ONNX model with merged LoRA weights: 
Korean sat.split(text, do_paragraph_segmentation=True)
Adaptation
SaT can be domain- and style-adapted via LoRA. 
==============
English sat = SaT(<OUTPUT_DIR>, onnx_providers=["CUDAExecutionProvider", "CPUExecutionProvider"])

Korean We provide LoRA modules trained on the Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speech) sentence styles in 81 languages for sat-3l and sat-12l. 
==============
English Available Models

Korean We also provide LoRA modules for legal documents (laws and judgments) in 6 languages, code-switching in 4 language pairs, and tweets in 3 languages. 
==============
English If you need a general sentence segmentation model, use -sm models (e.g., sat-3l-sm) 
Korean See the paper for details.

==============
English For speed-sensitive applications, we recommend 3-layer models (sat-3l and sat-3l-sm). 
Korean In addition, we provide verse segmentation modules for 16 genres for sat-12-no-limited-lookahead.

==============
English They provide a great tradeoff between speed and performance. 
Korean Load LoRA modules like this:

==============
English The best models are our 12-layer models: sat-12l and sat-12l-sm.

Korean # requires both lang_code and style_or_domain

==============
English Model   
Korean # check the <model_repository>/loras folder to see the available modules

==============
English English Score   
Korean sat_lora = SaT("sat-3l", style_or_domain="ud", language="en")

==============
English Multilingual Score

Korean sat_lora.split("Hello this is a test But this is different now Now the next one starts looool")

==============
English sat-1l  88.5    84.3

Korean # now for a very different domain
sat_lora_distinct = SaT("sat-12l", style_or_domain="code-switching", language="es-en")

==============
English sat-1l-sm   88.2    87.9

Korean sat_lora_distinct.split("in the morning over there cada vez que yo decía algo él me decía algo")
You can also freely adjust the segmentation threshold. 
==============
English sat-3l  93.7    89.2

Korean A higher threshold leads to more conservative segmentation.

==============
English sat-3l-lora 96.7    94.8

Korean sat.split("This is a test This is another test.", threshold=0.4)

==============
English sat-3l-sm   96.5    93.5

Korean # works similarly for lora, but the thresholds are higher

==============
English sat-6l  94.1    89.7

Korean sat_lora.split("Hello this is a test But this is different now Now the next one starts looool", threshold=0.7)

==============
English sat-6l-sm   96.9    95.1

Korean Advanced Usage
Get the newline or sentence boundary probabilities of a text:

==============
English sat-9l  94.3    90.3

Korean # returns newline probabilities (supports batching!)

==============
English sat-12l 94.0    90.4

Korean sat.predict_proba(text)

==============
English sat-12l-lora    97.3    95.9

Korean Load a SaT model in HuggingFace transformers:

==============
English sat-12l-sm  97.4    96.0

Korean # import the library to register the custom models

==============
English The scores are macro-average F1 score across all available datasets for "English", and macro-average F1 score across all datasets and languages for "Multilingual". 
Korean import wtpsplit
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained("segment-any-text/sat-3l-sm") # or some other model name; see https://huggingface.co/segment-any-text

==============
English "adapted" means adapation via LoRA; check out the paper for details.

Korean Adapting to your own corpus via LoRA
Our models can be efficiently adapted via LoRA in a powerful way. 
==============
English For comparison, here the English scores of some other tools:

Korean As few as 10-100 segmented training sentences already improve performance considerably. 
==============
English Model   English Score

Korean To do this:

Clone the repository and install the requirements:

==============
English PySBD   69.6

Korean git clone https://github.com/segment-any-text/wtpsplit
cd wtpsplit
pip install -r requirements.txt
pip install adapters==0.2.1 --no-dependencies
cd ..

==============
English SpaCy (sentencizer; monolingual)    92.9

Korean Create data in the following format:

==============
English SpaCy (sentencizer; multilingual)   91.5

Korean import torch

torch.save(
    {
        "language_code": {
            "sentence": {
                "dummy-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ]
                }
            }
        }
    },
    "dummy-dataset.pth"
)
Create or adapt a config; provide the base model via model_name_or_path and the training data .pth via text_path:

configs/lora/lora_dummy_config.json

Train the LoRA module:

python3 wtpsplit/train/train_lora.py configs/lora/lora_dummy_config.json
Once training is done, provide the path to the saved module to SaT:

sat_lora_adapted = SaT("model-used", lora_path="dummy_lora_path")
sat_lora_adapted.split("Some domain-specific or styled text")
Adjust the dataset name, language, and model in the above as needed.

==============
English Ersatz  91.4

Korean Reproducing the paper
configs/ contains the configs for the paper's runs for the base and sm models as well as the LoRA modules. 
==============
English Punkt (nltk.sent_tokenize)  92.2

Korean Launch the training for each config like this:

python3 wtpsplit/train/train.py configs/<config_name>.json
python3 wtpsplit/train/train_sm.py configs/<config_name>.json
python3 wtpsplit/train/train_lora.py configs/<config_name>.json
In addition:

wtpsplit/data_acquisition contains the code for obtaining evaluation data and raw text from the mC4 corpus.

==============
English WtP (3l)    93.9

Korean wtpsplit/evaluation contains the code for:
evaluation via intrinsic.py (i.e., sentence segmentation results).
short-sequence evaluation via intrinsic_pairwise.py (i.e., sentence segmentation results on sentence pairs/k-mers).
LLM baseline evaluation (llm_sentence.py) and legal baseline evaluation (legal_baselines.py).
baseline (PySBD, nltk, etc.) evaluation results in intrinsic_baselines.py and intrinsic_baselines_multi.py.

==============
English Note that this library also supports previous WtP models. 
Korean Raw results in JSON format are also in evaluation_results/.

==============
English You can use them in essentially the same way as SaT models:

Korean Statistical significance testing code and results are in stat_tests/.

==============
English from wtpsplit import WtP

Korean punctuation annotation experiments in punct_annotation.py and punct_annotation_wtp.py (WtP only)
extrinsic evaluation on machine translation in extrinsic.py (WtP only)

==============
English wtp = WtP("wtp-bert-mini")

Korean Install the packages from requirements.txt beforehand.

==============
English # similar functionality as for SaT models

Korean Supported Languages
Table with the supported languages

==============
English wtp.split("This is a test 
Korean For details, see the Segment any Text paper.

==============
English This is another test.")

Korean Citation
For SaT models, please cite our paper:

==============
English For more details on WtP and reproduction details, see the WtP doc.

Korean @article{frohmann2024segment,
    title={Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation},
    author={Frohmann, Markus and Sterner, Igor and Vuli{\'c}, Ivan and Minixhofer, Benjamin and Schedl, Markus},
    journal={arXiv preprint arXiv:2406.16678},
    year={2024},
    doi={10.48550/arXiv.2406.16678},
    url={https://doi.org/10.48550/arXiv.2406.16678},
}
For the library and the WtP models, please cite:

==============
English Paragraph Segmentation

Korean @inproceedings{minixhofer-etal-2023-wheres,
    title = "Where{'}s the Point? Self-Supervised Multilingual Punctuation-Agnostic Sentence Segmentation",
    author = "Minixhofer, Benjamin  and
      Pfeiffer, Jonas  and
      Vuli{\'c}, Ivan",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.398",
    pages = "7215--7235"
}

==============
English Since SaT models are trained to predict newline probability, they can segment text into paragraphs in addition to sentences.

Korean Acknowledgments

==============
English # returns a list of paragraphs, each containing a list of sentences

Korean This research was funded in whole or in part by the Austrian Science Fund (FWF): P36413, P33526, and DFH-23, and by the State of Upper Austria and the Federal Ministry of Education, Science, and Research through grant LIT-2021-YOU-215. 
==============
English # adjust the paragraph threshold via the `paragraph_threshold` argument.

Korean In addition, Ivan Vulić and Benjamin Minixhofer were supported through the Royal Society University Research Fellowship 'Inclusive and Sustainable Language Technology for a Truly Multilingual World' (no. 221137) awarded to Ivan Vulić. 
==============
English sat.split(text, do_paragraph_segmentation=True)

Korean This research was also supported with Cloud TPUs from Google's TPU Research Cloud (TRC). 
==============
English Adaptation

Korean This work was also supported by compute credits from a Cohere For AI Research Grant; these grants are designed to support academic partners conducting research with the goal of releasing scientific artifacts and data for good projects. 
==============
English SaT can be domain- and style-adapted via LoRA. 
Korean We also thank Simone Teufel for fruitful discussions.

==============
English We provide trained LoRA modules for Universal Dependencies, OPUS100, Ersatz, and TED (i.e., ASR-style transcribed speeches) sentence styles in 81 languages for sat-3l and sat-12l. 
Korean If you have any questions, please create an issue or send an email to markus.frohmann@gmail.com, and I will get back to you as soon as possible.
==============
seungduk-yanolja commented 1 month ago

I replaced all the newlines with spaces and it seems to work.
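
For anyone skimming later, the workaround presumably amounts to something like this minimal sketch (reusing `korean` and `model_ort` from the snippet above):

korean_flat = korean.replace("\n", " ")  # drop the artificial line breaks
korean_split = model_ort.split(korean_flat)
print(len(korean_split))  # should now be well above 1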

bminixhofer commented 1 month ago

Thanks for raising this and finding the issue. This seems related to #131 .
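
For anyone debugging this, the character-level boundary probabilities from predict_proba (documented in the README excerpt above) offer one way to see how the model treats explicit newlines; a minimal sketch, with a made-up Korean sample and an illustrative cutoff:

from wtpsplit import SaT

sat = SaT("sat-3l-sm")

sample = "첫 번째 문장입니다\n두 번째 문장입니다"  # hypothetical two-sentence sample joined by a newline
probs = sat.predict_proba(sample)  # one boundary probability per character

# print the characters the model considers likely sentence boundaries
for ch, p in zip(sample, probs):
    if p > 0.1:  # illustrative cutoff, not the library default
        print(repr(ch), round(float(p), 3))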

markus583 commented 1 month ago

Hi, could you clarify what you mean by "it seems to work" now? As per #131, we find this rather surprising.

Additionally, I think it would be good to try this on more natural Korean text, not this semi-translated documentation with many newlines (which is a bit unrealistic, no?)
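
A check along those lines might look like the following minimal sketch (the Korean sample is made up: three natural sentences with no punctuation or newlines):

from wtpsplit import SaT

sat = SaT("sat-12l-sm")

natural_korean = "오늘은 날씨가 정말 좋습니다 그래서 공원에 산책을 갔습니다 내일도 맑으면 좋겠습니다"
print(sat.split(natural_korean))  # ideally three separate sentences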

markus583 commented 3 weeks ago

Hi, would you happen to have any update on this @seungduk-yanolja? :)