jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

.cha file word segmentation #29

Closed: tnwh6921 closed this issue 2 years ago

tnwh6921 commented 2 years ago

Hello, may I please know whether it is possible to word-segment a .cha file or, better still, a ZIP archive containing .cha files? Thank you very much!

jacksonllee commented 2 years ago

Hello! If you have CHAT data with unsegmented Cantonese text, you can read it in as custom data (a ZIP file, a directory of .cha files, or a single .cha file) and then loop through the utterances, as demonstrated in this tutorial. Each utterance should contain the unsegmented Cantonese text string, and you can apply PyCantonese functions such as segment to it.
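
A minimal sketch of the loop described above, not tested against real data. It assumes PyCantonese v3, where `pycantonese.read_chat` accepts a ZIP file, a directory, or a single .cha file, and where `words(by_utterances=True)` returns one list of tokens per utterance. The names `segment_utterances` and `segment_chat` and the path `data.zip` are hypothetical and only for illustration.

```python
from typing import Callable, Iterable, List


def segment_utterances(
    utterances: Iterable[List[str]],
    segmenter: Callable[[str], List[str]],
) -> List[List[str]]:
    """Re-join each utterance's tokens into one string, then segment it."""
    return [segmenter("".join(tokens)) for tokens in utterances]


def segment_chat(path: str) -> List[List[str]]:
    """Hypothetical helper: segment every utterance found at `path`."""
    # Imported here so that segment_utterances itself has no dependencies.
    import pycantonese

    # Assumption: `path` points to a ZIP file, a directory, or one .cha file.
    corpus = pycantonese.read_chat(path)
    return segment_utterances(
        corpus.words(by_utterances=True), pycantonese.segment
    )
```

Calling `segment_chat("data.zip")` would then give one list of segmented words per utterance. Punctuation tokens from the CHAT transcription get joined back into the string too, so you may want to filter them out first.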

(Relatedly, I'm working on a general parsing function that takes Cantonese text data -- please see #30.)

tnwh6921 commented 2 years ago

Thank you for your reply.

I am very excited to hear about the new function! May I please confirm whether, with the new function, the input would still have to be individual utterances rather than a .cha file or ZIP archive?

Thank you again!

jacksonllee commented 2 years ago

With the new parse_text function, you can keep your Cantonese text data in a plain text file (.txt is perfectly fine, and no CHAT formatting is needed), then read in the text file and pass the text string to parse_text. I haven't tested it yet, but I'd imagine something like the following:

import pycantonese

# Suppose you have data.txt with your Cantonese text (assumed to be UTF-8 encoded).
with open("data.txt", encoding="utf-8") as f:
    # `f` is a file object for a plain text file,
    # and so the .read() call in the next line gives you the entire file's text as a string.
    corpus = pycantonese.parse_text(f.read())
    # Then do whatever you'd like with the `corpus` object.

In this hypothetical code snippet, because f.read() is a string, parse_text would attempt simple utterance-level segmentation by the punctuation marks {",", "!", "。"} as well as the EOL character "\n". So this is the case of "input 1: a plain string" described in https://github.com/jacksonllee/pycantonese/issues/30#issue-1004419242. If you'd like more control over what counts as an utterance or not, then you'd have to do your own munging to pass in a list of strings (= the case of "input 2: a list of strings" in https://github.com/jacksonllee/pycantonese/issues/30#issue-1004419242).
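
For the "list of strings" route, here is a sketch of what that munging could look like. The delimiter set (full-width 。！？ plus line breaks) is only an example choice, and `split_utterances` and `parse_file` are hypothetical helper names, not part of PyCantonese. The only library call is `parse_text` with a list of strings, as described in #30.

```python
import re
from typing import List


def split_utterances(text: str, delimiters: str = "。！？") -> List[str]:
    """Split text into utterances at line breaks and after any delimiter.

    Each utterance keeps its trailing delimiter(s).
    """
    cls = re.escape(delimiters)
    # One utterance = a run of non-delimiters plus any delimiters that follow.
    pattern = re.compile(f"[^{cls}]+[{cls}]*")
    utterances = []
    for line in text.splitlines():
        for chunk in pattern.findall(line):
            chunk = chunk.strip()
            if chunk:
                utterances.append(chunk)
    return utterances


def parse_file(path: str):
    """Hypothetical helper: read a UTF-8 text file and parse it by utterance."""
    # Imported here so that split_utterances itself has no dependencies.
    import pycantonese

    with open(path, encoding="utf-8") as f:
        # Passing a list of strings means each string is one utterance.
        return pycantonese.parse_text(split_utterances(f.read()))
```

Because the splitting happens before `parse_text` is called, you can change `delimiters` (or replace `split_utterances` entirely) without touching the parsing step.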

tnwh6921 commented 2 years ago

I see. Thank you very much!

jacksonllee commented 2 years ago

The new parse_text function has just been released alongside v3.4.0. More docs here.