Hello! If you have CHAT data with unsegmented Cantonese text, you can iterate through its utterances: read in your custom data (a ZIP file, a directory of .cha files, or a single .cha file), then loop through the utterances as demonstrated in this tutorial. Each utterance should contain the unsegmented Cantonese text string, to which you can apply PyCantonese functions such as segment.
(Relatedly, I'm working on a general parsing function that takes Cantonese text data -- please see #30.)
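As a rough sketch of the loop-over-utterances approach described above (untested against real data): the helper names segment_utterances and unsegmented_utterances are made up for illustration, and the second helper assumes that joining the "words" the CHAT reader finds in each utterance recovers the original unsegmented string, which may not hold for every transcript.

```python
from typing import Callable, Iterable, List, Optional


def segment_utterances(
    utterances: Iterable[str],
    segment: Optional[Callable[[str], List[str]]] = None,
) -> List[List[str]]:
    """Word-segment each unsegmented utterance string."""
    if segment is None:
        # Imported lazily so that a different segmenter can be swapped in.
        import pycantonese

        segment = pycantonese.segment
    return [segment(text) for text in utterances]


def unsegmented_utterances(path: str) -> List[str]:
    """Read CHAT data and return one unsegmented string per utterance.

    Assumption: each utterance's main tier holds raw Cantonese text, so
    joining whatever "words" the reader found gives back that text.
    """
    import pycantonese

    # `path` can be a ZIP file, a directory of .cha files, or a single .cha file.
    corpus = pycantonese.read_chat(path)
    return ["".join(words) for words in corpus.words(by_utterances=True)]


# Hypothetical usage, with "my_data.zip" standing in for your own data:
# segmented = segment_utterances(unsegmented_utterances("my_data.zip"))
```

Passing your own callable as segment lets you try a different segmenter without changing the loop.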
Thank you for your reply.
I am very excited to know about the new function! May I please confirm if, with the new function, the input would also have to be utterances instead of a .cha file or zip folder?
Thank you again!
With the new parse_text function, you can have your Cantonese text data in a plain text file (.txt perfectly fine, and no CHAT formatting needed), then read in the text file and pass the text string to parse_text. I haven't tested it yet, but I'd imagine something like the following:
import pycantonese

# Suppose you have data.txt with your Cantonese text.
with open("data.txt", encoding="utf-8") as f:
    # `f` is a file object for a plain text file,
    # and so the .read() call in the next line gives you the entire file's text as a string.
    corpus = pycantonese.parse_text(f.read())
# Then do whatever you'd like with the `corpus` object.
In this hypothetical code snippet, because f.read() is a string, parse_text would attempt simple utterance-level segmentation by the punctuation marks {",", "!", "。"} as well as the EOL character "\n". So this is the case of "input 1: a plain string" described in https://github.com/jacksonllee/pycantonese/issues/30#issue-1004419242. If you'd like more control over what counts as an utterance or not, then you'd have to do your own munging to pass in a list of strings (= the case of "input 2: a list of strings" in https://github.com/jacksonllee/pycantonese/issues/30#issue-1004419242).
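For the "do your own munging" route, one possible sketch is below. The helper split_utterances and its default delimiter set are assumptions made up for this example, not anything PyCantonese provides; it drops the delimiters and returns a list of utterance strings that could then be handed to parse_text as "input 2".

```python
import re
from typing import List


def split_utterances(text: str, delimiters: str = "。!?\n") -> List[str]:
    """Split text into utterances at any run of the given delimiter characters.

    The delimiters themselves are discarded, and empty pieces are skipped.
    """
    pattern = "[" + re.escape(delimiters) + "]+"
    pieces = re.split(pattern, text)
    return [piece.strip() for piece in pieces if piece.strip()]


# Hypothetical usage, assuming parse_text accepts a list of strings as
# described in the linked issue:
# import pycantonese
# with open("data.txt", encoding="utf-8") as f:
#     corpus = pycantonese.parse_text(split_utterances(f.read()))
```

Changing the delimiters argument (for example, to split on line breaks only) is where you control what counts as an utterance.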
I see. Thank you very much!
The new parse_text function has just been released alongside v3.4.0. More docs here.
Hello, may I please know if it would be possible to word-segment a .cha file or, better still, a zip folder containing .cha files? Thank you very much!