jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License
359 stars 39 forks

Undocumented differences between the HKCanCor corpus on HuggingFace vs PyCantonese #50

Open AlienKevin opened 2 months ago

AlienKevin commented 2 months ago

The version of HKCanCor published on HuggingFace by NTU is different from the version offered by this library in at least four undocumented ways:

  1. Total token counts differ: the NTU version has 160836 tokens while the PyCantonese version has 153654.
  2. PyCantonese uses a different definition of utterance, which appears to be a sentence ending with a period or question mark, whereas an NTU utterance can span multiple sentences. PyCantonese's segmentation also splits off closing quotation marks, producing utterances such as "一路一路剝. that begin with an unmatched quote.
  3. PyCantonese uses English punctuation marks while NTU uses Chinese ones.
  4. Some private use characters in the NTU version are rewritten as standard Unicode Chinese characters in PyCantonese.
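To illustrate difference 3, here is a minimal sketch of the kind of mapping involved (this is my own illustrative helper, not code from either release): rewriting the full-width Chinese punctuation in the NTU version as the ASCII marks PyCantonese uses, so that surface strings become comparable.

```python
# Hypothetical normalization helper (not part of either corpus release):
# map full-width Chinese punctuation onto the closest ASCII mark.
PUNCT_MAP = str.maketrans({
    "。": ".",   # full-width period
    "？": "?",   # full-width question mark
    "！": "!",   # full-width exclamation mark
    "，": ",",   # full-width comma
    "、": ",",   # enumeration comma (no exact ASCII equivalent; approximated)
    "：": ":",   # full-width colon
    "；": ";",   # full-width semicolon
})

def normalize_punct(s: str) -> str:
    """Rewrite Chinese punctuation as its closest ASCII counterpart."""
    return s.translate(PUNCT_MAP)

print(normalize_punct("一路一路剝。"))  # -> 一路一路剝.
```

The exact inventory of marks that were converted is an assumption here; the real conversion may have handled more (or different) characters.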

Would it be possible to document these differences, and any preprocessing steps done by PyCantonese, somewhere in the docs?

You can use the following script to compare the two versions of the corpus:

from datasets import load_dataset
import pycantonese

if __name__ == "__main__":
    print('==== HuggingFace ====')

    hf = load_dataset("nanyang-technological-university-singapore/hkcancor", trust_remote_code=True)

    hf_utterances = []
    hf_tokens = 0

    for utterance in hf['train']:
        hf_utterances.append(''.join(utterance['tokens']))
        hf_tokens += len(utterance['tokens'])

    print('Total tokens:', hf_tokens)

    print('Utterances before deduplication:', len(hf_utterances))
    hf_utterances = sorted(set(hf_utterances))
    print('Utterances after deduplication:', len(hf_utterances))

    longest_hf_utterance = max(hf_utterances, key=len)
    print('Length of the longest utterance:', len(longest_hf_utterance))

    # Load HKCanCor data
    hkcancor_data = pycantonese.hkcancor()
    hkcancor_tokens = 0

    # Build utterance strings and count tokens in a single pass
    hkcancor_utterances = []
    for utterance in hkcancor_data.tokens(by_utterances=True):
        hkcancor_utterances.append(''.join(token.word for token in utterance))
        hkcancor_tokens += len(utterance)

    print('==== PyCantonese ====')

    print('Total tokens:', hkcancor_tokens)

    print('Utterances before deduplication:', len(hkcancor_utterances))
    hkcancor_utterances = sorted(set(hkcancor_utterances))
    print('Utterances after deduplication:', len(hkcancor_utterances))

    longest_hkcancor_utterance = max(hkcancor_utterances, key=len)
    print('Length of the longest utterance:', len(longest_hkcancor_utterance))

Outputs:

==== HuggingFace ====
Total tokens: 160836
Utterances before deduplication: 10801
Utterances after deduplication: 9117
Length of the longest utterance: 876
==== PyCantonese ====
Total tokens: 153654
Utterances before deduplication: 16162
Utterances after deduplication: 13118
Length of the longest utterance: 145
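The utterance-count gap above (10801 vs 16162 before deduplication) is consistent with difference 2: one NTU utterance can map onto several PyCantonese utterances. A rough sketch of that split, assuming utterances break after sentence-final punctuation (my own approximation, not PyCantonese's actual segmentation code):

```python
import re

def split_utterance(text: str) -> list[str]:
    """Split after each period or question mark, keeping the mark.

    Trailing quotation marks are NOT kept with the preceding sentence,
    mirroring the quote-splitting artifact reported above.
    """
    parts = re.split(r"(?<=[.?])", text)  # zero-width split after . or ?
    return [p for p in parts if p]        # drop the empty trailing piece

print(split_utterance('佢話:"一路一路剝."跟住就走咗.'))
# -> ['佢話:"一路一路剝.', '"跟住就走咗.']
```

Note how the second piece starts with a stranded quotation mark, as in the example from the issue description.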
jacksonllee commented 2 months ago

The differences are due to two main reasons:

  1. I converted the source data into the CHAT data format for compatibility with other conversational datasets in linguistics. That's the reason for the English punctuation marks instead of Chinese ones, for utterances being delimited by periods or question marks, and probably also for Unicode characters replacing non-Unicode ones. Unfortunately, I've been unable to track down the exact code I used for the conversion, and since I did this almost ten years ago, I'm afraid I may not be able to explain every difference.
  2. I downloaded the source data from http://compling.hss.ntu.edu.sg/hkcancor/ (no longer accessible) back in 2014. The data pulled from HuggingFace comes from https://github.com/fcbond/hkcancor (this line), where the data was first committed in January 2020. There is evidence (e.g., https://github.com/jacksonllee/pycantonese/pull/22) that the original HKCanCor authors updated the data within their internal group between 2014 and 2020, which would account for some of the observed differences.
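For the "Unicode characters replacing non-Unicode ones" point, one quick way to check whether a given release still contains private use characters is to scan for codepoints in the private use area (assuming here the BMP range U+E000..U+F8FF; a hypothetical check, not part of the pycantonese codebase):

```python
def pua_chars(text: str) -> set[str]:
    """Return the private-use-area (U+E000..U+F8FF) characters in text."""
    return {ch for ch in text if 0xE000 <= ord(ch) <= 0xF8FF}

# A PUA codepoint mixed into ordinary text is flagged:
print(pua_chars("abc\ue21c中文"))  # -> {'\ue21c'}
print(pua_chars("普通文字"))       # -> set()
```

Running this over both corpora would show which side retains PUA characters and which has had them rewritten.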

For reference, here are the relevant commits in the pycantonese codebase:

Perhaps it would be possible (ideal, even?) to redo the CHAT format conversion by pulling data from https://github.com/fcbond/hkcancor and keep everything (including the preprocessing/conversion code) properly versioned so that we can track these things -- a project for another day :-)

AlienKevin commented 2 months ago

I see, thanks for your detailed answer.