liesenf / MYCanCor

Malaysia Cantonese Corpus (MYCanCor) - A video corpus of natural Cantonese conversations


Malaysia Cantonese Corpus (MYCanCor)
馬來西亞粵語語料庫

Andreas Liesenfeld  (Twitter: @a_liesenfeld) 



Introduction 簡介

The Malaysia Cantonese Corpus (MYCanCor) is a collection of recordings of Malaysian Cantonese speech, collected mainly in Perak, Malaysia. The corpus consists of around 20 hours of video recordings of spontaneous talk-in-interaction (56 settings), typically involving 2-4 speakers. A short scene description and basic speaker information are provided for each recording. The corpus is transcribed in CHAT (minCHAT) format and presented in traditional Chinese characters (UTF-8) using the Hong Kong Supplementary Character Set (HKSCS). MYCanCor is intended as a resource for researchers interested in spoken language processing or Chinese multimodal corpora.

馬來西亞粵語語料庫(MYCanCor)收錄的馬來西亞粵語對話主要錄影於馬來西亞霹靂州。本語料庫收錄了56個場景、約20小時的自然對話錄影資料,大多數情況下由2-4位參與者進行談話。語料庫附有每一段錄影的發生場景說明以及參與者的基本信息。轉錄格式採用了CHAT(minCHAT),以繁體中文(UTF-8)和香港增補字符集(HKSCS)標註。該語料庫致力於為對中文口頭語言處理或對錄音、影像等多種模式的中文素材感興趣的研究者提供分析資源。
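
The character-set statement above can be checked mechanically. The following is a minimal sketch, assuming that Python's built-in `big5hkscs` codec is an acceptable stand-in for the HKSCS repertoire (it may lag behind newer HKSCS editions). The function name `outside_hkscs` is hypothetical and not part of the corpus tooling.

```python
def outside_hkscs(text):
    """Return the characters in `text` that the big5hkscs codec cannot encode.

    Assumption: Python's "big5hkscs" codec approximates the Big5 + HKSCS
    repertoire; characters added in later HKSCS editions may be reported
    here even though they are valid HKSCS characters.
    """
    unsupported = []
    for ch in text:
        try:
            ch.encode("big5hkscs")
        except UnicodeEncodeError:
            unsupported.append(ch)
    return unsupported


# Example: an ordinary Chinese character passes, an emoji does not.
print(outside_hkscs("你"))
print(outside_hkscs("你😀"))
```

Running this over a transcript string gives a quick list of characters worth a second look before assuming the file follows the stated conventions.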


Request access to the data  申請訪問資源

   Sign up to download MYCanCor (200GB) (to appear)


Downloads 下載

 

References 參考文獻


Links 連結

Transcription Example (CHAT format)

@Begin
@Languages: zho-yue
@Participants: P1 Wong Older Sister, P2 Chan Younger Sister
@ID: zho-yue|mycancor|P1|27;1.10|||| Target_P2|||
@ID: zho-yue|mycancor|P2|39;2.|||| Target_P1|||

*P2: 你食咩啊.
%com: Every utterance ends with an utterance terminator (period).

*P1: 白果薏米.
*P2: 同怡保好似好唔同 哈哈哈.
*P1: 唔同啊(0.1)冇得比啦. 

%com: Utterances are segmented by pauses exceeding 0.1 seconds.

*P2: 依但係.
*P1: 但因為 因為佢冇煮溶個. 

%com: All lexical items (onset, nucleus, coda, tone) are transcribed as Chinese characters.

%act: P2 points at the bowl.

%com: Gestures may be annotated as informal descriptions.

*P2: xxx睇下 個腐竹.
%com: Unintelligible or incomplete lexical units are transcribed as xxx.
*P1: 個腐竹 係咯 同埋唔知點解佢 唔係白色咯.
%com: Modal Particles and Modal Particle Morphemes are transcribed following UTF-8 and HKSCS conventions.

@End 
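
The structure of the example above (`@` header lines, `*` speaker tiers, `%` dependent tiers such as `%com` and `%act`) can be read with a few lines of Python. This is a minimal sketch under stated assumptions, not the corpus's own tooling: the function name `parse_chat`, the output dictionary layout, and the rule that a dependent tier attaches to the most recent utterance are all choices made for illustration. It covers only the features visible in the example, not the full CHAT specification.

```python
import re

# Abridged copy of the example transcript above.
SAMPLE = """\
@Begin
@Languages: zho-yue
@Participants: P1 Wong Older Sister, P2 Chan Younger Sister

*P2: 你食咩啊.
%com: every utterance ends with an utterance terminator (period).

*P1: 白果薏米.
*P1: 唔同啊(0.1)冇得比啦.
%act: P2 points at the bowl.

*P2: xxx睇下 個腐竹.
@End
"""

# One line: a marker (@ header, * speaker tier, % dependent tier),
# a label, an optional colon, then the rest of the line.
LINE_RE = re.compile(r"^([@*%])([^:\s]+):?\s*(.*)$")
# Timed pauses written inline as (seconds), e.g. (0.1).
PAUSE_RE = re.compile(r"\((\d+(?:\.\d+)?)\)")


def parse_chat(text):
    """Split a CHAT-style transcript into headers and utterances."""
    headers = {}
    utterances = []
    for raw in text.splitlines():
        match = LINE_RE.match(raw.strip())
        if match is None:
            continue  # blank or unrecognised line
        marker, label, body = match.groups()
        if marker == "@":
            # Headers such as @ID can repeat, so keep a list per label.
            headers.setdefault(label, []).append(body)
        elif marker == "*":
            utterances.append({
                "speaker": label,
                "text": body,
                "pauses": [float(p) for p in PAUSE_RE.findall(body)],
                "unintelligible": "xxx" in body,
                "tiers": [],
            })
        elif utterances:
            # Assumption: a % tier belongs to the most recent utterance.
            utterances[-1]["tiers"].append((label, body))
    return {"headers": headers, "utterances": utterances}


result = parse_chat(SAMPLE)
for utt in result["utterances"]:
    print(utt["speaker"], utt["text"], utt["tiers"])
```

Keeping repeated headers in a list avoids silently dropping the second `@ID` line when a full transcript is parsed.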


Tagset

See the paper for a full description.

| No. | Tag | POS (in Chinese) | POS (in English) |
|-----|-----|------------------|------------------|
| 1 | Ag | 形语素 | Adjective Morpheme |
| 2 | a | 形容词 | Adjective |
| 3 | ad | 副形词 | Adjective as Adverbial |
| 4 | an | 名形词 | Adjective with Nominal Function |
| 5 | Bg | 区别语素 | Non-predicate Adjective Morpheme |
| 6 | b | 区别词 | Non-predicate Adjective |
| 7 | c | 连词 | Conjunction |
| 8 | Dg | 副语素 | Adverb Morpheme |
| 9 | d | 副词 | Adverb |
| 10 | e | 叹词 | Interjection |
| 11 | f | 方位词 | Directional Locality |
| 12 | g | 语素 | Morpheme |
| 13 | h | 前接成分 | Prefix |
| 14 | i | 成语 | Idiom |
| 15 | j | 简略语 | Abbreviation |
| 16 | k | 后接成分 | Suffix |
| 17 | l | 习用语 | Fixed Expression |
| 18 | Mg | 数语素 | Numeric Morpheme |
| 19 | m | 数词 | Numeral |
| 20 | Ng | 名语素 | Noun Morpheme |
| 21 | n | 名词 | Common Noun |
| 22 | nr | 人名 | Personal Name |
| 23 | ns | 地名 | Place Name |
| 24 | nt | 机构团体 | Organisation Name |
| 25 | nx | 外文字符 | Nominal Character String |
| 26 | nz | 其它专名 | Other Proper Noun |
| 27 | o | 拟声词 | Onomatopoeia |
| 28 | p | 介词 | Preposition |
| 29 | Qg | 量语素 | Classifier Morpheme |
| 30 | q | 量词 | Classifier |
| 31 | Rg | 代语素 | Pronoun Morpheme |
| 32 | r | 代词 | Pronoun |
| 33 | s | 处所词 | Space Word |
| 34 | Tg | 时间语素 | Time Word Morpheme |
| 35 | t | 时间词 | Time Word |
| 36 | Ug | 助语素 | Auxiliary Morpheme |
| 37 | u | 助词 | Auxiliary |
| 38 | Vg | 动语素 | Verb Morpheme |
| 39 | v | 动词 | Verb |
| 40 | vd | 副动词 | Verb as Adverbial |
| 41 | vn | 名动词 | Verb with Nominal Function |
| 42 | w | 标点符号 | Punctuation |
| 43 | x | 非语素字 | Unclassified Item |
| 44 | Yg | 语气语素 | Modal Particle Morpheme |
| 45 | y | 语气词 | Modal Particle |
| 46 | z | 状态词 | Descriptive |
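
For scripted use, the tagset can be held in a lookup table. The sketch below copies only a handful of rows from the tagset and is meant to be extended with the rest. The helper names `describe` and `word_class` are hypothetical. The pairing of morpheme-level tags (capital letter plus `g`, such as `Ng`) with word-level tags (such as `n`) is an inference from the naming pattern in the tagset, not something the source states; the paper should be consulted before relying on it.

```python
# A few rows copied from the tagset; add the remaining tags as needed.
TAGSET = {
    "Ag": "Adjective Morpheme",
    "a": "Adjective",
    "Ng": "Noun Morpheme",
    "n": "Common Noun",
    "Vg": "Verb Morpheme",
    "v": "Verb",
    "vn": "Verb with Nominal Function",
    "Yg": "Modal Particle Morpheme",
    "y": "Modal Particle",
}


def describe(tag):
    """Return the English POS name for a tag, or a placeholder if unknown."""
    return TAGSET.get(tag, "unknown tag")


def word_class(tag):
    """Map a morpheme-level tag such as 'Ng' to its word-level tag 'n'.

    Assumption: tags of the form <capital letter> + 'g' are the morpheme
    counterparts of the matching lowercase tag. Other tags are returned
    unchanged.
    """
    if len(tag) == 2 and tag[0].isupper() and tag[1] == "g":
        return tag[0].lower()
    return tag


print(describe("Yg"))
print(word_class("Yg"), describe(word_class("Yg")))
```

Tags are case-sensitive in this table (`Ag` and `a` are distinct entries), so lookups should not normalise case.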