j0ma / ancestral-decipherment

Corpus counts #1

Open j0ma opened 1 year ago

Notes on corpora

Corpus counts

Number of lines/sentences

PanLex

11450 train 2454 dev 2454 test

Tatoeba

2387 train 512 dev 512 test

Number of space-separated tokens

PanLex

11450 train 2454 dev 2454 test

Tatoeba

12562 train 2618 dev 2792 test

Number of unique tokens

PanLex

6673 train 2050 dev 2048 test

Tatoeba

1575 train 893 dev 933 test
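
For reference, the three kinds of counts above (lines, space-separated tokens, unique tokens) could be reproduced with something like the sketch below. It is not necessarily how the numbers in this issue were produced: the function names `corpus_counts` and `counts_for_file` are hypothetical, and it assumes each split is a UTF-8 text file with one entry per line, that tokens are split on any whitespace, and that token comparison is case-sensitive.

```python
from pathlib import Path


def corpus_counts(text: str) -> dict:
    """Return line, token, and unique-token counts for a corpus string.

    Assumptions (not confirmed by the issue): one entry per line,
    tokens separated by whitespace, case-sensitive comparison.
    """
    lines = text.splitlines()
    # str.split() with no argument splits on any run of whitespace,
    # so repeated spaces or tabs do not produce empty tokens.
    tokens = [tok for line in lines for tok in line.split()]
    return {
        "lines": len(lines),
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
    }


def counts_for_file(path) -> dict:
    """Read a split file (hypothetical layout) and count its contents."""
    return corpus_counts(Path(path).read_text(encoding="utf-8"))
```

Calling `counts_for_file` once per split file (train, dev, test) would give one row of each block above. For a word list with a single token per line, the line and token counts coincide, which matches the identical PanLex figures in the first two blocks.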