jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

Return POS and the character? #19

Closed jacoblhchan closed 5 years ago

jacoblhchan commented 5 years ago

Hi, is it possible to return all of the sentences from the corpus as characters, together with their POS tags?

jacksonllee commented 5 years ago

Hello! Yes. The pycantonese API mimics NLTK's for corpus data access. Once you have a corpus object (pycantonese currently includes the HKCanCor dataset), associated methods such as tagged_sents() return "sentences" in which each word is a tuple that starts with the characters and the POS tag, followed by the Jyutping romanization and a fourth field that is empty in the output here. So you could do something like this:

In [1]: import pycantonese as pc

In [2]: hkcancor = pc.hkcancor()

In [3]: tagged_sents = hkcancor.tagged_sents()

In [4]: for tagged_sent in tagged_sents[:2]:  # Just print the first two "sentences" in the HKCanCor dataset. Remove [:2] to do whatever you want in the loop for the entire dataset.
   ...:     print(tagged_sent)
   ...:
[('喂', 'E', 'wai3', ''), ('遲', 'A', 'ci4', ''), ('啲', 'U', 'di1', ''), ('去', 'V', 'heoi3', ''), ('唔', 'D', 'm4', ''), ('去', 'V', 'heoi3', ''), ('旅行', 'VN', 'leoi5hang4', ''), ('啊', 'Y', 'aa3', ''), ('?', '?', '', '')]
[('你', 'R', 'nei5', ''), ('老公', 'N', 'lou5gung1', ''), ('有冇', 'V1', 'jau5mou5', ''), ('平', 'A', 'peng4', ''), ('機票', 'N', 'gei1piu3', ''), ('啊', 'Y', 'aa3', ''), ('?', '?', '', '')]
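If only the characters and POS tags are needed, the extra fields can be dropped afterwards. Here is a minimal sketch, assuming each item is a 4-tuple laid out as in the output above. The helper name `word_pos_pairs` is hypothetical and not part of pycantonese, and the sample data is copied from that output so the snippet runs without the library installed:

```python
# Sample data copied from the printed output above. With pycantonese
# installed, this list would come from hkcancor.tagged_sents() instead.
tagged_sents = [
    [('喂', 'E', 'wai3', ''), ('遲', 'A', 'ci4', ''), ('啲', 'U', 'di1', ''),
     ('去', 'V', 'heoi3', ''), ('唔', 'D', 'm4', ''), ('去', 'V', 'heoi3', ''),
     ('旅行', 'VN', 'leoi5hang4', ''), ('啊', 'Y', 'aa3', ''), ('?', '?', '', '')],
    [('你', 'R', 'nei5', ''), ('老公', 'N', 'lou5gung1', ''),
     ('有冇', 'V1', 'jau5mou5', ''), ('平', 'A', 'peng4', ''),
     ('機票', 'N', 'gei1piu3', ''), ('啊', 'Y', 'aa3', ''), ('?', '?', '', '')],
]


def word_pos_pairs(tagged_sent):
    """Hypothetical helper: keep only the first two fields of each item,
    i.e. the characters and the POS tag."""
    return [(item[0], item[1]) for item in tagged_sent]


# One list of (characters, POS tag) pairs per "sentence".
pairs = [word_pos_pairs(sent) for sent in tagged_sents]

for sent in pairs:
    print(sent)
```

Indexing by position rather than unpacking four names keeps the helper working even if the tuples carry a different number of trailing fields.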

Is this what you're after?

jacoblhchan commented 5 years ago

Yes, that is it! Thanks for the quick reply, Jackson. I am trying to use this to bootstrap and build a POS tagger.