jacksonllee / pycantonese

Cantonese Linguistics and NLP
https://pycantonese.org
MIT License

Does Word Segmentation give position of the vocabularies? #42

Open shivanraptor opened 1 year ago

shivanraptor commented 1 year ago

Feature you are interested in and your specific question(s): I'm studying the word segmentation feature of PyCantonese (https://pycantonese.org/word_segmentation.html). Does the function also return the start and end positions of each word?

What you are trying to accomplish with this feature or functionality: I would like to achieve:

import pycantonese
from pycantonese.word_segmentation import Segmenter
segmenter = Segmenter()
result = pycantonese.segment("廣東話容唔容易學?", cls=segmenter)
print(result)

Current result:

['廣東話', '容', '唔', '容易', '學', '?']

I would like to get the following result instead, with the start and end position of each word:

[('廣東話', 0, 3), ('容', 3, 4), ('唔', 4, 5), ('容易', 5, 7), ('學', 7, 8), ('?', 8, 9)]

Thanks.

shivanraptor commented 1 year ago

Something like:

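import pycantonese
from pycantonese.word_segmentation import Segmenter

def custom_segment(input_str: str):
    segmenter = Segmenter()  # possible attributes: disallow, allow, max_word_length
    pyseg = pycantonese.segment(input_str, cls=segmenter)
    tokenized = []
    cursor = 0  # resume each search after the previous word so repeated words get distinct offsets
    for word in pyseg:
        start = input_str.find(word, cursor)
        end = start + len(word)
        tokenized.append((word, start, end))
        cursor = end
    return tokenized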
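
One thing to watch with this kind of lookup: calling `str.find` from the start of the string every time returns the first occurrence, so a word that appears more than once would get the same offsets each time. Below is a minimal sketch of the offset step on its own, assuming the segmenter returns the words in order and that each word appears verbatim in the input. The helper name `attach_offsets` is hypothetical and not part of PyCantonese; it takes the token list as an argument, so it runs without the library installed.

```python
from typing import List, Tuple


def attach_offsets(text: str, tokens: List[str]) -> List[Tuple[str, int, int]]:
    """Pair each token with its (start, end) character offsets in text.

    Hypothetical helper, not a PyCantonese API. Assumes the tokens are
    in order and each one appears verbatim in text.
    """
    spans = []
    cursor = 0  # index just past the previous token
    for token in tokens:
        # Search from the cursor so a repeated token maps to its next occurrence.
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found at or after index {cursor}")
        end = start + len(token)
        spans.append((token, start, end))
        cursor = end
    return spans


# Token list copied from the "Current result" output above.
tokens = ['廣東話', '容', '唔', '容易', '學', '?']
print(attach_offsets("廣東話容唔容易學?", tokens))
# [('廣東話', 0, 3), ('容', 3, 4), ('唔', 4, 5), ('容易', 5, 7), ('學', 7, 8), ('?', 8, 9)]
```

In real use, the token list would come from `pycantonese.segment(...)`. The offsets are Python string indices (code points), with `end` exclusive, which matches the format requested above.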