vncorenlp / VnCoreNLP

A Vietnamese natural language processing toolkit (NAACL 2018)

Different size of tokens and words #1

Closed villahp closed 6 years ago

villahp commented 6 years ago

If the string ends with a number, the size of the list returned by getTokens() differs from the size of the list returned by getWords().

datquocnguyen commented 6 years ago

Hi, I do not really understand what your problem is. A "token" here is a syllable, not a "word", so the number of tokens/syllables is likely to differ from the number of words in a sentence.

villahp commented 6 years ago

How can I get the list of segmented words? Thanks in advance.

tienthanhdhcn commented 6 years ago

@villahp The getWords function of Annotation returns the list of segmented words. Sample code:

import vn.pipeline.*;
import java.io.*;

public class VnCoreNLPExample {
    public static void main(String[] args) throws IOException {
        String str = "Bà Ngọc Lan đang đến thăm Hà Nội.";
        // Load only the word segmentation annotator.
        String[] annotators = {"wseg"};
        VnCoreNLP pipeline = new VnCoreNLP(annotators);
        // Wrap the raw text and run the pipeline on it.
        Annotation annotation = new Annotation(str);
        pipeline.annotate(annotation);
        // Print one segmented word per line.
        for (Word word : annotation.getWords())
            System.out.println(word.getForm());
    }
}
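
The syllable/word distinction explained earlier in the thread can be illustrated without the toolkit itself. The sketch below assumes that a word form joins its syllables with an underscore (as in `Hà_Nội`). The segmented list is hand-written for illustration and is not actual pipeline output. `SyllableCount`, `countSyllables`, and `totalSyllables` are hypothetical names, not part of VnCoreNLP.

```java
import java.util.Arrays;
import java.util.List;

class SyllableCount {
    // Counts syllables in one word form, assuming syllables are joined by '_'.
    static int countSyllables(String wordForm) {
        return wordForm.split("_").length;
    }

    // Sums the syllable counts over a list of word forms.
    static int totalSyllables(List<String> wordForms) {
        int total = 0;
        for (String form : wordForms) {
            total += countSyllables(form);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hand-written segmentation (an assumption for illustration only).
        List<String> words = Arrays.asList(
                "Bà", "Ngọc_Lan", "đang", "đến", "thăm", "Hà_Nội", ".");
        System.out.println("words: " + words.size());
        System.out.println("syllables: " + totalSyllables(words));
    }
}
```

Under these assumptions, any sentence containing a multi-syllable word yields fewer words than syllables, so the two counts are not expected to match in general.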