How to modify the spaCy NER or POS for Chinese which has no space as tokenization mark?

fishfree commented 1 year ago

spaCy tokenizes text automatically before POS tagging or NER when a Chinese language model such as zh_core_web_sm is loaded. But the function predict() in the class SpacyNerClassifier passes an already tokenized list to the Doc class's __init__ as the words parameter.

        # Extract the tokens from the CAS and create a spacy doc from it
        cas_tokens = cas.select(TOKEN_TYPE)
        words = [cas.get_covered_text(cas_token) for cas_token in cas_tokens]

        doc = Doc(self._model.vocab, words=words)

        # Find the named entities
        self._model.get_pipe("ner")(doc)

I cannot figure out how to use the normal pipeline instead of "cas.select(TOKEN_TYPE)". Or, to put it another way: which input parameter of the predict function is the document?

reckart commented 1 year ago

If you want the document text instead of the tokens, you can use cas.sofa_string.

fishfree commented 1 year ago

@reckart Thank you! But I don't know how to modify the code, sorry! Would you pls share more details?

fishfree commented 1 year ago

@reckart Sorry to ping you again. Could you help me, please? :-)

reckart commented 1 year ago

I can point you to the relevant documentation, but I'm afraid I cannot teach you programming.

fishfree commented 1 year ago

@reckart OK, please point me to the docs. Thank you!

reckart commented 1 year ago

My understanding is that you want to pass the entire text of the document to some spaCy function and you do not know how to get the entire text. Try using cas.sofa_string (note the link).

fishfree commented 1 year ago

@reckart Thank you! So I modified the code in the function predict() in the class SpacyNerClassifier as below:

        # Extract the tokens from the CAS and create a spacy doc from it
        #cas_tokens = cas.select(TOKEN_TYPE)
        #words = [cas.get_covered_text(cas_token) for cas_token in cas_tokens]

        #doc = Doc(self._model.vocab, words=words)
        doc = Doc(self._model.vocab, words=cas.sofa_string)
        # Find the named entities
        self._model.get_pipe("ner")(doc)

        # For every entity returned by spacy, create an annotation in the CAS
        for named_entity in doc.ents:
            #begin = cas_tokens[named_entity.start].begin
            #end = cas_tokens[named_entity.end - 1].end
            begin = (cas.sofa_string)[named_entity.start].begin
            end = (cas.sofa_string)[named_entity.end - 1].end
            label = named_entity.label_
            prediction = create_prediction(cas, layer, feature, begin, end, label)
            cas.add_annotation(prediction)

However, there are no longer any predictions. I think that's because I changed the begin and end calculation incorrectly. What's more, it would be great if TOKEN_TYPE could support jieba tokenization specifically for Chinese, but I don't know how to program a new TOKEN_TYPE.
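On the jieba idea: whichever segmenter is used, what the recommender needs in the end is a begin and end character offset per word. A minimal sketch, assuming the segmenter returns the words in document order; `align_words` is a hypothetical helper and the word list is hard-coded as a stand-in for real segmenter output:

```python
def align_words(text, words):
    """Map an ordered list of words to (begin, end) character offsets in text."""
    spans = []
    cursor = 0
    for word in words:
        if not word.strip():
            continue  # ignore empty or whitespace-only pieces
        begin = text.find(word, cursor)
        if begin < 0:
            raise ValueError(f"word {word!r} not found after offset {cursor}")
        end = begin + len(word)
        spans.append((begin, end))
        cursor = end  # keep searching to the right of the previous word
    return spans


text = "我在北京工作"
words = ["我", "在", "北京", "工作"]  # assumption: output of some word segmenter
spans = align_words(text, words)
```

Each resulting pair can be used to slice the document text, so `text[begin:end]` gives back the word it was computed from.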

fishfree commented 1 year ago

@reckart Please ignore my last reply. I just found that there is a token type de.tudarmstadt.ukp.dkpro.core.mecab.type.JapaneseToken in the file, so I changed the code to:

        #cas_tokens = cas.select(TOKEN_TYPE)
        cas_tokens = cas.select('de.tudarmstadt.ukp.dkpro.core.mecab.type.JapaneseToken')

However, an error occurred as below. It seems we need to implement or inherit some interfaces or classes? Which ones?

reckart commented 1 year ago

INCEpTION creates tokens of the type de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token and does not understand other types of tokens. You can pre-process your data before importing it into INCEpTION so that you create a de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token for every single character (or word in your language) and a de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence for each run of characters that represents a sentence in your language, but this would require you to do some programming.

Cf. https://colab.research.google.com/github/inception-project/inception/blob/master/notebooks/using_pretokenized_and_preannotated_text.ipynb
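The pre-processing described above mostly comes down to computing character offsets. A minimal sketch, assuming one token per non-whitespace character and sentences that end at 。！？!? (both helper names are hypothetical, and the cassis calls in the trailing comments are written from memory and should be checked against the linked notebook):

```python
def char_token_spans(text):
    """One (begin, end) span per non-whitespace character."""
    return [(i, i + 1) for i, ch in enumerate(text) if not ch.isspace()]


def sentence_spans(text, enders="。！？!?"):
    """(begin, end) spans for runs of characters that end in sentence punctuation."""
    spans = []
    begin = None
    for i, ch in enumerate(text):
        if begin is None:
            if ch.isspace():
                continue  # skip whitespace between sentences
            begin = i
        if ch in enders:
            spans.append((begin, i + 1))
            begin = None
    if begin is not None:
        # trailing text without final punctuation still forms a sentence
        spans.append((begin, len(text.rstrip())))
    return spans


text = "我在北京工作。你呢？"
tokens = char_token_spans(text)
sentences = sentence_spans(text)

# Assumed cassis usage, not executed here:
# Token = typesystem.get_type("de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token")
# for begin, end in tokens:
#     cas.add_annotation(Token(begin=begin, end=end))
```

The same loop would apply to the Sentence type with the spans from `sentence_spans`.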

Regarding your modification of the spaCy recommender: I don't know what named_entity.start represents, but if it is a character offset in the document, then you can use it directly as the begin of the prediction. If it is a spaCy token index, then you would need to find the respective spaCy token, obtain its start character offset, and use that as the begin.
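For reference, in spaCy `Span.start` and `Span.end` are token indices, while `Span.start_char` and `Span.end_char` are character offsets into the text the Doc was built from. A minimal sketch, assuming the doc comes from calling the loaded pipeline on `cas.sofa_string` so that spaCy does its own tokenization; `entity_char_spans` is a hypothetical helper, and the SimpleNamespace objects are stand-ins so the example runs without a model:

```python
from types import SimpleNamespace


def entity_char_spans(doc):
    """Return (begin, end, label) character offsets for each entity in a spaCy-like doc."""
    return [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]


# Stand-in for the Doc that `self._model(cas.sofa_string)` would return.
# Tokens here are imagined as 我 / 在 / 北京 / 工作, so the entity is token 2.
text = "我在北京工作"
fake_doc = SimpleNamespace(
    text=text,
    ents=[SimpleNamespace(start=2, end=3, start_char=2, end_char=4, label_="GPE")],
)

spans = entity_char_spans(fake_doc)
```

Inside predict(), this would presumably mean building the doc with `doc = self._model(cas.sofa_string)` and passing each `begin`, `end`, and `label` to `create_prediction(cas, layer, feature, begin, end, label)`; that part is untested here.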

fishfree commented 1 year ago

Thank you for your tip.

inception-project / inception-external-recommender, issue #47