If you want the document text instead of the tokens, you can use cas.sofa_string.
@reckart Thank you! But I don't know how to modify the code, sorry! Would you please share more details?
@reckart Sorry to bother you again. Could you please help me? :-)
I can point you to the relevant documentation, but I'm afraid I cannot teach you programming.
@reckart OK, please point me to the doc. Thank you!
My understanding is that you want to pass the entire text of the document to some spacy function and you do not know how to get the entire text - try using cas.sofa_string (note the link).
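For example, with dkpro-cassis (the file paths below are just placeholders to show where sofa_string lives):

from cassis import load_typesystem, load_cas_from_xmi

# Placeholder paths - any XMI / type system pair exported from INCEpTION would do
with open('TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)
with open('document.xmi', 'rb') as f:
    cas = load_cas_from_xmi(f, typesystem=typesystem)

print(cas.sofa_string[:100])  # the entire document text as one Python string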
@reckart Thank you! So I modified the code in the function predict() in the class SpacyNerClassifier as below:
# Extract the tokens from the CAS and create a spacy doc from it
#cas_tokens = cas.select(TOKEN_TYPE)
#words = [cas.get_covered_text(cas_token) for cas_token in cas_tokens]
#doc = Doc(self._model.vocab, words=words)
doc = Doc(self._model.vocab, words=cas.sofa_string)
# Find the named entities
self._model.get_pipe("ner")(doc)
# For every entity returned by spacy, create an annotation in the CAS
for named_entity in doc.ents:
    #begin = cas_tokens[named_entity.start].begin
    #end = cas_tokens[named_entity.end - 1].end
    begin = (cas.sofa_string)[named_entity.start].begin
    end = (cas.sofa_string)[named_entity.end - 1].end
    label = named_entity.label_
    prediction = create_prediction(cas, layer, feature, begin, end, label)
    cas.add_annotation(prediction)
However, there are no predictions any more. I think that is because I wrongly modified the begin and end calculation. What's more, it would be great if TOKEN_TYPE could support jieba tokenization specifically for Chinese, but I don't know how to program a new TOKEN_TYPE.
@reckart Please ignore my last reply. I just found there is a token type de.tudarmstadt.ukp.dkpro.core.mecab.type.JapaneseToken in the file, so I changed this into:
#cas_tokens = cas.select(TOKEN_TYPE)
cas_tokens = cas.select('de.tudarmstadt.ukp.dkpro.core.mecab.type.JapaneseToken')
However, the error occurred as below. It seems we need to implement or inherit some interfaces or classes - which ones?
INCEpTION creates tokens of the type de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token and does not understand other types of tokens. You can pre-process your data before importing it into INCEpTION in such a way that you create a de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token for every single character (or word in your language) and a de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence for every run of characters that represents a sentence in your language - but this would require you to do some programming.
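Very roughly, such a pre-processing step could look like this with dkpro-cassis - only a sketch: one Token per character and one Sentence per '。'-terminated run are just illustrations, and the type system file has to define the DKPro Core Token and Sentence types:

from cassis import Cas, load_typesystem

TOKEN_TYPE = 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token'
SENTENCE_TYPE = 'de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence'

# The type system file must already declare the DKPro Core Token and Sentence types
with open('TypeSystem.xml', 'rb') as f:
    typesystem = load_typesystem(f)

Token = typesystem.get_type(TOKEN_TYPE)
Sentence = typesystem.get_type(SENTENCE_TYPE)

cas = Cas(typesystem=typesystem)
cas.sofa_string = '我爱北京。天安门很大。'

# One Token per character - replace this with jieba output for word-level tokens
for i, ch in enumerate(cas.sofa_string):
    if not ch.isspace():
        cas.add_annotation(Token(begin=i, end=i + 1))

# One Sentence per run of characters ending in a full stop
start = 0
for i, ch in enumerate(cas.sofa_string):
    if ch == '。':
        cas.add_annotation(Sentence(begin=start, end=i + 1))
        start = i + 1

cas.to_xmi('document.xmi', pretty_print=True)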
Regarding your modification of the spacy recommender: I don't know what named_entity.start represents, but if it is a character offset in the document, then you can use it directly as the begin of the prediction. If it is a spacy token index, then you would need to find the respective spacy token, obtain its start character offset, and use that as the begin.
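In spaCy, named_entity.start is in fact a token index, so in your snippet the offsets would have to come from the tokens of the spacy doc rather than from indexing cas.sofa_string - roughly like this (an untested sketch against your code above):

for named_entity in doc.ents:
    # .start/.end are token indices; a token's .idx is its character offset
    first_token = doc[named_entity.start]
    last_token = doc[named_entity.end - 1]
    begin = first_token.idx
    end = last_token.idx + len(last_token)
    # spaCy also exposes these directly as named_entity.start_char / end_char
    # Note: these offsets only line up with the CAS if the Doc was created from
    # the untokenized document text, not from a list of pre-split "words"
    label = named_entity.label_
    prediction = create_prediction(cas, layer, feature, begin, end, label)
    cas.add_annotation(prediction)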
Thank you for your tip.
spaCy automatically tokenizes before POS tagging or NER when a Chinese language model such as zh_core_web_sm is loaded. But the function predict() in the class SpacyNerClassifier passes an already tokenized list to the Doc constructor.
I cannot figure out how to just use the normal pipeline instead of "cas.select(TOKEN_TYPE)". Or to ask: which input parameter of the predict function is the document?
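What I imagine (but am not sure is correct) is something like this inside predict(), letting the loaded zh_core_web_sm pipeline do its own tokenization:

# Run the full spaCy pipeline (its own tokenizer included) on the raw document text
doc = self._model(cas.sofa_string)

# start_char / end_char are offsets into the same string the CAS holds
for named_entity in doc.ents:
    prediction = create_prediction(cas, layer, feature,
                                   named_entity.start_char,
                                   named_entity.end_char,
                                   named_entity.label_)
    cas.add_annotation(prediction)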