pycaprio - upload a document, but can't download typeSystem and cas right now.

inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.

https://inception-project.github.io

Apache License 2.0

pycaprio - upload a document, but can't download typeSystem and cas right now. #2973

Closed WilliamQue closed 2 years ago

WilliamQue commented 2 years ago

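from io import StringIO

from pycaprio.mappings import DocumentState, InceptionFormat

# delALLDocuments and getAnnoDetails are helper functions not shown in this snippet.
delALLDocuments()

# Upload the sentences as a plain-text document, one sentence per line.
with StringIO('\n'.join(sentList)) as sio:
    new_document = inceptionClient.api.create_document(currentProjectId, callId, sio, document_format=InceptionFormat.TEXT_SENTENCE_PER_LINE, document_state=DocumentState.DEFAULT)

docs = inceptionClient.api.documents(currentProjectId)
doc = docs[0]

typeSystem, cas = getAnnoDetails(currentProjectId, doc)

for sentence in cas.select(dCasObjTypes['sentence']):  # Error: cas is None
    print(sentence, sentence.get_covered_text())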

In the last two lines, cas is reported as None. But after I log in to the INCEpTION website and open the uploaded document manually, the last two lines execute correctly. Are there any steps that need to be done between uploading the document and calling getAnnoDetails?

reckart commented 2 years ago

There is no CAS for a given document/annotator unless the annotator has started editing the document (i.e. the annotation status is not NEW or IGNORE). When you log in and open a document for annotation, the CAS is created for that annotator, the state for the annotator changes to IN_PROGRESS, and then you can download the CAS. So you would have to check the annotation status, and for documents that a given annotator has not started yet, you could download the source document instead of the annotation document (the annotation CAS is initialized with a copy of the source document CAS when an annotator starts editing).

WilliamQue commented 2 years ago

In a common annotation process, raw text files are first annotated by machine (using regular expressions or other kinds of pattern recognition). That handles the simple cases and reduces the annotators' workload, so that annotators can pay more attention to finding deep and complex patterns. My raw text files have now been tokenized and annotated, and I upload them to INCEpTION so that annotators can do the deeper work. So is there a way to create the CAS automatically right after the document is uploaded, so that the machine-generated annotations are added first and annotators do not start from zero?

You could add one more document state to make this possible.

reckart commented 2 years ago

When you import a document (which may be pre-annotated), internally a "source document" entry and an "initial CAS" are created from it. This "initial CAS" contains all the annotations that were on the text when you imported it.

When an annotator starts annotating, an "annotation document" entry and an "annotation CAS" are created. The "annotation CAS" is created by making a copy of the "initial CAS". This copy is then private to the particular annotator for whom it was created. It starts with all the annotations that were present in the "initial CAS", so annotators do not have to start from zero.
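The copy-on-first-open behaviour described above can be pictured with a toy model. This is only an illustrative sketch, not INCEpTION's implementation: the class names `ToyCas` and `ToyDocument` and the method `open_for` are hypothetical, and annotations are reduced to plain tuples.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ToyCas:
    text: str
    annotations: list = field(default_factory=list)  # (begin, end, label) tuples


@dataclass
class ToyDocument:
    initial_cas: ToyCas                                    # created at import time
    annotation_cas_by_user: dict = field(default_factory=dict)

    def open_for(self, user):
        # First open: copy the initial CAS. Later opens reuse the same private copy.
        if user not in self.annotation_cas_by_user:
            self.annotation_cas_by_user[user] = copy.deepcopy(self.initial_cas)
        return self.annotation_cas_by_user[user]


doc = ToyDocument(ToyCas("Anna lives in Berlin.", [(0, 4, "PER")]))
anna = doc.open_for("anna")
anna.annotations.append((14, 20, "LOC"))  # private edit, invisible to others
print(doc.initial_cas.annotations)         # the initial CAS keeps only the pre-annotation
print(doc.open_for("ben").annotations)     # ben starts from the pre-annotations
```

In this model, a user who has never called `open_for` has no annotation CAS at all, which mirrors why the download in the original question returned nothing.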

WilliamQue commented 2 years ago

Oh, I see! Make an initial CAS -> get a copy -> change the sofa text and add annotations -> upload. Before the document is opened, everything is OK.

But a manual step is still needed.

reckart commented 2 years ago

If you want to retrieve CASes via the remote API, you should check the status of the CAS. You may not want to retrieve it unless its state is "finished". If you actually do want to retrieve it irrespective of the state, then you should check the state and export the initial CAS if the state is "new" and otherwise the annotation CAS. To get the initial CAS, you can use the download document call instead of the download annotation call. I think that removes any manual step.
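A minimal sketch of that state check, written against a duck-typed `api` object so it runs without a server. It assumes pycaprio-style calls `api.annotations(project, document)`, `api.annotation(project, document, user_name, annotation_format=...)` and `api.document(project, document, document_format=...)`, and that each annotation entry exposes `user_name` and `annotation_state`. The state strings and the helper name `export_cas_bytes` are assumptions for illustration; check them against the pycaprio version you use.

```python
# States in which, per the thread, no annotation CAS exists yet.
# The exact strings are an assumption; compare with pycaprio's AnnotationState.
NOT_STARTED_STATES = {"NEW", "IGNORE"}


def export_cas_bytes(api, project_id, document_id, user_name, cas_format):
    """Return the annotator's CAS, or the initial CAS if they have not started."""
    entries = api.annotations(project_id, document_id)
    state = next(
        (e.annotation_state for e in entries if e.user_name == user_name),
        None,  # no entry at all: the annotator never opened the document
    )
    if state is None or state in NOT_STARTED_STATES:
        # Initial CAS: exported through the "download document" call.
        return api.document(project_id, document_id, document_format=cas_format)
    # Annotation CAS: the private copy belonging to this annotator.
    return api.annotation(
        project_id, document_id, user_name, annotation_format=cas_format
    )
```

With a real client this might be called as `export_cas_bytes(inceptionClient.api, currentProjectId, doc, "some_user", InceptionFormat.UIMA_CAS_XMI_XML_1_0)`, where `"some_user"` is a placeholder. To export only finished work, add a check on the state before the final call.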

reckart commented 2 years ago

What do you mean by "changing sofa text"? Actually, the sofa text is not really changeable.

WilliamQue commented 2 years ago

I've tested the following successfully.

dCasObjTypes['document'] = 'de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData'
...

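typeSystem, cas_template = getAnnoDetails(currentProjectId, templateDocId)
# Strip the template's document metadata, sentences and tokens.
# list(...) takes a snapshot so that removing does not disturb the iteration.
for type_key in ['document', 'sentence', 'token']:
    for tag in list(cas_template.select(dCasObjTypes[type_key])):
        cas_template.remove(tag)

# Round-trip through XMI to get an independent copy, then set the new text.
cas = cassis.load_cas_from_xmi(cas_template.to_xmi(), typesystem=typeSystem)
cas.sofa_string = '\n'.join(t['text'])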

...
cas.add(some_annotations)
...

with io.StringIO(cas.to_xmi()) as sio:
    new_document = inceptionClient.api.create_document(currentProjectId, docName, sio, document_format=InceptionFormat.UIMA_CAS_XMI_XML_1_0, document_state=DocumentState.ANNOTATION_IN_PROGRESS)

reckart commented 2 years ago

Are you trying to point out that cassis allows changing the sofa string?

WilliamQue commented 2 years ago

@reckart The sofa string can be changed; I've tested it.

reckart commented 2 years ago

It is a peculiarity of the cassis implementation. In Python, it's difficult to prevent things effectively anyway. You can change the sofa string in Python, but not e.g. in the UIMA Java implementation. Also, INCEpTION will reject the import if you e.g. try to import annotations onto a document and the sofa string in the document differs from the one in the annotation XMI. But if it helps you that the sofa string can be changed in cassis, it is probably a good thing.
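Since a differing sofa string leads to a rejected import, it may help to compare the two texts locally before uploading. The sketch below uses only plain strings; the helper name `first_sofa_mismatch` is hypothetical, and the inputs could come from `cas.sofa_string` in cassis as used earlier in this thread.

```python
def first_sofa_mismatch(source_text, annotation_text):
    """Return the index of the first differing character, or None if equal."""
    if source_text == annotation_text:
        return None
    for index, (left, right) in enumerate(zip(source_text, annotation_text)):
        if left != right:
            return index
    # No difference in the shared part: one text is a prefix of the other.
    return min(len(source_text), len(annotation_text))
```

A result of `None` means the texts are identical. Any integer points at the first offset to inspect, which is often a changed line ending or a trailing newline.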

reckart commented 2 years ago

I believe all open questions have been resolved - closing.