inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.
https://inception-project.github.io
Apache License 2.0

Cannot export texts containing certain characters as UIMA CAS XMI #4058

Closed nschneid closed 1 year ago

nschneid commented 1 year ago

Describe the bug

Project backup (xmi-xml1.1) Unexpected error during project export: SAXParseException: Trying to serialize non-XML 1.1 character: 0x0 at offset 5 in string starting with PK

To Reproduce

Project Settings > Export > Backup export with Secondary format: UIMA CAS XMI (XML 1.1)

Expected behavior

No response

Screenshots

No response

Environment

Version and build ID: INCEpTION -- 28.1 (2023-05-26 16:54:12, build 867bcf14)
Operating system: macOS 13.3.1 (a)
Java: openjdk version "11.0.19" 2023-04-18
Browser: Firefox 114.0

Additional context

CAS Doctor doesn't show anything suspicious.

reckart commented 1 year ago

The XML standard does not allow certain characters to be part of an XML document. While the XML 1.1 standard allows more characters than the XML 1.0 standard, some characters are still forbidden even in XML 1.1.
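To make the difference between the two versions concrete, here is a minimal sketch based on the `Char` productions of the XML 1.0 and XML 1.1 specifications. It is not the check UIMA performs internally. The helper names `is_xml_char` and `first_invalid` and the sample string (the first bytes of a typical ZIP local file header) are assumptions for illustration.

```python
def is_xml_char(cp, version="1.0"):
    """Return True if code point cp matches the Char production of that XML version."""
    if version == "1.0":
        return (
            cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )
    if version == "1.1":
        # XML 1.1 admits the C0 control characters except NUL.
        return (
            0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF
        )
    raise ValueError("unknown XML version: " + version)


def first_invalid(text, version="1.0"):
    """Return (offset, code point) of the first forbidden character, or None."""
    for offset, ch in enumerate(text):
        if not is_xml_char(ord(ch), version):
            return offset, ord(ch)
    return None


# Hypothetical sample: "PK", the signature bytes 0x03 0x04, then 0x14 0x00.
sample = "PK\x03\x04\x14\x00"
print(first_invalid(sample, "1.0"))  # (2, 3)
print(first_invalid(sample, "1.1"))  # (5, 0)
```

For this sample, the XML 1.0 check stops at 0x3 (offset 2) and the XML 1.1 check stops at 0x0 (offset 5), which is the same pair of characters and offsets reported in the two error messages in this thread.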

nschneid commented 1 year ago

Thanks. The thing is, I can export the individual documents I have annotated to the format. I just cannot export the entire project. Any idea why this might be?

reckart commented 1 year ago

Can you provide the part of the log output that contains the stack trace and maybe a few lines before it?

nschneid commented 1 year ago

2023-06-08 15:36:22 INFO [SYSTEM] DocumentImportExportServiceImpl - Exported annotations [12628561_ootc_sotomayor.txt](2) for user [admin] from project [CuRIAM Agreement Study](0) using format [xmi]
2023-06-08 15:36:22 INFO [SYSTEM] AnnotationDocumentExporter - Exported annotation document content for user [admin] for source document [12628561_ootc_sotomayor.txt](2) in project [CuRIAM Agreement Study](0)
2023-06-08 15:36:22 ERROR [SYSTEM] BackupProjectExportTask - Unexpected error during project export
de.tudarmstadt.ukp.clarin.webanno.api.export.ProjectExportException: Project export failed
    at de.tudarmstadt.ukp.inception.project.export.ProjectExportServiceImpl.exportProjectToPath(ProjectExportServiceImpl.java:277) ~[inception-project-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.project.export.ProjectExportServiceImpl.exportProject(ProjectExportServiceImpl.java:206) ~[inception-project-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.project.export.ProjectExportServiceImpl.exportProject(ProjectExportServiceImpl.java:181) ~[inception-project-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.project.export.ProjectExportServiceImpl$$FastClassBySpringCGLIB$$fe9018a4.invoke(<generated>) ~[inception-project-export-28.1.jar!/:?]
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123) ~[spring-tx-5.3.27.jar!/:5.3.27]
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388) ~[spring-tx-5.3.27.jar!/:5.3.27]
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) ~[spring-tx-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at de.tudarmstadt.ukp.inception.project.export.ProjectExportServiceImpl$$EnhancerBySpringCGLIB$$3c74a6e7.exportProject(<generated>) ~[inception-project-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.project.export.task.backup.BackupProjectExportTask.export(BackupProjectExportTask.java:45) ~[inception-project-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.project.export.task.backup.BackupProjectExportTask.export(BackupProjectExportTask.java:31) ~[inception-project-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.project.export.task.ProjectExportTask_ImplBase.run(ProjectExportTask_ImplBase.java:103) [inception-project-export-28.1.jar!/:?]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.dkpro.core.io.xmi.XmiWriter.process(XmiWriter.java:133) ~[dkpro-core-io-xmi-asl-2.3.1.jar!/:?]
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:50) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.lambda$callProcessMethod$3(AnalysisEngineImplBase.java:669) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.withContexts(AnalysisEngineImplBase.java:688) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.callProcessMethod(AnalysisEngineImplBase.java:668) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:387) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:299) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:295) ~[uimaj-core-3.4.1.jar!/:?]
    at de.tudarmstadt.ukp.clarin.webanno.api.format.FormatSupport.write(FormatSupport.java:222) ~[inception-api-formats-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.export.DocumentImportExportServiceImpl.exportCasToFile(DocumentImportExportServiceImpl.java:572) ~[inception-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.export.DocumentImportExportServiceImpl.exportAnnotationDocument(DocumentImportExportServiceImpl.java:269) ~[inception-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.export.DocumentImportExportServiceImpl$$FastClassBySpringCGLIB$$6bf689d0.invoke(<generated>) ~[inception-export-28.1.jar!/:?]
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123) ~[spring-tx-5.3.27.jar!/:5.3.27]
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388) ~[spring-tx-5.3.27.jar!/:5.3.27]
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) ~[spring-tx-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at de.tudarmstadt.ukp.inception.export.DocumentImportExportServiceImpl$$EnhancerBySpringCGLIB$$1a7215eb.exportAnnotationDocument(<generated>) ~[inception-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.schema.exporters.AnnotationDocumentExporter.exportAdditionalFormat(AnnotationDocumentExporter.java:301) ~[inception-schema-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.schema.exporters.AnnotationDocumentExporter.exportAnnotationDocumentContents(AnnotationDocumentExporter.java:239) ~[inception-schema-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.schema.exporters.AnnotationDocumentExporter.exportData(AnnotationDocumentExporter.java:140) ~[inception-schema-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.project.export.ProjectExportServiceImpl.exportProjectToPath(ProjectExportServiceImpl.java:258) ~[inception-project-export-28.1.jar!/:?]
    ... 22 more
Caused by: org.xml.sax.SAXParseException: Trying to serialize non-XML 1.0 character: 0x3 at offset 2 in string starting with PK
    at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.checkForInvalidXmlChars(XMLSerializer.java:429) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.util.XMLSerializer$CharacterValidatingContentHandler.startElement(XMLSerializer.java:297) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiDocSerializer.startElement(XmiCasSerializer.java:1312) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiDocSerializer.writeFsOrLists(XmiCasSerializer.java:816) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiDocSerializer.writeFs(XmiCasSerializer.java:802) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.cas.impl.CasSerializerSupport$CasDocSerializer.encodeFS(CasSerializerSupport.java:1312) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.cas.impl.CasSerializerSupport$CasDocSerializer.encodeQueued(CasSerializerSupport.java:1208) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.cas.impl.XmiCasSerializer$XmiDocSerializer.writeFeatureStructures(XmiCasSerializer.java:661) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.cas.impl.CasSerializerSupport$CasDocSerializer.serialize(CasSerializerSupport.java:563) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.cas.impl.XmiCasSerializer.serialize(XmiCasSerializer.java:506) ~[uimaj-core-3.4.1.jar!/:?]
    at org.dkpro.core.io.xmi.XmiWriter.process(XmiWriter.java:124) ~[dkpro-core-io-xmi-asl-2.3.1.jar!/:?]
    at org.apache.uima.analysis_component.JCasAnnotator_ImplBase.process(JCasAnnotator_ImplBase.java:50) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.lambda$callProcessMethod$3(AnalysisEngineImplBase.java:669) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.withContexts(AnalysisEngineImplBase.java:688) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.callProcessMethod(AnalysisEngineImplBase.java:668) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:387) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:299) ~[uimaj-core-3.4.1.jar!/:?]
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:295) ~[uimaj-core-3.4.1.jar!/:?]
    at de.tudarmstadt.ukp.clarin.webanno.api.format.FormatSupport.write(FormatSupport.java:222) ~[inception-api-formats-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.export.DocumentImportExportServiceImpl.exportCasToFile(DocumentImportExportServiceImpl.java:572) ~[inception-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.export.DocumentImportExportServiceImpl.exportAnnotationDocument(DocumentImportExportServiceImpl.java:269) ~[inception-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.export.DocumentImportExportServiceImpl$$FastClassBySpringCGLIB$$6bf689d0.invoke(<generated>) ~[inception-export-28.1.jar!/:?]
    at org.springframework.cglib.proxy.MethodProxy.invoke(MethodProxy.java:218) ~[spring-core-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.invokeJoinpoint(CglibAopProxy.java:793) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.transaction.interceptor.TransactionInterceptor$1.proceedWithInvocation(TransactionInterceptor.java:123) ~[spring-tx-5.3.27.jar!/:5.3.27]
    at org.springframework.transaction.interceptor.TransactionAspectSupport.invokeWithinTransaction(TransactionAspectSupport.java:388) ~[spring-tx-5.3.27.jar!/:5.3.27]
    at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:119) ~[spring-tx-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$CglibMethodInvocation.proceed(CglibAopProxy.java:763) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:708) ~[spring-aop-5.3.27.jar!/:5.3.27]
    at de.tudarmstadt.ukp.inception.export.DocumentImportExportServiceImpl$$EnhancerBySpringCGLIB$$1a7215eb.exportAnnotationDocument(<generated>) ~[inception-export-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.schema.exporters.AnnotationDocumentExporter.exportAdditionalFormat(AnnotationDocumentExporter.java:301) ~[inception-schema-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.schema.exporters.AnnotationDocumentExporter.exportAnnotationDocumentContents(AnnotationDocumentExporter.java:239) ~[inception-schema-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.schema.exporters.AnnotationDocumentExporter.exportData(AnnotationDocumentExporter.java:140) ~[inception-schema-28.1.jar!/:?]
    at de.tudarmstadt.ukp.inception.project.export.ProjectExportServiceImpl.exportProjectToPath(ProjectExportServiceImpl.java:258) ~[inception-project-export-28.1.jar!/:?]
    ... 22 more

reckart commented 1 year ago

If you export the file 12628561_ootc_sotomayor.txt individually, e.g. from the annotation page as UIMA CAS XMI (XML 1.0), you should see the same error.

If you download the file as a plain text file and open it in a hex editor, you should see that the third byte in the data is control character 0x03.
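If no hex editor is at hand, the same inspection can be sketched in a few lines of Python. The helper names `hex_head` and `hex_head_of_file` are hypothetical, and the sample bytes are an assumed ZIP header rather than the actual file from this report.

```python
def hex_head(data, count=8):
    """Format the first `count` bytes of `data` as space-separated hex pairs."""
    return " ".join(f"{byte:02x}" for byte in data[:count])


def hex_head_of_file(path, count=8):
    """Read only the first `count` bytes of a file and format them as hex."""
    with open(path, "rb") as handle:
        return hex_head(handle.read(count), count)


# Assumed sample: the start of a typical ZIP local file header.
print(hex_head(b"PK\x03\x04\x14\x00\x00\x00"))  # 50 4b 03 04 14 00 00 00
```

In that output the third byte is `03`, the control character the XML 1.0 serializer rejects.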

nschneid commented 1 year ago

I'm able to export that file just fine in either XML 1.0 or 1.1:

2023-06-09 20:23:50 INFO [SYSTEM] DocumentImportExportServiceImpl - Exported annotations [12628561_ootc_sotomayor.txt](2) for user [admin] from project [CuRIAM Agreement Study](0) using format [xmi]
2023-06-09 20:24:36 INFO [SYSTEM] DocumentImportExportServiceImpl - Exported annotations [12628561_ootc_sotomayor.txt](2) for user [admin] from project [CuRIAM Agreement Study](0) using format [xmi-xml1.1]

reckart commented 1 year ago

That is very interesting since the code used to export the document should be the same in both instances. I wonder if you could share a project export privately with me for investigation? (Exported using "no secondary format").

Btw., does the document text actually start with PK?

nschneid commented 1 year ago

@reckart Sent you the file.

I'm not sure where the PK comes from—wondering if it means "primary key".

reckart commented 1 year ago

The PK comes from the header of a ZIP file. The project contains a 12628561_ootc_sotomayor.zip document in addition to the 12628561_ootc_sotomayor.txt. The error you see is generated when INCEpTION tries to export this ZIP file into a CAS because ZIP files are binary files and typically contain characters which are not legal XML 1.0/1.1 characters.

Removing the ZIP file from your document list fixes the problem.
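A quick way to spot such a file before importing it as plain text is to look for the ZIP signature at the start of the data. The following is a minimal sketch, not something INCEpTION does itself. `looks_like_zip` is a hypothetical helper that only checks the local file header signature `PK\x03\x04`, so it would miss an empty archive.

```python
import io
import zipfile

ZIP_LOCAL_HEADER = b"PK\x03\x04"  # signature at the start of a non-empty ZIP archive


def looks_like_zip(data):
    """Return True if the data starts with the ZIP local file header signature."""
    return data.startswith(ZIP_LOCAL_HEADER)


# Build a tiny archive in memory so the example needs no files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.txt", "some text")

print(looks_like_zip(buffer.getvalue()))        # True
print(looks_like_zip(b"Plain text document"))   # False
```

For files on disk, the standard library's `zipfile.is_zipfile(path)` performs a more thorough check.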

nschneid commented 1 year ago

Interesting. The ZIP file was an export of a file that I reimported to try to test the curation mode. Do you know why the import didn't properly unpack the ZIP file?

reckart commented 1 year ago

If you export a project as a ZIP, you need to import that project again through the project overview page, not as a document.

If you export a document as XMI, it also comes down as a ZIP, but you cannot import that ZIP back in directly. To upload an XMI file, you have to unzip the archive and upload only the .xmi file. You also have to choose the proper input format: in your case, the ZIP file was imported as Plain text rather than CAS XMI. If you had chosen CAS XMI while importing the ZIP, INCEpTION would have issued an error right away, because a ZIP file cannot be read as a CAS XMI file.
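The unzip step can be sketched with Python's standard `zipfile` module. The helper name `extract_xmi_members` and the member names in the stand-in archive are hypothetical. The actual contents of an exported ZIP may differ.

```python
import io
import zipfile


def extract_xmi_members(zip_bytes):
    """Return {member name: content} for every .xmi entry in a ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.lower().endswith(".xmi")
        }


# Stand-in for a downloaded document export (member names are assumptions).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.xmi", "<xmi:XMI/>")
    archive.writestr("TypeSystem.xml", "<typeSystemDescription/>")

members = extract_xmi_members(buffer.getvalue())
print(sorted(members))  # ['example.xmi']
```

Only the extracted `.xmi` content would then be uploaded, with the CAS XMI input format selected.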