inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.
https://inception-project.github.io
Apache License 2.0

DocumentMetaData not included when exporting features #3615

Closed giuliabaldini closed 1 year ago

giuliabaldini commented 1 year ago

Describe the bug

Hi there,

as described in #3605, we are trying to export the INCEpTION TypeSystem such that it allows UIMA subtypes, which would allow us to postprocess the data more easily. This is currently not possible, so we tried a workaround.

To Reproduce

We did the following:

  1. Created a new project
  2. Went to "Layers":
     a) Selected one layer.
     b) Export -> UIMA (all layers) -> Export.
     c) We get this TypeSystem. As you can see, there is no "DocumentMetadata" in it. I had to convert it to .txt because GitHub would not allow me to upload it.
  3. We used this TypeSystem to build our own TypeSystem. We did not delete anything from it, we just added new types inheriting from NamedEntity, hoping that we could then use this newly created TypeSystem for postprocessing. Code below.

    import cassis
    from pathlib import Path

    p = Path("bugreport")
    NE = "de.tudarmstadt.ukp.dkpro.core.api.ner.type.NamedEntity"

    # Load the type system exported from the "Layers" page
    typesystem = cassis.load_typesystem(p / "typesystem.xml")
    ner_type = typesystem.get_type(NE)

    # Add a new type "A" inheriting from NamedEntity, with a string feature "b"
    t = typesystem.create_type(name="A", supertypeName=ner_type.name)
    typesystem.create_feature(
        domainType=t, name="b", rangeType=cassis.typesystem.TYPE_NAME_STRING
    )

    # Write the modified type system for import into INCEpTION
    typesystem.to_xml(p / "ModTypeSystem.xml")

  4. We imported this TypeSystem into INCEpTION, which resulted in "A" appearing in the list of layers, with "b" as a feature.
  5. We imported a random document into INCEpTION and made some random annotations.
  6. We then exported the document (as CAS XML 1.0), which exports both the document in CAS format (text.txt, which I had to convert to .txt to upload) and the TypeSystem.txt. This TypeSystem now has the DocumentMetadata type, which is not present in the original TypeSystem.
    
    downloaded_location = p / "text"
    ts = cassis.load_typesystem(downloaded_location / "TypeSystem.xml")
    c = cassis.load_cas_from_xmi(downloaded_location / "text.xmi", ts)
    print("Works for newly created typesystem, but does not have subtypes")
    print(c, c.select(NE))

    ts_old = cassis.load_typesystem(Path("ModTypeSystem.xml"))
    print("Does not work for the original modified typesystem, which has subtypes")
    c = cassis.load_cas_from_xmi(downloaded_location / "text.xmi", ts_old)

Output

    Works for newly created typesystem, but does not have subtypes
    <cassis.cas.Cas object at 0x10c46cdf0> []
    Does not work for the original modified typesystem, which has subtypes
    Traceback (most recent call last):
    ...
    cassis.typesystem.TypeNotFoundError: Type with name [de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData] not found!



The idea was to use the TypeSystem that we created for the postprocessing, but since it seems like some elements are added to the TypeSystem at export time, this is currently not possible. 

### Expected behavior

The first exported TypeSystem should have all the needed types.

### Screenshots

_No response_

### Environment

Version and build ID: 26.1 (2022-11-29 18:43:46, build eb26b57e)
Operating system: OSX
Browser: Chrome

### Additional context

I have not tried yet, but I believe that modifying the TypeSystem _after_ downloading would probably fix it, which would make our workflow a bit less nice, but it's still possible. I am not sure if this is a bug, or if this is actually the expected behaviour, so thank you in advance anyways for your time!

Best,

Giulia & @karzideh
reckart commented 1 year ago

Ok, so here is what I did:

In your description of the process, I don't see that you actually created a document metadata layer. Did you create one?

reckart commented 1 year ago

Never mind - I should have read the report more closely. I see now that you are referring to a particular DKPro Core type that is missing.

reckart commented 1 year ago

Where did you get the type system in step 6 from (i.e. the one that contains the DKPro Core DocumentMetaData type)?

Well, if you have an XMI file that contains a DKPro Core DocumentMetaData annotation, then you'd have to copy that type definition over to the modified type system. You could do this manually or programmatically. Cf. e.g. https://cassis.readthedocs.io/en/latest/_modules/cassis/typesystem.html#merge_typesystems

reckart commented 1 year ago

When exporting the UIMA type system, INCEpTION only exports the types related to the layers defined in the project. If we did not do that, the type system would be spammed by tons of DKPro Core types (such as the one you are missing) and the type system file would be considerably larger. INCEpTION does not use most of the DKPro Core types though.

We could introduce a second export option to export a full UIMA type system that includes all types INCEpTION knows about - even the ones that it doesn't use.

giuliabaldini commented 1 year ago

> Where did you get the type system in step 6 from (i.e. the one that contains the DKPro Core DocumentMetaData type)?

[image]

I just exported this from the document view. If you press that, you can choose the format, and if you select CAS you get a TypeSystem.xml and the actual file.

> Well, if you have an XMI file that contains a DKPro Core DocumentMetaData annotation, then you'd have to copy that type definition over to the modified type system. You could do this manually or programmatically. Cf. e.g.

Yes, this would definitely be an option, and it would happen after we have downloaded the annotated documents. I was just wondering why the other TypeSystem did not have all the types.

> When exporting the UIMA type system, INCEpTION only exports the types related to the layers defined in the project. If we did not do that, the type system would be spammed by tons of DKPro Core types (such as the one you are missing) and the type system file would be considerably larger. INCEpTION does not use most of the DKPro Core types though.
>
> We could introduce a second export option to export a full UIMA type system that includes all types INCEpTION knows about - even the ones that it doesn't use.

The question is: Is the "de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData" always added when exporting a document?

reckart commented 1 year ago

It seems we have an inconsistency in the implementation here. The type system export from the layer settings only includes the layers defined in the project settings. However, when exporting via the functionality to export individual documents or even the entire project, all types that INCEpTION has access to are included - even if they are not defined as project layers.

reckart commented 1 year ago

> The question is: Is the "de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData" always added when exporting a document?

Ok, so finally coming back to this.

Yes, INCEpTION always adds de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData when exporting a document. Even if it already exists, it is overwritten by INCEpTION, e.g. setting the documentId to the document's filename.

However, this is not documented atm and it might change in the future.

reckart commented 1 year ago

I believe you should have no issue if you just copy the definition of the de.tudarmstadt.ukp.dkpro.core.api.metadata.type.DocumentMetaData over to the type system you are using to load the data in cassis. Alternatively, you could load the CAS leniently using DKPro Cassis - that would drop any annotations not in the target CAS type system.