inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.
https://inception-project.github.io
Apache License 2.0

Cannot import a UIMA CAS TypeSystem definition (more specifically, cTAKES core typesystem definition) #2868

Closed pascal-vaillant closed 2 years ago

pascal-vaillant commented 2 years ago

Describe the bug

Hello INCEpTION team! I am trying to import annotations generated by the Apache cTAKES "clinical pipeline" (an annotation system for English biomedical texts) into INCEpTION so that I can view them as a set of annotation layers. Apache cTAKES is built on UIMA and uses a UIMA CAS XML type system definition. However, I can only partly import the cTAKES XML type system description. Below is a minimal example of an annotation type that does not show up in INCEpTION.

To Reproduce

Steps to reproduce the behavior:

  1. Create a test project.
  2. Click on 'Settings', then 'Layers'.
  3. Try to import the minimal example (copied below). It defines a type 'Lemma' (with a different prefix from INCEpTION's built-in Lemma type) that inherits from uima.cas.TOP, and a type 'BaseToken' that inherits from uima.tcas.Annotation and has three features with primitive uima.cas types plus one feature that is an FSList of objects of type 'Lemma'.

Expected behavior

'BaseToken' should appear as a new layer in the list of layers. However, it does not.

XML source of the minimal example

<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">

  <name>org.apache.ctakes.typesystem.types.TypeSystem</name>
  <description>This is a Apache cTAKES Common Type System for clinical NLP. It includes general types necessary to store annotations and interface with clinical element models</description>
  <version>1.0</version>
  <vendor>Apache cTAKES</vendor>

  <types>
    <typeDescription>
      <name>org.apache.ctakes.typesystem.type.syntax.Lemma</name>
      <description>Stores a lemma (canonical form of a token).  Inherits from uima.cas.TOP, allowing for reuse of standardized forms across the CAS. 
Equivalent to cTAKES: edu.mayo.bmi.uima.core.type.Lemma</description>
      <supertypeName>uima.cas.TOP</supertypeName>
      <features>
        <featureDescription>
          <name>key</name>
          <description/>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>posTag</name>
          <description/>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
    <typeDescription>
      <name>org.apache.ctakes.typesystem.type.syntax.BaseToken</name>
      <description>A supertype for tokens subsuming word, punctuation, symbol, newline, contraction, or number.  Includes parts of speech, which are grammatical categories, e.g., noun (NN) or preposition (IN) that use Penn Treebank tags with a few additions.
Equivalent to cTAKES: edu.mayo.bmi.uima.core.type.BaseToken</description>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>tokenNumber</name>
          <description/>
          <rangeTypeName>uima.cas.Integer</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>normalizedForm</name>
          <description/>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>partOfSpeech</name>
          <description/>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>lemmaEntries</name>
          <description/>
          <rangeTypeName>uima.cas.FSList</rangeTypeName>
          <elementType>org.apache.ctakes.typesystem.type.syntax.Lemma</elementType>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>

reckart commented 2 years ago

INCEpTION does not support all of UIMA's types; for example, FSList is not supported. The whole type is probably ignored if one of its features is not supported. Could you please remove the FSList feature and try again?

pascal-vaillant commented 2 years ago

Thanks, Richard. I have tried replacing FSList with FSArray (since I noticed that INCEpTION's built-in types use FSArray to build a list of SemArgs), but the result is the same. Pascal

reckart commented 2 years ago

When I set the log level for de.tudarmstadt.ukp.clarin.webanno.api.annotation.util.TypeSystemAnalysis to TRACE, I can see this in the log:

TypeSystemAnalysis - Analyzing [org.apache.ctakes.typesystem.type.syntax.Lemma]
TypeSystemAnalysis - [org.apache.ctakes.typesystem.type.syntax.Lemma] is not an annotation type. Skipping.
TypeSystemAnalysis - Analyzing [org.apache.ctakes.typesystem.type.syntax.BaseToken]
TypeSystemAnalysis - Unable to determine layer type for [org.apache.ctakes.typesystem.type.syntax.BaseToken]
TypeSystemAnalysis - Recognized 0 of 2 types as layers (0%)

When I remove the FSList feature from org.apache.ctakes.typesystem.type.syntax.BaseToken, the type is recognized. When I instead change the supertype of the Lemma type to uima.tcas.Annotation, BaseToken is also recognized.

Note that INCEpTION only makes use of tokens of the type de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Token. If no tokens of that type exist in the CAS, INCEpTION creates them. It also creates de.tudarmstadt.ukp.dkpro.core.api.segmentation.type.Sentence annotations if none exist.
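
The first workaround above (dropping the FSList feature before importing) can be scripted. The following is only a minimal sketch using Python's standard library, not part of INCEpTION or cTAKES: the function name `strip_unsupported_features`, the `UNSUPPORTED_RANGES` set, and the shortened `SAMPLE` descriptor are assumptions made for illustration, and the set of range types to drop is taken from this thread rather than from any official list.

```python
import xml.etree.ElementTree as ET

# Namespace used by UIMA type system descriptors (taken from the example above).
NS = "http://uima.apache.org/resourceSpecifier"

# Assumption based on this thread: features with these range types block the import.
UNSUPPORTED_RANGES = {"uima.cas.FSList", "uima.cas.FSArray"}

# Shortened version of the minimal example, for illustration only.
SAMPLE = """<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <types>
    <typeDescription>
      <name>org.apache.ctakes.typesystem.type.syntax.BaseToken</name>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>tokenNumber</name>
          <rangeTypeName>uima.cas.Integer</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>lemmaEntries</name>
          <rangeTypeName>uima.cas.FSList</rangeTypeName>
          <elementType>org.apache.ctakes.typesystem.type.syntax.Lemma</elementType>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>"""


def qname(tag: str) -> str:
    """Return the namespace-qualified tag name that ElementTree expects."""
    return f"{{{NS}}}{tag}"


def strip_unsupported_features(xml_text: str) -> str:
    """Return the descriptor with every FSList/FSArray feature removed."""
    # Serialize with the UIMA namespace as the default one (no "ns0:" prefixes).
    ET.register_namespace("", NS)
    root = ET.fromstring(xml_text)
    # Materialize both iterations first so removal does not disturb traversal.
    for features in list(root.iter(qname("features"))):
        for feature in list(features.findall(qname("featureDescription"))):
            range_name = (feature.findtext(qname("rangeTypeName")) or "").strip()
            if range_name in UNSUPPORTED_RANGES:
                features.remove(feature)
    return ET.tostring(root, encoding="unicode")


cleaned = strip_unsupported_features(SAMPLE)
```

Writing `cleaned` to a file gives a descriptor without the `lemmaEntries` feature. Data files that still carry values for a removed feature would presumably need matching treatment before they can be loaded against the reduced type system.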

pascal-vaillant commented 2 years ago

Hi, yes. The point is to be able to see "tokens" and "sentences" prefixed with org.apache.ctakes.typesystem.type, not to replace those prefixed by de.tudarmstadt.ukp.dkpro.core.api.*.type. Apache cTAKES generates XMI CAS files that include annotations, but also objects that are not anchored in the source text (lemmas are of that kind, as are medical ontology concepts, for example). Annotations may then refer to those objects.

I will try to adapt the cTAKES type system to see how far I can bend it so that it imports into INCEpTION, and if I manage it I will post it here in case it is of use to other users. But perhaps this is not the right way to go: it may be better to write a script that transforms cTAKES tokens and sentences into INCEpTION tokens and sentences. In the meantime, I still do not understand why I cannot import the minimal type system XML file attached here (with "FSList" replaced by "FSArray").

Thanks a lot for your help anyway! Pascal

reckart commented 2 years ago

FSArray and FSList are not generically supported by INCEpTION. If an annotation type meets the INCEpTION conventions for a "Relation" layer, it is recognized as such, otherwise types are considered to be "Span" layers and must inherit from the UIMA Annotation type. The primitive UIMA types boolean, integer, float and string are supported as features. If a type follows the conventions for a "Link feature", it is recognized as such. When #2862 is done, then StringArray will be supported as well.
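
As a rough aid, the rules in this comment can be turned into a pre-check that is run on a descriptor before importing it. The sketch below is a deliberately simplified approximation and not INCEpTION's actual TypeSystemAnalysis logic: it only looks at the direct supertype, it does not model the "Relation" layer or "Link feature" conventions, and the names `precheck`, `PRIMITIVES` and `SAMPLE` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Prefix mapping for the UIMA descriptor namespace (taken from the example above).
NS = {"u": "http://uima.apache.org/resourceSpecifier"}

# Primitive feature ranges named in the comment above.
PRIMITIVES = {"uima.cas.Boolean", "uima.cas.Integer", "uima.cas.Float", "uima.cas.String"}
ANNOTATION = "uima.tcas.Annotation"

# Shortened version of the minimal example, for illustration only.
SAMPLE = """<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <types>
    <typeDescription>
      <name>org.apache.ctakes.typesystem.type.syntax.Lemma</name>
      <supertypeName>uima.cas.TOP</supertypeName>
      <features>
        <featureDescription>
          <name>key</name>
          <rangeTypeName>uima.cas.String</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
    <typeDescription>
      <name>org.apache.ctakes.typesystem.type.syntax.BaseToken</name>
      <supertypeName>uima.tcas.Annotation</supertypeName>
      <features>
        <featureDescription>
          <name>tokenNumber</name>
          <rangeTypeName>uima.cas.Integer</rangeTypeName>
        </featureDescription>
        <featureDescription>
          <name>lemmaEntries</name>
          <rangeTypeName>uima.cas.FSList</rangeTypeName>
        </featureDescription>
      </features>
    </typeDescription>
  </types>
</typeSystemDescription>"""


def precheck(xml_text: str) -> dict:
    """Map each type name to a list of reasons it may not import as a Span layer."""
    root = ET.fromstring(xml_text)
    report = {}
    for type_el in root.findall("u:types/u:typeDescription", NS):
        name = (type_el.findtext("u:name", "", NS) or "").strip()
        supertype = (type_el.findtext("u:supertypeName", "", NS) or "").strip()
        issues = []
        # Simplification: only the direct supertype is checked, not the full chain.
        if supertype != ANNOTATION:
            issues.append(f"supertype {supertype} is not {ANNOTATION}")
        for feat in type_el.findall("u:features/u:featureDescription", NS):
            feat_name = (feat.findtext("u:name", "", NS) or "").strip()
            range_name = (feat.findtext("u:rangeTypeName", "", NS) or "").strip()
            if range_name not in PRIMITIVES:
                issues.append(f"feature {feat_name} has range {range_name}")
        report[name] = issues
    return report


report = precheck(SAMPLE)
for type_name, issues in report.items():
    print(type_name, "->", issues or "no issues found by this heuristic")
```

A type with an empty issue list only passes this heuristic; whether INCEpTION actually recognizes it still has to be confirmed by importing it, for example with the TRACE logging shown earlier in the thread.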

reckart commented 2 years ago

You might find cassis (https://github.com/dkpro/dkpro-cassis) interesting if you want to transform cTAKES data into something more suitable for INCEpTION. It is a convenient Python library for working with XMI files.

pascal-vaillant commented 2 years ago

Thanks for the tip! Pascal
