Part-of-Speech tagging on a single letter in a word - Arabic

inception-project / inception

INCEpTION provides a semantic annotation platform offering intelligent annotation assistance and knowledge management.

https://inception-project.github.io

Apache License 2.0


Part-of-Speech tagging on a single letter in a word - Arabic #1960

Closed. Kentoseth closed this issue 3 years ago.

Kentoseth commented 3 years ago

This might not be a bug and I may have merely misunderstood the system. I have looked through the documentation and other issues and found no reference to this. The only other reference was: https://github.com/inception-project/inception/issues/1609

But that issue is about multiple POS tags on the same token. In morphologically rich languages like Arabic, you can get a sentence like this:

هَذا كِتَابُكَ

This is your book

هَذا = This
كِتَابُ = book (noun)
كَ = your (masculine) (pronoun)

Is it possible to add POS tags on a single letter? In this instance, I wish to add 2 POS tags to: كِتَابُكَ (which is a single word but the 2 tags do not belong on the same letters)

I listed the POS tags above to make it clear what the problem is.
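To make the sub-token request concrete, here is a minimal sketch that computes character offsets for the two parts of the word. It assumes offsets are counted in Unicode code points, with each diacritic counted as its own character. Whether a given annotation tool counts offsets the same way is an assumption, and the tag names NOUN and PRON are illustrative only.

```python
# The sentence is built from explicit escapes so the code points are unambiguous.
THIS = "\u0647\u064E\u0630\u0627"                    # "this"
STEM = "\u0643\u0650\u062A\u064E\u0627\u0628\u064F"  # "book" (noun stem)
SUFFIX = "\u0643\u064E"                              # "your" (pronoun suffix)
sentence = THIS + " " + STEM + SUFFIX


def sub_token_spans(text, parts):
    """Return (begin, end, tag) for each (substring, tag) pair, searching left to right."""
    spans = []
    cursor = 0
    for part, tag in parts:
        begin = text.index(part, cursor)  # raises ValueError if the part is missing
        end = begin + len(part)
        spans.append((begin, end, tag))
        cursor = end
    return spans


spans = sub_token_spans(sentence, [(STEM, "NOUN"), (SUFFIX, "PRON")])
for begin, end, tag in spans:
    print(begin, end, tag)
# prints:
# 5 12 NOUN
# 12 14 PRON
```

The two spans sit inside one whitespace-delimited word and do not overlap, which is the situation a token-level POS layer cannot represent.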

What I tried

I attempted to highlight just a single letter while under the POS layer, but it ends up highlighting the entire word. I tried other layers for this too, but the POS layer is the most appropriate for this as these are POS concepts.

reckart commented 3 years ago

The built-in POS layer can't do that. But that doesn't need to stop you.

You can define a custom span layer and configure it for character-level granularity if you want to assign tags at the sub-token level. You can also change the overlap setting for the layer to allow stacking if you need to assign multiple tags at the same position.

Note:

Curation, agreement and similar features do not handle stacking well: if you stack annotations, they will largely ignore them. So if you can avoid stacking and stick with non-stacking sub-token annotations, you're good.
Also, if you define your own span layers and want to export them later, you need to use a format that supports custom layers (e.g. UIMA XMI or WebAnno TSV).
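As a rough illustration of the stacking caveat, the sketch below flags annotations that share a position. It assumes "stacking" means two or more spans on the same layer with identical begin and end offsets. The `Span` class and `find_stacked` helper are hypothetical and not part of INCEpTION.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical character-level annotation on one layer."""
    begin: int
    end: int
    label: str


def find_stacked(spans):
    """Group spans that share exactly the same begin and end offsets."""
    by_position = defaultdict(list)
    for span in spans:
        by_position[(span.begin, span.end)].append(span)
    # Keep only positions that carry more than one annotation.
    return {pos: group for pos, group in by_position.items() if len(group) > 1}


annotations = [
    Span(5, 12, "NOUN"),
    Span(12, 14, "PRON"),
    Span(12, 14, "SUFFIX"),  # second annotation at the same position: stacked
]
stacked = find_stacked(annotations)
print(sorted(stacked))  # prints [(12, 14)]
```

A check like this could be run over exported annotations before curation to see how much of the data would be affected.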

Kentoseth commented 3 years ago

Thank you for the quick reply!

I went with the first option: I created a custom span layer and configured it for character-level granularity. I did the same for the Dependency layer. I ran into one problem though. Please see the screenshot below:

[Screenshot: Annotation-bug]

The dependency label is covering up the labels of the PoS layer. Is there a way to adjust the visual position of the layers' labels so that they don't overlap each other?

My second question is in regards to:

[Screenshot: Corpus]

For reference to this, you can see here

The Quran corpus has a morphology section below the PoS/Dependency graphs. Although I don't want the exact same visual layout, I would like to add greater morphology details too. Which layer is most suitable for that?

reckart commented 3 years ago

Phew... so I tried character-level annotation and relations between them in Safari and in Chrome on the current 0.18.0-SNAPSHOT. In Safari, I seem to have trouble selecting the proper part of a word: the selection seems to be mirrored. E.g. if I select a character at the end of the word, it annotates at the beginning instead, and so on. In Chrome, it seems to work better. That said, in both browsers I do not seem to have the trouble with overlapping that you have...

You can add descriptions to your tagsets. But these would only show up when you edit the annotation and hover with the mouse over a particular tag in the tag dropdown.

Instead of using a tagset, you could theoretically create a knowledge base and put your controlled vocabulary in there. Descriptions stored in a KB are shown on-mouse-over on the annotations:

Hm.... I guess we could consider showing tagset descriptions in the on-mouse-over popup on annotations... it is a bit of an overkill to have to resort to using a knowledge base just to get these...

reckart commented 3 years ago

Which version and which browser are you using?

reckart commented 3 years ago

As for which layer to use for morphology: you should define your own. There is a pre-defined layer for morphological information, but it only supports a single string feature. You would probably want to define your own layer with different features representing the different morphological properties (numerus / casus / ...).

Kentoseth commented 3 years ago

> In Chrome, it seems to work better. That said, in both browsers I do not seem to have the trouble with overlapping that you have...

> Which version and which browser are you using?

The browser was Firefox, and the version is the demo/test instance: INCEpTION 0.17.4

> Instead of using a tagset, you could theoretically create a knowledge base and put your controlled vocabulary in there. Descriptions stored in a KB are shown on-mouse-over on the annotations:

Was this solution for the morphology? I was confused because you mentioned another solution for the morphology below.

> You would probably want to define your own layer with different features representing the different morphological properties (numerus / casus / ...).

Should this also be a span type?

Regarding the built-in Dependency layer, it has a Feature called Flavor. What is the purpose of this feature? I could not understand what the difference between Basic/Enhanced was for these 2.

reckart commented 3 years ago

> Was this solution for the morphology? I was confused because you mentioned another solution for the morphology below.

I assume you mean my suggestion to create a custom "Morphology" span layer. You can model the different morphological features on such a layer using features of type "Primitive: String" (using tagsets) or of type "KB: Concept / Property / Instance" (linking to a knowledge base).
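The idea of one span carrying several tagset-backed string features can be sketched as below. This is only a model of the data, not INCEpTION's layer configuration: the `MorphologySpan` class, the feature names, and the tag values are all hypothetical and would be replaced by the project's own controlled vocabulary.

```python
from dataclasses import dataclass, field

# Hypothetical tagsets, one per morphological feature.
TAGSETS = {
    "number": {"singular", "dual", "plural"},
    "case": {"nominative", "accusative", "genitive"},
    "gender": {"masculine", "feminine"},
}


@dataclass
class MorphologySpan:
    """A character-level span with one string value per morphological feature."""
    begin: int
    end: int
    features: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject feature names and values that are not in the tagsets.
        for name, value in self.features.items():
            if name not in TAGSETS:
                raise ValueError(f"unknown feature: {name}")
            if value not in TAGSETS[name]:
                raise ValueError(f"{value!r} is not in the {name} tagset")


# The noun stem from the example sentence, using code-point offsets 5..12.
stem = MorphologySpan(5, 12, {"number": "singular", "case": "nominative"})
print(stem.features["case"])  # prints nominative
```

Keeping each property as its own feature, rather than packing everything into one string, lets each value be constrained by its own tagset.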

> Should this also be a span type?

A "Morphology" layer for annotating morphological features would be a span layer, indeed.

> Regarding the built-in Dependency layer, it has a Feature called Flavor. What is the purpose of this feature? I could not understand what the difference between Basic/Enhanced was for these 2.

This is borrowed from the way that the Universal Dependencies project models its dependency annotations: https://universaldependencies.org/u/overview/enhanced-syntax.html
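For readers unfamiliar with the distinction, the following sketch shows where the two kinds of relation live in a CoNLL-U token line: the basic dependency in the HEAD and DEPREL columns, the enhanced ones in the DEPS column. The example line is illustrative (the token "Mary" in "Paul and Mary are eating"), the `parse_conllu_token` helper is hypothetical, and how INCEpTION maps these onto the Flavor feature is not shown here.

```python
def parse_conllu_token(line):
    """Split one CoNLL-U token line into its form, basic and enhanced dependencies."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("a CoNLL-U token line has 10 tab-separated columns")
    basic = (int(cols[6]), cols[7])  # HEAD and DEPREL columns
    enhanced = []
    if cols[8] != "_":  # DEPS column: head:deprel pairs joined by "|"
        for pair in cols[8].split("|"):
            # Split on the first colon only, since relations may contain colons.
            head, rel = pair.split(":", 1)
            enhanced.append((head, rel))  # head stays a string (it may be like "8.1")
    return cols[1], basic, enhanced


line = "3\tMary\tMary\tPROPN\t_\t_\t1\tconj\t1:conj:and|5:nsubj\t_"
form, basic, enhanced = parse_conllu_token(line)
print(form, basic, enhanced)
# prints: Mary (1, 'conj') [('1', 'conj:and'), ('5', 'nsubj')]
```

In the basic tree each token has exactly one head, while the enhanced graph may give a token several incoming relations, here an extra subject relation to the verb.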

Kentoseth commented 3 years ago

Thank you for the guidance. I will close this issue now.