proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0
60 stars 10 forks

comprehensive linguistic annotation #105

Open osherenko opened 1 year ago

osherenko commented 1 year ago

I am developing an annotation and would like to specify, for particular words, not only lemmas and POS but also etymological or morphological information. How should I do it?

proycon commented 1 year ago

For morphological information, check the respective section in the documentation: Morphological annotation

For data on etymology there is no specific element, and I'd probably need to see an example of what such an annotation looks like in your data to give the best advice. It might be added as a higher-order feature using <feat> with one of the existing types.

osherenko commented 1 year ago

How would you annotate the morphology of the irregular verb "went"? "go" is not a part of it. BTW, there is a broken link to the API class -- https://folia.readthedocs.io/en/latest/morphological_annotation.html doesn't exist.

I want to specify etymological data within a word annotation as its definition from a dictionary, similar to the annotation of the phoneme in an utterance.

<utt xml:id="example.utt.1" src="helloworld.mp3" begintime="00:00:01.000" endtime="00:00:02.000">
    <ph>helˈoʊ wɝːld</ph>
    <w xml:id="example.utt.1.w.1" begintime="00:00:00.000" endtime="00:00:01.000">
        <ph>helˈoʊ</ph>
        <etymology>early 19th century: variant of earlier hollo; related to holla.</etymology>
    </w>
    <w xml:id="example.utt.1.w.2" begintime="00:00:01.000" endtime="00:00:02.000">
        <ph>wɝːld</ph>
    </w>
</utt>

Moreover, I want to specify linguistic information in the sentence annotation, such as dependencies and the grammatical parse. For example, the sentence annotation of "The strongest rain ever recorded in India..." (the sample at https://nlp.stanford.edu/software/lex-parser.shtml#Sample) should include the grammatical tree

(ROOT (S (S (NP (NP (DT The) (JJS strongest) (NN rain)) (VP (ADVP (RB ever)) (VBN recorded) (PP (IN in) (NP (NNP India)))))

and dependencies:

det(rain-3, The-1) amod(rain-3, strongest-2) nsubj(shut-8, rain-3) nsubj(snapped-16, rain-3) nsubj(closed-20, rain-3) nsubj(forced-23, rain-3) advmod(recorded-5, ever-4) partmod(rain-3, recorded-5) prep_in(recorded-5, India-7)

On Wed, 9 Nov 2022 at 20:50, Maarten van Gompel wrote:

For morphological information, check the respective section in the documentation: Morphological annotation https://folia.readthedocs.io/en/latest/morphological_annotation.html

For data on etymology there is no specific element, and I'd probably need to see an example of what such an annotation looks like in your data to give the best advice. It might be added as a higher-order feature using <feat> with one of the existing types.

— Reply to this email directly, view it on GitHub https://github.com/proycon/folia/issues/105#issuecomment-1309276301, or unsubscribe https://github.com/notifications/unsubscribe-auth/A4B2N3PYFLTITQVHL6FO4F3WHP56XANCNFSM6AAAAAARZQV3PU . You are receiving this because you authored the thread.Message ID: @.***>

osherenko commented 1 year ago

Probably, the etymology annotation can resemble the sense annotation https://folia.readthedocs.io/en/latest/sense_annotation.html

proycon commented 1 year ago

How would you annotate the morphology of the irregular verb "went"? "go" is not a part of it.

True, "go" would be the lemma, in the simplest annotation form:

<w>
  <t>went</t>
  <lemma class="go" />
</w>

If you want to express it in a morphological structure, you could do the following:

<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
       <t>went</t>
       <lemma class="go" />
       <pos class="V" />
    </morpheme>
  </morphology>
</w>

There's a bit of duplication here to show that you can choose on what level you want to express certain things (like pos/lemma/sense). Most of what you can express on the word level is also valid on the morpheme level.

Note: the classes belong to a user-defined set definition, FoLiA itself does not prescribe them.

There are some further examples at https://folia.readthedocs.io/en/latest/morphological_annotation.html , which show that you can also nest morphemes and associate extra features with them (such as function, or any other you invent). This is also the place where you might want to express etymology:

<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
       <t>went</t>
       <lemma class="go" />
       <pos class="V" />
       <feat subset="etymology" class="wenden" />
    </morpheme>
  </morphology>
</w>

In this case it's an extra feature on the morpheme. Rather than using the full description I opted for a shortened 'class' here; ideally some external database would provide the full definition for it (and the set of your morphology annotation would determine what that database is). But of course you may also just put the full definition in as the class.
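To illustrate how such a feature could be read back, here is a minimal sketch that uses only Python's standard xml.etree.ElementTree rather than a FoLiA library. The fragment is the example above without namespaces or IDs, and the helper name morpheme_features is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the example above; namespaces and xml:id are omitted.
FRAGMENT = """
<w>
  <t>went</t>
  <lemma class="go" />
  <morphology>
    <morpheme class="stem">
      <t>went</t>
      <lemma class="go" />
      <pos class="V" />
      <feat subset="etymology" class="wenden" />
    </morpheme>
  </morphology>
</w>
"""


def morpheme_features(word_xml, subset):
    """Return (morpheme text, feature class) pairs for one feature subset.

    Hypothetical helper for this sketch; it is not part of any FoLiA library.
    """
    word = ET.fromstring(word_xml)
    pairs = []
    for morpheme in word.iter("morpheme"):
        text = morpheme.findtext("t")
        for feat in morpheme.findall("feat"):
            if feat.get("subset") == subset:
                pairs.append((text, feat.get("class")))
    return pairs


print(morpheme_features(FRAGMENT, "etymology"))  # [('went', 'wenden')]
```

A real FoLiA document declares a namespace and annotation sets, so production code would more likely go through a FoLiA-aware library than through raw element lookups like these.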

Moreover, I want to specify linguistic information in the sentence annotation such as dependencies and the grammatical parse.

That can be accommodated in FoLiA through dependency annotation and syntactic annotation, respectively.
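As a rough sketch of what the dependency side could look like, the following turns one Stanford-style triple into a dependency element with a head (hd) and a dependent (dep), each pointing at a word through wref. It uses only the standard library; the function name to_dependency and the "<sentence>.w.<n>" word ID scheme are assumptions for illustration, and the surrounding dependencies layer, namespace, and set declaration are left out.

```python
import re
import xml.etree.ElementTree as ET

# Pattern for Stanford-style triples such as "det(rain-3, The-1)".
TRIPLE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")


def to_dependency(triple, sentence_id):
    """Build one <dependency> element from a Stanford-style triple.

    Hypothetical helper; word IDs follow an assumed "<sentence>.w.<n>" scheme.
    """
    match = TRIPLE.fullmatch(triple)
    if match is None:
        raise ValueError("not a dependency triple: %r" % triple)
    relation, head, head_index, dependent, dep_index = match.groups()
    dependency = ET.Element("dependency", {"class": relation})
    for tag, text, index in (("hd", head, head_index), ("dep", dependent, dep_index)):
        span = ET.SubElement(dependency, tag)
        word_id = "%s.w.%s" % (sentence_id, index)
        ET.SubElement(span, "wref", {"id": word_id, "t": text})
    return dependency


element = to_dependency("det(rain-3, The-1)", "example.s.1")
print(ET.tostring(element, encoding="unicode"))
```

The relation names (det, amod, nsubj, ...) would then be the classes of whatever set you declare for dependency annotation.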

BTW, there is a broken link to the API class -- https://folia.readthedocs.io/en/latest/morphological_annotation.html doesn't exist.

Oops indeed, thanks! I have corrected the mistake.

I hope this provides some more clarity?

proycon commented 1 year ago

One of the points arising from this issue is whether we want to introduce an explicit <etymology> element in FoLiA. It would be an inline annotation element very much like <sense>. We could then do <etymology class="wenden" />, which may be nicer than using a feature structure.

osherenko commented 1 year ago

An explicit etymology tag is indeed nicer. Another question is about multimodal injections in annotations similar to Example 1.7.1 in the documentation.

Can FoLiA accept multimodal injections? You spoke about OCR corrections. My annotation should annotate texts not only lexically, but also statistically, semantically, or cognitively. It should also consider properties of the original image used by the OCR to extract the texts.

Another issue is the annotation of a corpus containing several texts. In this case, each text tag should hold the source of a text, like the src attribute in the Utterance annotation, and it should be saved in the output XML file.

On Mon, 14 Nov 2022 at 14:00, Maarten van Gompel wrote:

One of the points arising from this issue is whether we want to introduce an explicit <etymology> element in FoLiA. It would be an inline annotation element very much like <sense>. We could then do <etymology class="wenden" />, which may be nicer than using a feature structure.


osherenko commented 1 year ago

It might be a good idea to use Metric annotation. For example:

<text src="...">
<metric class="charlength" value="4" />
</text>

proycon commented 1 year ago

An explicit etymology tag is indeed nicer

I'll work on adding one.

Another issue is the annotation of a corpus containing several texts.

Just do one FoLiA document per text. It's not recommended to put the whole corpus in a single FoLiA file.

Can FoLiA accept multimodal injections?

You can have multiple text layers and multiple phonological layers, yes, if that's what you mean. One limitation in FoLiA, however, is that there can only be one canonical tokenization for everything.

It might be a good idea to use Metric annotation

<text src="...">
<metric class="charlength" value="4" />
</text>

I'm not entirely sure what your exact use case is for this, but you can indeed use metric annotation for all kinds of measurements on whatever you want.
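For illustration, a measurement like the charlength metric above could be computed and attached as follows. This is a sketch with the standard library only, attaching the metric to a word rather than to the text element; the helper name add_charlength is invented here, and namespaces and set declarations are omitted.

```python
import xml.etree.ElementTree as ET


def add_charlength(word):
    """Append a <metric class="charlength"> child with the length of the word's text.

    Hypothetical helper for this sketch; not part of any FoLiA library.
    """
    text = word.findtext("t", default="")
    ET.SubElement(word, "metric", {"class": "charlength", "value": str(len(text))})
    return word


word = ET.fromstring("<w><t>went</t></w>")
add_charlength(word)
print(ET.tostring(word, encoding="unicode"))
```

The same pattern would work for any other measurement, since the class names come from your own set definition.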

It should also consider properties of the original image used by the OCR to extract the texts.

proycon commented 1 year ago

@osherenko Please check what you think of the example in the above commit d43bbc8 and the documentation in commit db5499f . That's my proposed solution for annotating etymology in FoLiA.

osherenko commented 1 year ago

The etymology example looks very nice and "set" is a great addition (can it be empty if the set is unclear?)

Could you explain why "It's not recommended to put the whole corpus in a single FoLiA file"? In that case, each FoLiA file needs a descriptive name and there are many FoLiA annotations in separate files, which can be problematic because the file names must be cross-platform. Moreover, I wonder why the Statement and Utterance annotations have the src attribute while the text tag does not. Actually, in the current implementation, I can have several doc tags in a single FoLiA file and add the src attribute to the doc tag. Unfortunately, if I save such an annotation, the src attribute is not present in the XML output.

BTW, I am particularly interested in the PyNLPl library. Do you have a dedicated mailing list for questions?

On Fri, 18 Nov 2022 at 16:04, Maarten van Gompel wrote:

@osherenko https://github.com/osherenko Please check what you think of the example in the above commit d43bbc8 https://github.com/proycon/folia/commit/d43bbc848dc83929072fc37a1672391173dc2dff and the documentation in commit db5499f https://github.com/proycon/folia/commit/db5499fa802a0c398163ea0f17c541871780aae6 . That's my proposed solution for annotating etymology in FoLiA.


proycon commented 1 year ago

The etymology example looks very nice and "set" is a great addition (can it be empty if the set is unclear?)

Yes, in that case an implicit set is assigned internally. The whole 'set' concept is at the core of FoLiA's paradigm.

Could you explain why "It's not recommended to put the whole corpus in a single FoLiA file"? In that case, each FoLiA file needs a descriptive name and there are many FoLiA annotations in separate files, which can be problematic because the file names must be cross-platform.

A corpus is often quite big; putting it all into a single XML file results in a big file which, when loaded into memory, blows up even more. FoLiA is designed as a document-based format.

Note that FoLiA does typically group all annotations into the same file, so you have one text with all its annotations in a single XML file (and multiple such files for an entire corpus). How to divide the corpus into separate texts is up to you, whatever makes most sense for your use case.
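On the concern about cross-platform file names when there is one file per text, one possible approach is to derive a conservative ASCII name from each text's title. The sketch below uses only the standard library; the function name portable_filename and the ".folia.xml" default extension are assumptions for illustration.

```python
import re
import unicodedata


def portable_filename(title, extension=".folia.xml"):
    """Derive a conservative, cross-platform file name from a text title.

    Hypothetical helper; the default extension is an assumption of this sketch.
    """
    # Decompose accented characters, then drop anything outside ASCII.
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_title = decomposed.encode("ascii", "ignore").decode("ascii")
    # Keep letters, digits, dot, hyphen and underscore; collapse the rest to "_".
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", ascii_title).strip("._")
    return (stem or "untitled") + extension


print(portable_filename("Über die Sprache: Teil 1"))  # Uber_die_Sprache_Teil_1.folia.xml
```

Two different titles can collapse to the same name under this scheme, so a real pipeline would probably append a counter or a document ID as well.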

Moreover, I wonder why the Statement and Utterance annotations have the src attribute while the text tag does not.

The src attribute is a speech attribute; it refers back to the audio/video. In a speech context, you usually have <speech> rather than <text> as the main body tag (and <speech> does allow the src attribute).

Actually, in the current implementation, I can have several doc tags in a single FoLiA file and add the src attribute to the doc tag. Unfortunately, if I save such an annotation, the src attribute is not present in the XML output.

I'm not sure what you mean here, can you show an example?

BTW, I am particularly interested in the PyNLPl library. Do you have a dedicated mailing list for questions?

You probably mean the foliapy library (it used to be part of pynlpl but was split out several years ago). You can use the issue tracker at https://github.com/proycon/foliapy for specific questions about that library.

osherenko commented 1 year ago

Actually, in the current implementation, I can have several doc tags in a single FoLiA file and add the src attribute to the doc tag. Unfortunately, if I save such an annotation, the src attribute is not present in the XML output.

I'm not sure what you mean here, can you show an example?

For example, I can instantiate a document as

doc = folia.Document(id='id', src="src")

When I store the document using

doc.save("annotation.xml")

the src is unavailable in the XML file.

BTW, I am particularly interested in the PyNLPl library. Do you have a dedicated mailing list for questions?

You probably mean the foliapy library (it used to be part of pynlpl but was split out several years ago). You can use the issue tracker at https://github.com/proycon/foliapy for specific questions about that library.

You use the library in foliafreqlist.py as

from pynlpl.statistics import FrequencyList

I am particularly interested in search algorithms in pynlpl (Chapter 7 in the documentation), for example, using regular expressions like https://nlp.stanford.edu/software/tokensregex.html.


proycon commented 1 year ago

For example, I can instantiate a document as

doc = folia.Document(id='id', src="src")

When I store the document using

doc.save("annotation.xml")

the src is unavailable in the XML file.

That's because src is not a valid attribute on a FoLiA Document as a whole.

I am particularly interested in search algorithms in pynlpl (Chapter 7 in the documentation), for example, using regular expressions like https://nlp.stanford.edu/software/tokensregex.html.

If you want something like Stanford Tokensregex then look at the FoLiA Query Language instead: https://folia.readthedocs.io/en/latest/fql.html

The search algorithms in pynlpl are very generic (and the implementation is fairly old, you'll probably find better ones elsewhere).

kosloot commented 1 year ago

I am surprised that doc = folia.Document(id='id', src="src") doesn't generate an error message.

proycon commented 1 year ago

I am surprised that doc = folia.Document(id='id', src="src") doesn't generate an error message.

True, me too.

osherenko commented 1 year ago

You probably expect an unexpected-keyword error from the constructor. Maybe such an exception is not implemented in the constructor. If you want to reproduce:

if __name__ == "__main__":
    doc = folia.Document(id='id', src="src")

If I call

text = doc.add(folia.Text, src="src")

I get an unexpected keyword error.
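The difference between the two calls is consistent with a constructor that accepts arbitrary keyword arguments without checking them. The toy class below is a generic Python illustration of how such a check could look; it is not foliapy's implementation, and the class name StrictDocument and its whitelist are invented for this sketch.

```python
class StrictDocument:
    """Toy stand-in for a document class; this is not foliapy's implementation."""

    ALLOWED = {"id", "filename"}  # hypothetical whitelist for this sketch

    def __init__(self, **kwargs):
        # Reject any keyword that is not explicitly allowed instead of ignoring it.
        unknown = sorted(set(kwargs) - self.ALLOWED)
        if unknown:
            raise TypeError("unexpected keyword argument(s): " + ", ".join(unknown))
        self.options = kwargs


try:
    StrictDocument(id="id", src="src")
except TypeError as error:
    print(error)  # unexpected keyword argument(s): src
```

With a check like this, a misplaced src would fail at construction time instead of silently disappearing when the document is saved.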

BTW, it struck me that I can place the src as a string in doc.metadata['desc'] to annotate the source of the document. Or is it better to add a Description annotation like

desc = doc.add(folia.Description, value="text source %s" % src)

On Tue, 29 Nov 2022 at 20:04, Ko van der Sloot wrote:

I am surprised that doc = folia.Document(id='id', src="src") doesn't generate an error message.


On Wed, 30 Nov 2022 at 16:41, Maarten van Gompel wrote:

I am surprised that doc = folia.Document(id='id', src="src") doesn't generate an error message.

True, me too.
