impresso / impresso-schemas

Repository of JSON schemas used in the Impresso project.
GNU Affero General Public License v3.0

schema for linguistic annotations #10

Closed: e-maud closed this issue 4 weeks ago

e-maud commented 5 years ago

Hello @pstroe, @simon-clematide, @mromanello,

Many thanks @pstroe for this schema! Here are a few comments:

simon-clematide commented 5 years ago


>   • the schema could live in another folder than newspapers, e.g. ling_annotations. In my understanding, the newspapers folder gathers schemas related to the description of that object, while additional layers sit apart (like topic_model).

I totally agree. "linguistic_annotation" makes sense to me. It is independent of the specifics of newspapers.

>   • the JSON schema is not valid JSON:
>     • there is a problem with a missing } at the end,
>     • and with the timestamp pattern. This one works and is shorter: "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$"
>   • the title of the P schema should be "part of speech" (a copy-paste leftover from the previous one)
>   • would it be possible to have an example?
>   • overall we should think about the future usage of this annotation. For internal sharing purposes, I think it does the job perfectly. However, if we want to represent spanning annotations and confidence levels, it will not be sufficient, I think. E.g. for named entities we might have several systems' outputs.
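The shorter timestamp pattern can be checked quickly in Python; `TS_PATTERN` and `is_valid_ts` are illustrative names, not part of any schema:

```python
import re

# The shorter, anchored timestamp pattern proposed above (UTC "Z" suffix).
TS_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$")

def is_valid_ts(ts: str) -> bool:
    """Return True if ts has the YYYY-MM-DDThh:mm:ssZ shape."""
    return TS_PATTERN.match(ts) is not None

print(is_valid_ts("2019-02-08T10:01:40Z"))  # True: zero-padded, with Z
print(is_valid_ts("2019-3-12T0:13:42"))     # False: unpadded fields, no Z
```

Note that the pattern only checks the shape, not calendar validity (e.g. "2019-99-99T00:00:00Z" would pass).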

I could think about this format for some basic standard types of linguistic annotations:

Regarding the problem of representing different systems' outputs: we tried several things in the SPARCLING project. I think it is not related to the JSON schema definition. I would rather have the outputs of the several systems stored in separate JSON files, e.g. one file from spacy, one file from aida, etc. Representing a unified annotation layer where things might even diverge at the level of tokenization seems nightmarish to me. I would rather represent the individual systems' outputs in different files and then define procedures to maybe harmonize the annotations for the "final" indexing in the interface. Offering all the complexities of different annotations to the user will be quite a headache, and I see it as our responsibility as NLP experts to make some decisions and not just throw everything at the historians.

e-maud commented 5 years ago

> I would rather prefer to have several outputs of several systems stored in separate JSON files. […] I would rather represent the individual system's outputs in different files and then define procedures to maybe harmonize the annotations for the "final" indexing in the interface.

Sure, I agree, I was also thinking of one JSON file per system, and naturally things will be reconciled before indexing. In the case of NEs, however, I think we do not need to re-write the source text (sents/tokens/lemmas) each time, and a complete standoff annotation would be OK? E.g. for now the output of the rule-based system is in JSON, with offsets pointing into the rebuilt text. This means moving away from IOB, but gaining the capacity to store more info, e.g. title, function, nested entities, Wikidata ID, etc. (although I know IOB can store several layers).
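As a tiny illustration of the standoff idea (the text and offsets below are invented for the example, not real impresso data):

```python
# Minimal sketch of resolving a standoff NE annotation against a rebuilt
# full text. `rebuilt_text` and the offsets are illustrative only.
rebuilt_text = "Le commissaire européen David Byrne a visité Dublin."

annotation = {
    "type": "Person",
    "lOffset": 24,   # left character offset into the rebuilt text
    "rOffset": 35,   # right character offset (exclusive)
}

# The surface form is recovered by slicing, so the source text never
# needs to be re-serialized alongside each system's annotations.
surface = rebuilt_text[annotation["lOffset"]:annotation["rOffset"]]
print(surface)  # David Byrne
```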

The present schema is fine for sents/tok/lemmas, but for NEs we should perhaps think ahead about all the info we will need and the processes we will need to run.

Infos:

Processes:

Here is an excerpt of the rule-based output in JSON, used for ingestion:

{
  "id": "EXP-2002-07-16-a-i0200",
  "s3v": "null",
  "ts": "2019-02-08T10:01:40Z",
  "sys_id": "rb",
  "nes": [
    {
      "type": "Location",
      "subType": "countryName",
      "surface": "Irlande",
      "name": "Irlande",
      "lOffset": 398,
      "rOffset": 405
    },
    {
      "type": "Person",
      "surface": "David Byrne",
      "name": "David Byrne",
      "firstname": "David",
      "surname": "Byrne",
      "lOffset": 545,
      "rOffset": 556,
      "mrule": "person1_basic_0_passed",
      "confidence": "high"
    }
  ]
}

This can also be generated from IOB.
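A minimal sketch of such an IOB-to-standoff conversion (`iob_to_entities` is a hypothetical helper, assuming each token carries its character offset into the rebuilt text, and joining multi-token surfaces with single spaces as a simplification):

```python
def iob_to_entities(tokens):
    """Convert (surface, idx, iob_tag) triples into offset-based NE records.

    Tags are 'B-TYPE', 'I-TYPE' or 'O'; idx is the token's character
    offset in the rebuilt text. Illustrative only, not the actual pipeline.
    """
    entities, current = [], None
    for surface, idx, tag in tokens:
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": tag[2:], "surface": surface,
                       "lOffset": idx, "rOffset": idx + len(surface)}
        elif tag.startswith("I-") and current is not None:
            current["surface"] += " " + surface   # simplifying whitespace assumption
            current["rOffset"] = idx + len(surface)
        else:  # an "O" tag closes any open entity
            if current:
                entities.append(current)
                current = None
    if current:
        entities.append(current)
    return entities

tokens = [("David", 545, "B-Person"), ("Byrne", 551, "I-Person"), ("a", 557, "O")]
print(iob_to_entities(tokens))
# [{'type': 'Person', 'surface': 'David Byrne', 'lOffset': 545, 'rOffset': 556}]
```

Note the resulting offsets (545, 556) line up with the "David Byrne" record in the excerpt.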

What do you think? If that could help I am happy to draft a NE json schema (which we might need in anycase)
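If useful as a starting point, a first draft might look something like this (field names are taken from the rule-based excerpt above; everything here is a sketch to be discussed, not a settled schema):

```json
{
  "$schema": "http://json-schema.org/draft-06/schema#",
  "title": "Named entity annotations",
  "type": "object",
  "required": ["id", "ts", "sys_id", "nes"],
  "properties": {
    "id": { "type": "string", "description": "Canonical content item ID" },
    "ts": {
      "type": "string",
      "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$"
    },
    "sys_id": { "type": "string", "description": "Identifier of the producing system" },
    "nes": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["type", "surface", "lOffset", "rOffset"],
        "properties": {
          "type": { "type": "string" },
          "subType": { "type": "string" },
          "surface": { "type": "string" },
          "name": { "type": "string" },
          "lOffset": { "type": "integer", "minimum": 0 },
          "rOffset": { "type": "integer", "minimum": 0 },
          "confidence": { "type": "string" }
        }
      }
    }
  }
}
```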

pstroe commented 5 years ago

thanks for the comments @e-maud

{
  "ts": "2019-3-12T0:13:42",
  "id": "JDG-1901-06-25-a-i0001",
  "sents": [
    {
      "sid": 1,
      "tok": [
        { "tid": 1, "t": "BULLETIN", "l": "bulletin", "p": "NC", "idx": 0, "ner": "ORG", "iob": "B" },
        { "tid": 2, "t": "GENÈVE", "l": "GENÈVE", "p": "NPP", "idx": 9, "ner": "ORG", "iob": "I" },
        { "tid": 3, "t": ",", "l": ",", "p": "PONCT", "idx": 15,

pstroe commented 5 years ago

a token's ner info now looks different (as suggested by simon):

"tok": [
  { "tid": 1, "t": "BULLETIN", "l": "bulletin", "p": "NC", "idx": 0, "ner": "B-ORG" },

simon-clematide commented 5 years ago

We need the onsets/offsets with respect to the rebuild files.
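A quick sanity check of that requirement could look like this (the rebuilt string and token dicts are invented for the example, mirroring the idx values in the excerpt above):

```python
# Sketch: verify that token-level `idx` offsets line up with the rebuilt
# text they point into. All data here is illustrative.
rebuilt = "BULLETIN GENÈVE, ..."

tokens = [
    {"t": "BULLETIN", "idx": 0},
    {"t": "GENÈVE", "idx": 9},
    {"t": ",", "idx": 15},
]

for tok in tokens:
    start = tok["idx"]
    end = start + len(tok["t"])
    # Slicing the rebuilt text at [idx, idx + len) must reproduce the token.
    assert rebuilt[start:end] == tok["t"], tok

print("all offsets consistent with the rebuilt text")
```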