impresso / impresso-schemas

Repository of JSON schemas used in the Impresso project.
GNU Affero General Public License v3.0

schema for linguistic annotations #10

Closed: e-maud closed this issue 4 weeks ago

e-maud commented 5 years ago

Hello @pstroe, @simon-clematide, @mromanello,

Many thanks @pstroe for this schema! Here are a few comments:

simon-clematide commented 5 years ago


>   • the schema could live in another folder than newspapers, e.g. ling_annotations. In my understanding, the newspapers folder gathers schemas related to the description of that object, while additional layers sit apart (like topic_model).

I totally agree. "linguistic_annotation" makes sense to me. It is independent of the specifics of newspapers.

>   • the JSON schema is not valid JSON:
>     • there is a problem with a missing } at the end,
>     • and with the timestamp pattern. This one works and is shorter: "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$"
>   • the title of the P schema should be "part of speech" (a copy-paste leftover from the previous one)
>   • would it be possible to have an example?
>   • overall we should think about the future usage of this annotation. For internal sharing purposes, I think it does the job perfectly. However, if we want to represent spanning annotations and confidence levels, it will not be sufficient, I think. E.g. for named entities we might have several systems' outputs.
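The shorter timestamp pattern can be checked quickly in Python; `TS_PATTERN` and `is_valid_ts` are illustrative names, not part of any schema:

```python
import re

# The shorter, anchored timestamp pattern proposed above (UTC "Z" suffix).
TS_PATTERN = re.compile(r"^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$")

def is_valid_ts(ts: str) -> bool:
    """Return True if ts has the YYYY-MM-DDThh:mm:ssZ shape."""
    return TS_PATTERN.match(ts) is not None

print(is_valid_ts("2019-02-08T10:01:40Z"))  # True: zero-padded, with Z
print(is_valid_ts("2019-3-12T0:13:42"))     # False: unpadded fields, no Z
```

Note that the pattern only checks the shape, not calendar validity (e.g. "2019-99-99T00:00:00Z" would pass).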

I could think about this format for some basic standard types of linguistic annotations:

Regarding the problem of representing different systems' outputs: we tried several things in the SPARCLING project. I think it is not related to the JSON schema definition. I would rather have the outputs of the several systems stored in separate JSON files, e.g. one file from spacy, one file from aida, etc. Representing a unified annotation layer where things might even diverge at the level of tokenization seems nightmarish to me. I would rather represent the individual systems' outputs in different files and then define procedures to maybe harmonize the annotations for the "final" indexing in the interface. Offering all the complexities of different annotations to the user will be quite a headache, and I see it as our responsibility as NLP experts to make some decisions and not just throw everything at the historians.

e-maud commented 5 years ago

> I would rather prefer to have several outputs of several systems stored in separate JSON files. […] I would rather represent the individual system's outputs in different files and then define procedures to maybe harmonize the annotations for the "final" indexing in the interface.

Sure, I agree, I was also thinking of one JSON file per system, and naturally things will be reconciled before indexing. In the case of NEs, however, I think we do not need to re-write the source text (sents/tokens/lemmas) each time, and a complete standoff annotation would be OK? E.g. for now the output of the rule-based system is in JSON, with offsets pointing into the rebuilt text. This means moving away from IOB, but gaining the capacity to store more info, e.g. title, function, nested entities, Wikidata ID, etc. (although I know IOB can store several layers).
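As a tiny illustration of the standoff idea (the text and offsets below are invented for the example, not real impresso data):

```python
# Minimal sketch of resolving a standoff NE annotation against a rebuilt
# full text. `rebuilt_text` and the offsets are illustrative only.
rebuilt_text = "Le commissaire européen David Byrne a visité Dublin."

annotation = {
    "type": "Person",
    "lOffset": 24,   # left character offset into the rebuilt text
    "rOffset": 35,   # right character offset (exclusive)
}

# The surface form is recovered by slicing, so the source text never
# needs to be re-serialized alongside each system's annotations.
surface = rebuilt_text[annotation["lOffset"]:annotation["rOffset"]]
print(surface)  # David Byrne
```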

The present schema is fine for sents/tok/lemmas, but for NEs we should perhaps think ahead about all the info we will need and the processes we will need to run.

Infos:

Processes:

Here is an excerpt of the rule-based output in JSON, used for ingestion:

{
  "id": "EXP-2002-07-16-a-i0200",
  "s3v": "null",
  "ts": "2019-02-08T10:01:40Z",
  "sys_id": "rb",
  "nes": [
    {
      "type": "Location",
      "subType": "countryName",
      "surface": "Irlande",
      "name": "Irlande",
      "lOffset": 398,
      "rOffset": 405
    },
    {
      "type": "Person",
      "surface": "David Byrne",
      "name": "David Byrne",
      "firstname": "David",
      "surname": "Byrne",
      "lOffset": 545,
      "rOffset": 556,
      "mrule": "person1_basic_0_passed",
      "confidence": "high"
    }
  ]
}

This can also be generated from IOB.
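A minimal sketch of such an IOB-to-standoff conversion (`iob_to_entities` is a hypothetical helper, assuming each token carries its character offset into the rebuilt text, and joining multi-token surfaces with single spaces as a simplification):

```python
def iob_to_entities(tokens):
    """Convert (surface, idx, iob_tag) triples into offset-based NE records.

    Tags are 'B-TYPE', 'I-TYPE' or 'O'; idx is the token's character
    offset in the rebuilt text. Illustrative only, not the actual pipeline.
    """
    entities, current = [], None
    for surface, idx, tag in tokens:
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": tag[2:], "surface": surface,
                       "lOffset": idx, "rOffset": idx + len(surface)}
        elif tag.startswith("I-") and current is not None:
            current["surface"] += " " + surface   # simplifying whitespace assumption
            current["rOffset"] = idx + len(surface)
        else:  # an "O" tag closes any open entity
            if current:
                entities.append(current)
                current = None
    if current:
        entities.append(current)
    return entities

tokens = [("David", 545, "B-Person"), ("Byrne", 551, "I-Person"), ("a", 557, "O")]
print(iob_to_entities(tokens))
# [{'type': 'Person', 'surface': 'David Byrne', 'lOffset': 545, 'rOffset': 556}]
```

Note the resulting offsets (545, 556) line up with the "David Byrne" record in the excerpt.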

What do you think? If that could help I am happy to draft a NE json schema (which we might need in anycase)
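If useful as a starting point, a first draft might look something like this (field names are taken from the rule-based excerpt above; everything here is a sketch to be discussed, not a settled schema):

```json
{
  "$schema": "http://json-schema.org/draft-06/schema#",
  "title": "Named entity annotations",
  "type": "object",
  "required": ["id", "ts", "sys_id", "nes"],
  "properties": {
    "id": { "type": "string", "description": "Canonical content item ID" },
    "ts": {
      "type": "string",
      "pattern": "^[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z$"
    },
    "sys_id": { "type": "string", "description": "Identifier of the producing system" },
    "nes": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["type", "surface", "lOffset", "rOffset"],
        "properties": {
          "type": { "type": "string" },
          "subType": { "type": "string" },
          "surface": { "type": "string" },
          "name": { "type": "string" },
          "lOffset": { "type": "integer", "minimum": 0 },
          "rOffset": { "type": "integer", "minimum": 0 },
          "confidence": { "type": "string" }
        }
      }
    }
  }
}
```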

pstroe commented 5 years ago

thanks for the comments @e-maud

{
  "ts": "2019-3-12T0:13:42",
  "id": "JDG-1901-06-25-a-i0001",
  "sents": [
    {
      "sid": 1,
      "tok": [
        { "tid": 1, "t": "BULLETIN", "l": "bulletin", "p": "NC", "idx": 0, "ner": "ORG", "iob": "B" },
        { "tid": 2, "t": "GENÈVE", "l": "GENÈVE", "p": "NPP", "idx": 9, "ner": "ORG", "iob": "I" },
        { "tid": 3, "t": ",", "l": ",", "p": "PONCT", "idx": 15,

pstroe commented 5 years ago

a token's ner info now looks different (as suggested by simon):

"tok": [
  { "tid": 1, "t": "BULLETIN", "l": "bulletin", "p": "NC", "idx": 0, "ner": "B-ORG" },

simon-clematide commented 5 years ago

We need the onsets/offsets with respect to the rebuild files.
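A quick sanity check of that requirement could look like this (the rebuilt string and token dicts are invented for the example, mirroring the idx values in the excerpt above):

```python
# Sketch: verify that token-level `idx` offsets line up with the rebuilt
# text they point into. All data here is illustrative.
rebuilt = "BULLETIN GENÈVE, ..."

tokens = [
    {"t": "BULLETIN", "idx": 0},
    {"t": "GENÈVE", "idx": 9},
    {"t": ",", "idx": 15},
]

for tok in tokens:
    start = tok["idx"]
    end = start + len(tok["t"])
    # Slicing the rebuilt text at [idx, idx + len) must reproduce the token.
    assert rebuilt[start:end] == tok["t"], tok

print("all offsets consistent with the rebuilt text")
```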