nlpsandbox / nlpsandbox-schemas

OpenAPI specifications of the NLP Sandbox services
https://nlpsandbox.io
Apache License 2.0

date/location/person annotator doesn't return noteId, how should scoring be done? #93

Closed thomasyu888 closed 3 years ago

thomasyu888 commented 3 years ago

Currently the evaluation code relies on "noteId" to be part of the response.

Example (both the expected gold standard and prediction files look like this):

{
    "date_annotations": [
        {
            "noteId": 0,
            "start": 20,
            "length": 10,
            "text": "11/21/2019",
            "dateFormat": "MM/DD/YYYY"
        },
...
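
For reference, here is a minimal sketch of what "relies on noteId" means in practice: annotations are grouped per note before gold and predicted spans are compared. The helper names and tuple layout are hypothetical, not the actual evaluation code.

from collections import defaultdict


def index_by_note(annotations):
    """Group date annotations by noteId so gold and predicted annotations
    can be compared note by note."""
    indexed = defaultdict(list)
    for ann in annotations:
        indexed[ann["noteId"]].append((ann["start"], ann["length"], ann["text"]))
    return indexed


def count_exact_matches(gold, predicted):
    """Count predictions that exactly match a gold annotation within the same note."""
    gold_by_note = index_by_note(gold)
    pred_by_note = index_by_note(predicted)
    matches = 0
    for note_id, gold_spans in gold_by_note.items():
        matches += len(set(gold_spans) & set(pred_by_note.get(note_id, [])))
    return matches

Without a noteId (or some other reference to the source note) in the response, this kind of per-note pairing is not possible from the annotator output alone.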
tschaffter commented 3 years ago

We store Annotation objects in the AnnotationStores. Here is an example of one Annotation object:

{
      "annotationSource": {
        "name": "/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note/5fb5d438a7c859d8acf9d672"
      },
      "textDateAnnotations": [
        {
          "dateFormat": "MM/DD/YYYY",
          "length": 10,
          "start": 42,
          "text": "10/26/2020"
        },
        {
          "dateFormat": "MM/DD/YYYY",
          "length": 10,
          "start": 42,
          "text": "10/26/2020"
        }
      ],
      "textPersonNameAnnotations": [
        {
          "length": 11,
          "start": 42,
          "text": "Chloe Price"
        },
        {
          "length": 11,
          "start": 42,
          "text": "Chloe Price"
        }
      ],
      "textPhysicalAddressAnnotations": [
        {
          "addressType": "city",
          "length": 11,
          "start": 42,
          "text": "Seattle"
        },
        {
          "addressType": "city",
          "length": 11,
          "start": 42,
          "text": "Seattle"
        }
      ]
    }

In this representation, the reference to the "object" annotated is stored in annotationSource. For now we annotate only clinical notes (Note schema), but in the future we may annotate different objects. The format of annotationSource is currently being finalized and is related to best practices in handling IDs and linking.

The above design follows the Google Healthcare API. One reason I adopted it is that I already wanted to remove noteId from the specific annotation objects, both to simplify the task for NLP developers and to make the system more robust. Because an annotation request takes as input a single note (and not a collection of notes), it does not make sense for the developer to have to deal with a note ID. Instead, it is the responsibility of the user/client to link the annotation prediction received to the note given as input to the annotator. Thus, the developer of an NLP tool cannot be responsible for incorrectly linking a note and the annotations extracted from it. Even if we were to ask the NLP developer to do the linking, we would have to write code that checks that the linking is correct.

In the current schemas, a Date Annotator returns an array of TextDateAnnotation objects. For example:

{
  "textDateAnnotations": [
    {
      "format": "MM/DD/YYYY",
      "length": 10,
      "start": 3,
      "text": "12/26/2020"
    },
    {
      "format": "YYYY",
      "length": 4,
      "start": 9,
      "text": "2020"
    }
  ]
}

After receiving this output, the client should create an Annotation object that effectively links the source/reference of the note and the annotator output.

{
  "annotationSource": {
    "name": "/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note/5fb5d438a7c859d8acf9d672"
  },
  "textDateAnnotations": [
    {
      "format": "MM/DD/YYYY",
      "length": 10,
      "start": 3,
      "text": "12/26/2020"
    },
    {
      "format": "YYYY",
      "length": 4,
      "start": 9,
      "text": "2020"
    }
  ]
}

This Annotation object can then be stored in an AnnotationStore.
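
As an illustration, here is a minimal client-side sketch of that linking step. build_annotation is a hypothetical helper, not part of the published client; the note name and annotator response are taken from the examples above.

def build_annotation(note_name, annotator_response):
    """Wrap an annotator response into an Annotation object that records
    which note the specific annotations were extracted from."""
    return {
        "annotationSource": {"name": note_name},
        "textDateAnnotations": annotator_response.get("textDateAnnotations", []),
    }


annotation = build_annotation(
    "/datasets/awesome-dataset/fhirStores/awesome-fhir-store/fhir/Note/5fb5d438a7c859d8acf9d672",
    {
        "textDateAnnotations": [
            {"format": "MM/DD/YYYY", "length": 10, "start": 3, "text": "12/26/2020"},
            {"format": "YYYY", "length": 4, "start": 9, "text": "2020"},
        ]
    },
)
# "annotation" can now be submitted to an AnnotationStore by the client.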

The evaluation script should take as input two arrays of Annotation objects: one array containing the gold standard annotations and the other the predictions. From my perspective, we can move forward with updating the evaluation script to support this format once we all agree on it and the format of the property annotationSource is finalized.

@thomasyu888 @yy6linda What are your thoughts?

yy6linda commented 3 years ago

Hi @tschaffter,

I think noteId is important and I am in favor of keeping it. Treating each object as a note and assigning a unique index helps us keep track of the notes. Besides, each "start" index in an annotation is associated with a specific noteId and only makes sense when we know which note the annotation refers to.

tschaffter commented 3 years ago

@thomasyu888 @yy6linda The purpose of the Annotation object is to group specific annotations like Annotation.textDateAnnotations and Annotation.textPersonNameAnnotations. The first reason is that it's convenient to have "all" the specific annotations for a resource grouped in a single object (actually the "all" part is not fully true, see below). The second reason is that this leads to fewer objects stored in the DB collection "annotations" than if we were to store the specific annotation objects individually. This in turn leads to faster queries, since the DB has to search through a smaller collection and can return results faster.

By making the contract that the specific annotations listed in an Annotation object are linked to the same source, we can specify this source once in the property Annotation.annotationSource instead of repeating it for each specific annotation object, as in:

    {
      "format": "MM/DD/YYYY",
      "length": 10,
      "start": 3,
      "text": "12/26/2020",
      "annotationSource": ".../my-awesome-note"
    },
    {
      "format": "YYYY",
      "length": 4,
      "start": 9,
      "text": "2020",
      "annotationSource": ".../my-awesome-note"
    }

which would break the DRY (Don't Repeat Yourself) principle and make the object bigger, which in turn would slow down the response time.

For the above reasons and a few others, let's go ahead with the design proposed initially. The format of the property Annotation.annotationSource is being reviewed in #99 and its implementation is tracked in #101.

Back to the scoring approach referenced in the title of this ticket:

The evaluation module of the client should expect to receive two arrays of Annotation objects. Such objects could come from a Data Node or from JSON files specified by the user. It is then up to the evaluation script to transform the Annotation objects if needed.

The first operation is probably to collapse each array of Annotation objects into an internal type of object that is easier for the evaluation script to process, as sketched below.
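
One possible collapse step flattens each Annotation into simple tuples keyed by its source. This is only a sketch: the tuple layout and function name are internal choices of the evaluation script, not something mandated by the schemas, and it handles only date annotations for brevity.

def collapse(annotations):
    """Flatten Annotation objects into (source, start, length, text) tuples
    for the date annotations, which are simpler to score."""
    flat = []
    for annotation in annotations:
        source = annotation["annotationSource"]["name"]
        for date_ann in annotation.get("textDateAnnotations") or []:
            flat.append(
                (source, date_ann["start"], date_ann["length"], date_ann["text"])
            )
    return flat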

Using the Annotation object described above as input also provides information on the type of each list of specific annotations. For example, the property Annotation.textDateAnnotations is an array of TextDateAnnotation objects. The evaluation script can check the Annotation object to see whether Annotation.textDateAnnotations is not None and has at least one item. If so, the evaluation script should compute and return the performance for the TextDateAnnotation objects.

The same check should be performed for the other types of specific annotations like Annotation.textPersonNameAnnotations and Annotation.textPhysicalAddressAnnotations.

=> If the gold standard includes at least one Annotation object that contains at least one TextDateAnnotation object, then the script should return the performance for date annotation. If, in addition, the gold standard includes at least one Annotation object that contains at least one TextPersonNameAnnotation object, then the evaluation script should also return the performance for person name annotation.
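
A hedged sketch of this type-detection logic, with illustrative names only (the actual evaluation script may organize this differently):

ANNOTATION_TYPES = [
    "textDateAnnotations",
    "textPersonNameAnnotations",
    "textPhysicalAddressAnnotations",
]


def types_to_score(gold_annotations):
    """Return the annotation types for which the gold standard contains at
    least one specific annotation in at least one Annotation object."""
    present = set()
    for annotation in gold_annotations:
        for ann_type in ANNOTATION_TYPES:
            if annotation.get(ann_type):  # not None and at least one item
                present.add(ann_type)
    return present

The evaluation script would then compute and report performance only for the types returned by this check.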

@thomasyu888 You are more familiar than @yy6linda with the data node design. Could you please write the methods required to transform the input described above into the format required by the performance evaluation script?

tschaffter commented 3 years ago

@thomasyu888 The property Annotation.annotationSource has been updated in the schemas and implemented in the data node.

tschaffter commented 3 years ago

@thomasyu888 I closed this ticket because the format of the Annotation object should now be in its final form. Further discussion on how to implement the evaluation of annotations should take place in the client repository.
