nlpsandbox / nlpsandbox-controller

NLP Sandbox CWL workflow and tools
https://nlpsandbox.io
Apache License 2.0

Support "date", "person" and "location" annotators #33

Closed thomasyu888 closed 3 years ago

thomasyu888 commented 3 years ago

The infrastructure is hard-coded to handle "date" annotator submissions right now; it needs to be configured to automatically know whether to use the "date", "person", or "location" annotator. Several tools need to be updated.

My thought is that adding a tool that creates a mapping from queue to annotator type would be ideal.
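A minimal sketch of what I mean, assuming the controller can look the submission's queue up in a static table (the queue IDs below are made-up placeholders, not real Synapse evaluation IDs):

```python
# Hypothetical sketch: map each submission queue to the annotator type it
# benchmarks. The queue IDs are placeholders, not real Synapse evaluation IDs.
QUEUE_TO_ANNOTATOR = {
    "9614652": "date",
    "9614653": "person",
    "9614654": "location",
}

def get_annotator_type(queue_id: str) -> str:
    """Return the annotator type evaluated by the given submission queue."""
    if queue_id not in QUEUE_TO_ANNOTATOR:
        raise ValueError(f"No annotator type configured for queue {queue_id}")
    return QUEUE_TO_ANNOTATOR[queue_id]
```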

tschaffter commented 3 years ago

@thomasyu888 At the level of the client, and to let the performance evaluation script know which NLP task it is generating performance metrics for, the task information should come from the Service object returned by the evaluated tool. For this purpose, I'll add a property to the Service schema that lets the client know which API the tool implements. Does that make sense?

Depends on https://github.com/nlpsandbox/nlpsandbox-schemas/issues/122
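As a rough client-side sketch of what that could enable (the property name "nlp_task" and the shape of the GET /service call are assumptions here; the actual property is being defined in the schemas issue above):

```python
import requests

# Sketch only: fetch the Service object from an evaluated tool and read the
# task it implements. The "nlp_task" property name is an assumption; the
# real property is being defined in the nlpsandbox-schemas issue above.
def get_nlp_task(api_url: str) -> str:
    service = requests.get(f"{api_url}/service").json()
    return service["nlp_task"]
```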

thomasyu888 commented 3 years ago

If we did it this way, technically we would only need one queue then?

tschaffter commented 3 years ago

I think so.

thomasyu888 commented 3 years ago

@tschaffter It could be interesting to look into simplifying the annotation POST endpoints. What I mean is that even if the Service endpoint returned the type of annotator, we would still need to map "date" -> "textDateAnnotations" and "person" -> "textPersonNameAnnotations" to make the API call. It would actually simplify the clients a lot if we just had an "/annotate" endpoint. The implementation of each API service would then differ based on the "service" the implementer specifies. This endpoint could take "oneOf" three request bodies (TextDateAnnotation, TextPersonNameAnnotation, and TextAddress....)
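As a rough sketch of how a single "/annotate" implementation could dispatch on the request body (the discriminator field and handler names here are made up for illustration, not part of the published schemas):

```python
# Illustrative only: one /annotate endpoint dispatching a "oneOf" request
# body to the matching annotator. Field and handler names are made up.
def annotate_dates(note: dict) -> list:
    return []  # a real tool would return TextDateAnnotation objects

def annotate_person_names(note: dict) -> list:
    return []  # a real tool would return TextPersonNameAnnotation objects

HANDLERS = {
    "date": annotate_dates,
    "person": annotate_person_names,
}

def annotate(request_body: dict) -> dict:
    """Dispatch a oneOf-style request body to the matching annotator."""
    annotator_type = request_body["annotator_type"]  # hypothetical discriminator
    if annotator_type not in HANDLERS:
        raise ValueError(f"Unsupported annotator type: {annotator_type}")
    return {"annotations": HANDLERS[annotator_type](request_body["note"])}
```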

That being said, I do like that we have explicit endpoints per annotation type. Thoughts?

tschaffter commented 3 years ago

There are different design questions that we need to discuss.

The first specification that I have in mind is to allow developers to create an NLP tool that performs more than one task (e.g. date annotation, person name annotation, etc.). The reason is to prevent an explosion of API boilerplate and GitHub repositories for NLP developers. If they choose to do so, developers can have one tool (API service) that supports all the PHI annotation tasks that we are going to support. To make this design possible, it's preferable to have different endpoints rather than one endpoint with different configurations. One endpoint per task allows a better separation of the tasks in the codebase.

Now, this design proposal only makes sense if we can still evaluate the performance of the tool for each task independently. Therefore, we need the tool to tell us in some way which "NLP tasks" it supports. This could be an array that takes values from an enum of the tasks that we support, specified in the Service object returned by the tool. For example:

{ "nlp_operations_supported": [ "text_date_annotation", "text_person_name_annotation" ] }

The client would then associate "text_date_annotation" with the API endpoint to call (e.g. /api/v1/textDateAnnotations).
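A minimal sketch of that client-side association, assuming the enum values and endpoint paths from the example above (the helper itself is hypothetical):

```python
# Sketch: map each supported task enum value to the endpoint to call.
# The mapping mirrors the example above; the helper is hypothetical.
TASK_TO_ENDPOINT = {
    "text_date_annotation": "/api/v1/textDateAnnotations",
    "text_person_name_annotation": "/api/v1/textPersonNameAnnotations",
}

def endpoints_for_service(service: dict) -> list:
    """Return the endpoints to call for the tasks a tool reports supporting."""
    return [
        TASK_TO_ENDPOINT[task]
        for task in service.get("nlp_operations_supported", [])
        if task in TASK_TO_ENDPOINT
    ]
```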

Having one submission queue would be really neat.

Now, when a submission is sent to a data hosting site (we should keep in mind that we will have more than one in the future), the controller needs to determine against which NLP tasks the tool must be evaluated. For example, data hosting site A may have the data to benchmark a date annotator while data hosting site B may not. So we need to allow a controller to be configured with the tasks it supports. This could be achieved with a configuration file for the controller that lists supported task enum values, similar to the JSON snippet shown above.

Taking the above into consideration, a tool submitted for evaluation would be benchmarked against the intersection of two sets of tasks: 1) the tasks supported by the submitted tool (specified in the Service object) and 2) the tasks supported by the data hosting site (specified in the controller configuration).
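As a sketch under those assumptions (the config file name and JSON key are placeholders), the selection reduces to a set intersection:

```python
import json

# Sketch: the tasks to benchmark are the intersection of the tasks the tool
# reports in its Service object and the tasks the data hosting site supports.
# The config file name and JSON key are placeholders for this example.
def tasks_to_evaluate(service: dict, config_path: str = "controller_config.json") -> set:
    with open(config_path) as f:
        site_tasks = set(json.load(f)["nlp_operations_supported"])
    tool_tasks = set(service.get("nlp_operations_supported", []))
    return tool_tasks & site_tasks
```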

How does that sound?

thomasyu888 commented 3 years ago

After further discussion, we decided that it is best to have a different queue per annotator for now.

tschaffter commented 3 years ago

We decided it's safer to not actually include the annotator type in the Service object for now.

To be more precise, the Service object returned by the tools (microservices) will soon include a list of "NLP tasks" supported so that we can programmatically identify the tasks supported by a tool. We are just not ready to rely on this information yet to decide against which tasks a tool must be evaluated.

thomasyu888 commented 3 years ago

Thanks for the clarification.

github-actions[bot] commented 2 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.