monarch-initiative / monarch-legacy

Monarch web application and API
BSD 3-Clause "New" or "Revised" License

Text Annotator for semi-structured text #834

Open cmungall opened 9 years ago

cmungall commented 9 years ago

@mellybelly to edit this description

Currently the annotator service marks up unstructured text.

We want support for semi-structured text, e.g. a TSV or Excel file (converted to TSV). The user may specify broad hints, such as a category for each of a subset of columns (e.g. col1 may contain gene symbols and would be categorized 'gene'; col2 may have disease labels and would be categorized 'disease').

The TSV could be fed to the annotator in bulk, one cell at a time, or one row at a time. If fed in bulk, the structure could be reconstituted by splitting on tab/newline.
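As a rough illustration of the bulk option, the sketch below records the character offsets of every cell by splitting on tab/newline, so that a span returned for the bulk text can be mapped back to its row and column. The function names `cell_offsets` and `locate` are hypothetical, and the sketch assumes the annotator reports spans as start/end character offsets into the submitted text, which is not confirmed here.

```python
from typing import List, Optional, Tuple

Cell = Tuple[int, int, int, int]  # (row, col, start, end) with end exclusive


def cell_offsets(tsv_text: str) -> List[Cell]:
    """Return the character offsets of every cell in a tab/newline-delimited text."""
    cells: List[Cell] = []
    pos = 0
    for row_idx, line in enumerate(tsv_text.split("\n")):
        for col_idx, cell in enumerate(line.split("\t")):
            cells.append((row_idx, col_idx, pos, pos + len(cell)))
            pos += len(cell) + 1  # step over the tab or newline delimiter
    return cells


def locate(span_start: int, span_end: int, cells: List[Cell]) -> Optional[Tuple[int, int]]:
    """Map an annotation span to its (row, col); None if it crosses a cell boundary."""
    for row, col, start, end in cells:
        if start <= span_start and span_end <= end:
            return (row, col)
    return None


# Illustrative input only; not taken from any real dataset.
sample = "BRCA1\tbreast cancer\nTP53\tLi-Fraumeni syndrome"
offsets = cell_offsets(sample)
```

A span that straddles a delimiter maps to `None`, which is one way to discard matches that only arise because neighbouring cells were concatenated.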

The first thing the user would see is the rows in which one or more columns contained labels that could not be found (particularly where no match covers the full span of the cell).
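A minimal sketch of that first view, assuming the user supplies a column-to-category mapping as described above. The `VOCAB` dictionary is a hypothetical stand-in for the annotator service (the real lookup would go through the annotator), and `full_span_match` and `rows_with_unmatched` are illustrative names, not existing Monarch functions.

```python
import csv
import io
from typing import Dict, List, Set

# Hypothetical stand-in for the annotator: category -> known labels (lower-cased).
VOCAB: Dict[str, Set[str]] = {
    "gene": {"brca1", "tp53"},
    "disease": {"breast cancer", "li-fraumeni syndrome"},
}


def full_span_match(cell: str, category: str) -> bool:
    """True if the whole cell (not just a substring) is a known label in the category."""
    return cell.strip().lower() in VOCAB.get(category, set())


def rows_with_unmatched(tsv_text: str, column_categories: Dict[int, str]) -> List[dict]:
    """List rows in which at least one categorized column has no full-span match."""
    problems = []
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row_idx, row in enumerate(reader):
        missing = [
            col
            for col, category in sorted(column_categories.items())
            if col >= len(row) or not full_span_match(row[col], category)
        ]
        if missing:
            problems.append({"row": row_idx, "unmatched_columns": missing})
    return problems


# Illustrative input only: column 0 is categorized 'gene', column 1 'disease'.
sample = "BRCA1\tbreast cancer\nXYZ9\tLi-Fraumeni syndrome\nTP53\tsome unknown thing"
report = rows_with_unmatched(sample, {0: "gene", 1: "disease"})
```

Columns without a category are ignored, so the user only needs to categorize the subset of columns they care about.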

An additional operation could be to compare this with what is in golr, e.g. to see which gene-disease associations are new.

Note: depends somewhat on https://github.com/SciGraph/SciGraph/issues/137

mellybelly commented 9 years ago

A few thoughts:

"Currently the annotator service marks up unstructured text." => add "and provides links to content within Monarch"

Also, we'd probably want some guidelines on how long a text block can be, plus specifics on any issues to avoid (e.g. if particular symbols, fonts, or formatting would screw anything up).

Love the idea of being able to say: hey, take this spreadsheet and tell me what kinds of genes or diseases I have in a column, which may be a block of free text or a short string.

jmcmurry commented 9 years ago

Great! This semi-structured use case is very similar to my prior work with Zoomage. We should chat about this when I have Internet working. Related to the curation dashboard vision as well. Best, Julie
