Generalize notes to allow external references for e.g. normalization

ghost commented 13 years ago

Provide a general framework to attach URLs and notes to any annotation. This may require some re-work of the annotation format and protocol, but should be feasible with minor modifications.

Something like:

ID    ANNOTATION_TYPE DATA

ghost commented 13 years ago

We should probably convert URLs into links if we spot them in the comments anyway, to make it more user-friendly.

spyysalo commented 13 years ago

I don't get the title of this issue.

Note support should already be working for everything except event roles using the "comment" type, e.g. as

#1    E1 AnnotatorNote     see http://www.ncbi.nlm.nih.gov/pubmed?term=12454525

I'm fine with the suggestion to add a tweak to turn obvious URLs into links, but maybe I'm missing the bigger point here..?

ghost commented 13 years ago

One of the points is that you may want to turn, let's say, PMID4711 into an actual link so that the user can access it more quickly. I guess we are essentially arguing about whether to keep a completely unstructured annotation field and then combine it with quick-links, which have somewhat more formalism to them.

spyysalo commented 13 years ago

Renamed more sensibly; revise if you disagree.

spyysalo commented 13 years ago

Based on discussion between developers, suggesting to implement this as follows:

- Define a protocol to allow "note" types to be defined in config, along with (initially) basic support for turning the text content of a note of a given type into a link, e.g. by concatenating it to a "base" URL.
- Extend the collection info client-server protocol to include info on what "note" types are defined (in addition to the basic AnnotatorNote).
- Modify the client to allow other "note" types to be filled. The client won't consider the note semantics and will just fill them as text.
- Implement a bit in the server, and corresponding extensions of the client-server protocol and UI, for making the URL corresponding to each filled "note" visible to the user.

If we're broadly in agreement, I can start this off by doing the config bit.

ghost commented 13 years ago

I have no objections, except possibly that it could be part of some greater client plug-in scheme, but that will have to wait.

spyysalo commented 13 years ago

OK, will move forward with this. On the config side, how about

[Meta]
UniProtID     Arg:Protein, <URLBase>:http://www.uniprot.org/uniprot/

Here, [Meta] would be a new section specifying meta-annotations (annotations for annotations), and the single example line would specify a meta-annotation of type UniProtID that can be attached to Proteins. <URLBase> would be a new magic value that, when specified, would be used by the server to generate links by simple concatenation with the text content of any non-empty UniProtID field.

Any meta-annotation lacking a <URLBase> (or, in the future, comparable mechanism that can be used to infer semantics) would be treated by the system as simply specifying a(nother) free-text "note" field similar to AnnotatorNotes.
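
A minimal sketch of how the server side of this proposal might behave, assuming the line layout shown above. The function names `parse_meta_line` and `link_for` are hypothetical and not part of brat, and the parsing assumes that the URL base contains no commas.

```python
def parse_meta_line(line):
    """Parse one proposed [Meta] line into (type name, attachable types, URL base)."""
    name, rest = line.split(None, 1)
    args, url_base = [], None
    for part in rest.split(","):
        # Split on the first colon only, so the colon in "http://" is preserved.
        key, _, value = part.strip().partition(":")
        if key == "Arg":
            args.append(value)
        elif key == "<URLBase>":
            url_base = value
    return name, args, url_base


def link_for(url_base, text):
    """Build a link by simple concatenation; None for free-text notes or empty fields."""
    if url_base is None or not text:
        return None
    return url_base + text


name, args, base = parse_meta_line(
    "UniProtID     Arg:Protein, <URLBase>:http://www.uniprot.org/uniprot/"
)
print(name, args, link_for(base, "P04637"))
```

A meta-annotation type without a `<URLBase>` yields `None` here, which corresponds to the plain free-text "note" case described above.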

ghost commented 13 years ago

I think (sadly) that the URLBase functionality may require some additional tweaking. I can foresee cases where concatenation won't cut it, so I'd suggest some templating instead. Something like:

UniProtID     Arg:Protein, <URLBase>:http://www.uniprot.org/uniprot/%s/q

'%s' should be chosen to be something a "real" URL won't contain.
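
As a rough illustration of the templating idea, the sketch below substitutes the identifier for the first `%s` in the template. The function name `fill_url_template` is hypothetical, and falling back to concatenation when no placeholder is present is an assumption made here to stay compatible with the earlier concatenation proposal.

```python
def fill_url_template(template, identifier, placeholder="%s"):
    """Insert identifier at the first placeholder; concatenate if there is none."""
    if placeholder not in template:
        return template + identifier
    # Plain string replacement is used instead of the % operator so that
    # percent-escapes already present in the URL (e.g. %20) are left alone.
    return template.replace(placeholder, identifier, 1)


print(fill_url_template("http://www.uniprot.org/uniprot/%s/q", "P04637"))
print(fill_url_template("http://www.uniprot.org/uniprot/", "P04637"))
```

Using `str.replace` rather than Python's `%` formatting avoids errors on templates that legitimately contain other percent sequences.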

spyysalo commented 13 years ago

Good tweak. If there are no other comments, I'll implement the config support as suggested.

spyysalo commented 13 years ago

This should ideally be implemented so that DB links can also be attached to annotations by type, without separately adding the info to each annotation of the type. For example, it should be possible to configure the system so that every annotation of type "DNA methylation" gets automatically linked to http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0006306.

Previously part of #190.
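
A minimal sketch of type-level links, assuming a simple mapping from annotation type to a fixed URL. The names `TYPE_LINKS` and `links_for_annotation` are hypothetical; only the "DNA methylation" example comes from the comment above.

```python
# Hypothetical per-type configuration: every annotation of the given type
# gets this link without any per-annotation data.
TYPE_LINKS = {
    "DNA methylation":
        "http://amigo.geneontology.org/cgi-bin/amigo/term_details?term=GO:0006306",
}


def links_for_annotation(ann_type, per_annotation_links=()):
    """Return the type-level link (if configured) followed by per-annotation links."""
    links = []
    if ann_type in TYPE_LINKS:
        links.append(TYPE_LINKS[ann_type])
    links.extend(per_annotation_links)
    return links


print(links_for_annotation("DNA methylation"))
print(links_for_annotation("Protein", ["http://www.uniprot.org/uniprot/P04637"]))
```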

ghost commented 13 years ago

Okay, we are really talking logic now; @amadanmath is going to go ballistic if we send JS to the client (not to mention that it would make me sick...). Should we make some sort of API call to convert a string tuple into a URL?

spyysalo commented 13 years ago

Just URLs should be enough. Agreed that executable JS should be kept out of the protocol.

amadanmath commented 13 years ago

What is the result of this on the client? Where are these links being displayed? Can I has an use case? へ`(^_^)'

spyysalo commented 13 years ago

The links would be shown in the span popup (where it would be possible to edit the non-fixed part), and if we get a similar popup for non-logged in users, then there also. The primary use case I'm thinking about is normalization: annotation marking, e.g. for each named entity occurrence, the ID of the entry in a DB like UniProt that identifies the entity referred to.

For example, for a text like "we study human p53 ..." the annotation would identify "p53" as an entity and assign it the normalized ID P04637. The system would then provide a configurable facility for turning that ID into the link http://www.uniprot.org/uniprot/P04637.

amadanmath commented 13 years ago

> if we get a similar popup for non-logged in users

What do you mean, if we get it? :p

(Which reminds me, I didn't put it into the offline xhtml...)

So - ID is editable, link is autogenerated from that, and the template to generate it is linked to the span type.

Let's say... span has an additional parameter; so far spans are transported as plain arrays (which kind of sucks, but... legacy), so let's say you give it to me as the next array element. I'll send it back in edit requests as id. Span type definition will have an extra field provisionally called urltemplate (change if you disagree). When both id and urltemplate are present, the url will be added to the links section of the span form.

Remaining questions:

What will be the text on the link? Will that too be configurable?
Where do I stick the id edit field in the already overcrowded form? :p Would it be okay if I stick it to the right side of the links section?
And is it limited to one link? Are you sure you will never want to have, say, two configurable links, for two different databases?

> '%s' should be chosen to be something a "real" URL won't contain.

Most browsers use the same schema (%s) to specify search query URLs. If it's good enough for Google and Mozilla, it's good enough for me.

ghost commented 13 years ago

> What will be the text on the link? Will that too be configurable?

I think we can go with just the identifier. Anything else would require us to have knowledge about the resource itself, which we certainly don't want.

> Where do I stick the id edit field in the already overcrowded form? :p Would it be okay if I stick it to the right side of the links section?

The problem is that I can easily see use-cases where annotators want to normalise towards multiple databases, so a single ID field won't cut it. Am I right regarding this @spyysalo?

> And is it limited to one link? Are you sure you will never want to have, say, two configurable links, for two different databases?

We will definitely need multiple links, but I think we can safely say that if we have a mechanism to supply one link using template X and another using template Y, we are agnostic as to whether X and Y point to the same resource or not.

> > '%s' should be chosen to be something a "real" URL won't contain.
>
> Most browsers use the same schema (%s) to specify search query URLs. If it's good enough for Google and Mozilla, it's good enough for me.

Possible Python influence. (side-track)

@amadanmath: Thanks for using and pointing out the > for replies, I'll be relying on it from now on.

spyysalo commented 13 years ago

> The problem is that I can easily see use-cases where annotators want to normalise towards multiple databases, so a single ID field won't cut it.

Normalization against multiple resources is possible, but expected to be rare. As long as we make the basic design flexible enough to do this, it might be enough to just consider the one-link-per-annotation case in the initial UI design for now.

amadanmath commented 13 years ago

> Normalization against multiple resources is possible

Ugh, then we need to sit together and rethink this...

spyysalo commented 12 years ago

Related suggestion from user:

> when you need to frequently check things, a quick link to the current article in PubMed would be helpful.

Not sure if we can support this in the same framework, though.

mlneves commented 12 years ago

Hello! Any idea whether and when this issue would be available? In our project we need to validate identifiers (associated to annotations) which can come from many databases/ontologies: EntrezGene, GO, FMA, etc.

spyysalo commented 12 years ago

This would be useful for us too, and I'd like to have it demoable at EACL (late next month). It would be helpful if you could share your ideas on how this should work, both on the UI and in the configuration / validation implementation. Any help in implementation is of course also very welcome :-)

On Fri, Mar 23, 2012 at 9:07 PM, mlneves reply@reply.github.com wrote:

> Hello! Any idea whether and when this issue would be available? In our project we need to validate identifiers (associated to annotations) which can come from many databases/ontologies: EntrezGene, GO, FMA, etc.


spyysalo commented 12 years ago

Couple of things:

- The annotation should identify both the resource (e.g. UniProt, Entrez Gene) against which normalization is provided and the identifier within that resource. A possible format would be colon-separated, like UniProt:Q9ULZ0.
- The annotation storage should probably also include some "human-readable" form of normalized IDs (e.g. also "p53" instead of just something like "UniProt:Q9ULZ0"). Although this can be recovered given knowledge of the source DB, including it would allow the annotations to be understood even when transported to a platform where the configuration providing this association is not included.
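
A small sketch of reading the colon-separated form, assuming the resource name itself never contains a colon while the identifier may. The function name `split_reference` is hypothetical.

```python
def split_reference(ref):
    """Split 'Resource:ID' on the first colon into (resource, identifier)."""
    resource, sep, identifier = ref.partition(":")
    if not sep or not resource or not identifier:
        raise ValueError("expected 'Resource:ID', got %r" % ref)
    return resource, identifier


print(split_reference("UniProt:Q9ULZ0"))
# Identifiers that contain colons themselves survive, since only the first colon splits.
print(split_reference("GO:GO:0006306"))
```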

spyysalo commented 12 years ago

Suggestion for a new line format for this:

T1<TAB>Protein 0 3<TAB>p53
N1<TAB>Reference T1 UniProt:P04637<TAB>Cellular tumor antigen p53

The T1 line here is just a standard text-bound annotation; the N1 line is the new one.

There are two things this specific format permits:

- As the external ID (P04637 here) is the last thing in the second TAB-separated field, this can, if necessary, be allowed to contain any character other than TAB or newline. This puts minimum restrictions on the format of external IDs.
- The "human-readable" third TAB-separated field allows the UI to display something, well, human-readable even in cases where the annotations are transferred to a setup where the ID-term mapping is not configured.
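
The proposed N line can be read with a few lines of Python. This is a sketch under stated assumptions, not brat's actual parser: `<TAB>` in the example stands for a literal tab character, the names `Normalization` and `parse_normalization_line` are hypothetical, and the third field is treated as optional.

```python
from typing import NamedTuple, Optional


class Normalization(NamedTuple):
    ann_id: str          # e.g. "N1"
    ann_type: str        # e.g. "Reference"
    target: str          # annotation being normalized, e.g. "T1"
    resource: str        # e.g. "UniProt"
    external_id: str     # e.g. "P04637"; may contain spaces or colons
    text: Optional[str]  # human-readable form, if present


def parse_normalization_line(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least two TAB-separated fields")
    # At most two splits, so everything after the target stays in one piece.
    ann_type, target, ref = fields[1].split(" ", 2)
    resource, _, external_id = ref.partition(":")
    text = fields[2] if len(fields) > 2 else None
    return Normalization(fields[0], ann_type, target, resource, external_id, text)


print(parse_normalization_line(
    "N1\tReference T1 UniProt:P04637\tCellular tumor antigen p53"
))
```

Because the external ID is the tail of the second field, limiting the split to two spaces is what lets it contain any character other than TAB or newline.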

Comments?

ghost commented 12 years ago

I am supporting it, I can't see any obvious way to break it and it should be enough to cover any identifier (let's just hope no one is daft enough to go for tuples...).

mlneves commented 12 years ago

Sorry for the delay, I was on vacation.

The reference looks nice, although the last field (human readable something) will not be mandatory, right?

I guess you will allow specifying more than one ID for an entity; in this case we would have N1 and N2 both referring to T1, right?

Regarding the GUI, if someone wants to assign/check/change the ID, i.e. choose a new concept in the terminology/ontology, how would it be shown? It would be nice to have the ontology shown as a tree, from which one can select one (or more) concepts (IDs). But it would be silly to convert the ontology to some Brat format; is there a way to integrate some available OBO/OWL viewer into Brat? It might need to be more than just a viewer, though, as some data will have to be passed from one tool to the other.

In fact, the annotation schema could be represented as an ontology. For the CellFinder project we have a huge ontology and it would be nice to have it as the annotation schema. Instead of choosing "gene", one would choose a specific gene in (let's say) EntrezGene, just as one would choose kidney in FMA instead of just "organ", or a biological process in GO for an event. This is probably not practical for annotating a text, but really helpful for validating text mining output (curation). For the annotation, suggesting a term in an ontology (some simple exact/approximate matching) would be a plus.

By the way, when I talked to Florian Leitner at the BioCuration conference, he told me he implemented his annotation schema as a tree structure in Brat. I didn't know it was possible; is it described on the Brat website? It should certainly be.

On 04/18/2012 05:26 AM, spyysalo wrote:

> Suggestion for a new line format for this:
>
> T1<TAB>Protein 0 3<TAB>p53
> N1<TAB>Reference T1 UniProt:P04637<TAB>Cellular tumor antigen p53
>
> The T1 line here is just a standard text-bound annotation; the N1 line is the new one.
>
> There are two things this specific format permits:
>
> As the external ID (P04637 here) is the last thing in the second TAB-separated field, this can, if necessary, be allowed to contain any character other than TAB or newline. This puts minimum restrictions on the format of external IDs.
>
> The "human-readable" third TAB-separated field allows the UI to display something, well, human-readable even in cases where the annotations are transferred to a setup where the ID-term mapping is not configured.
>
> Comments?


spyysalo commented 12 years ago

@mlneves: thanks for the response, no problem about the time :-)

A couple of responses in brief:

Yes, we can make the "human-readable" string field optional. (Out of curiosity: do you have a use case where you would prefer not to need to provide this?)
We're currently thinking of having the following options for entering normalizations (both of these require an ID-human-readable-string mapping to be stored on the brat server):
- look up ID using some external service (e.g. Uniprot or FMA Explorer), fill in ID in text field (first very rough implementation committed yesterday, please let me know if you're interested in testing this!)
- enter "human-readable" string, click search, and pick the ID from a list.
We've considered ontology browsing support in some tree-like system, but as you note, including full support for this in brat is far from trivial (if you know of an open-source web-based OBO/OWL viewer we could integrate, let us know!). Similarly, mechanisms allowing direct "import" of a selection made in an external tool are not technically straightforward to implement.
The annotation schema can be ordered as a tree; this has been included as a feature from early on and is now documented at http://brat.nlplab.org/configuration.html#annotation-configuration . Sorry we didn't inform you about this feature earlier!
Integrating a full ontology of Entrez Gene / FMA size into the brat annotation dialog is not feasible with the current implementation (the client always knows the full set of types) and would require the dialog to be made dynamic. (This basically comes back to implementing a full ontology browser.) That said, we will integrate the brat entity/event type system with the normalization system, so that you can specify e.g. that gene type annotations should be normalized to Entrez Gene and organ type annotations to FMA.
For type tree config, see the link above.
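
To illustrate the second entry option above (enter a "human-readable" string, search, pick the ID from a list), here is a minimal sketch assuming the server holds the ID-to-string mapping in memory. The names `NAMES` and `search_by_name` are hypothetical, and the `ExampleDB` entry is a made-up placeholder rather than a real database record.

```python
# Hypothetical ID -> human-readable string mapping stored on the server.
NAMES = {
    "UniProt:P04637": "Cellular tumor antigen p53",
    "ExampleDB:0001": "placeholder kinase entry",  # made-up entry for illustration
}


def search_by_name(query, names):
    """Return (ID, name) pairs whose name contains the query, ignoring case."""
    q = query.casefold()
    return sorted(
        (ref, name) for ref, name in names.items() if q in name.casefold()
    )


for ref, name in search_by_name("P53", NAMES):
    print(ref, name)
```

Simple substring matching is only one possible choice; approximate matching, as suggested earlier in the thread, would need a different lookup.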

spyysalo commented 12 years ago

The implementation in 03a4137eb50707920d2d03f909b4e5e5d0ed3c88 still has a lot of rough spots and missing support, but the basic stuff is there both on client and server. Open more specific issues for what remains.

nlplab / brat

Generalize notes to allow external references for e.g. normalization #324