Closed jpfairbanks closed 5 years ago
Per markdown file is a good scope to avoid collisions.
I've collected all the markdown files from the chapters section of the epirecipes cookbook and run them through Eidos using the CoreNLPParser. The outputs (automatic annotations) for each corresponding file have been saved as JSON files. Attached is example output. I think the output is actually pretty encouraging.
Also, I want to note that there are separate tasks involved here: recognizing entities, recognizing what portion of the text represents the definition of an entity, and recognizing properties of those entities once they are successfully matched. I don't think this is something we can do with Eidos within 8 hours.
From what I'm seeing, the Eidos framework loads supervised models when calling the different processors.
I'm looking into writing rules based on the original formulation above though.
This looks great! I agree that the output is encouraging. I think the rule Christine had in mind was approximately
"concept", "concept" where the first concept is more than 4 characters and the second concept is less equal one word.
I think that is a good first order approximation of a "variable definition" rule.
From the description of Eidos that I got from @adarshp, I understood that you could write rules that take the output of other rules to define higher-level extractions.
I don't know how to translate
"concept", "concept" where the first concept is more than 4 characters and the second concept is less equal one word.
into an eidos rule, but it should be within the scope for what eidos can detect.
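For what it's worth, the length constraints themselves are easy to state in code. Here is a minimal Python sketch of the heuristic; the pair representation, the helper name, and the sample pairs are my own for illustration, not anything from Eidos or Odin:

```python
def is_definition_pair(definition: str, variable: str) -> bool:
    """Heuristic for the '"concept", "concept"' pattern: the first
    concept (the defining phrase) must be more than 4 characters,
    and the second concept (the variable) at most one word."""
    return len(definition) > 4 and len(variable.split()) <= 1

# Hypothetical candidate pairs, as an entity-recognition pass might produce them.
pairs = [
    ("the per-capita recovery rate", "gamma"),  # plausible definition
    ("rate", "of recovery per capita"),         # fails both constraints
]
print([p for p in pairs if is_definition_pair(*p)])
# → [('the per-capita recovery rate', 'gamma')]
```

This is only a first-order filter over already-recognized concept pairs; the actual recognition would still come from the grammar rules.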
Also, I think it is interesting that the concept and property rules treat "at a per-capita rate $\beta I$" differently from "at a per-capita rate $\gamma$" in that sentence.
Thanks @jpfairbanks. I agree, it would be nice to be able to extract definitions like we were talking about, for filling in human-explainable generations.
So the rule-based portions of the system are defined by Odin grammars, which are documented in this manual: https://arxiv.org/pdf/1509.07513.pdf I need to read it in detail; I'm not sure yet whether Odin is being developed with definition extraction in mind, where a phrase is recognized and associated with some other phrase as its definition in the text. It sounds like Odin could support it if we find the right level of POS to use when defining the rules. Here's an example from the manual:
Hi @adarshp, can you point me to what James is referring to? I've looked at a few Odin related papers but I don't see any sort of description of using Odin to extract definitions using rules per se, just events, relationships, etc. Would you be able to comment on whether or not a system like Odin is suited to do that? I would imagine something like term definition would be discussed somewhere among other concepts like events and relationships given the generality of it.
@jpfairbanks The attached image, page 16 of the manual, example 15, is pretty close to what we want, except we want some previously recognized entity (e.g. "Concept") to be used as the trigger instead of the lemma. I've been looking over the manual, and I read at one point that a trigger needs to be a token sequence, e.g. a verb like "phosphorylation" (which could be defined as an event). If the trigger could be a previously recognized entity, then I think we could write definition extraction rules of different forms.
@jpfairbanks Eidos has finished running on all the Epirecipe markdown documents, saved as JSON structured exports. Do you want to have those committed here or no?
Yes that example is very similar to what we want. The explanation I was referring to was a verbal explanation so I don't have a document for it unfortunately.
If they are small enough, attach them to the issue. Otherwise, share them on Box.
@jpfairbanks Ah ok, got it. I'll share on Box. Also, it's actually still running, but I will post what has been annotated so far. For actually looking at examples, though, I recommend we use the web app annotation viewer; we can do that tomorrow for specific examples by putting the text into the web app.
Cool. I'll take a look and we can talk tomorrow.
Hi @scottagt, I'm going to actually redirect your question to @marcovzla, who is the author of ODIN - I haven't personally written ODIN rules, but my sense is that a rule to extract definitions is very much in the scope of ODIN.
Thanks Adarsh.
@marcovzla, we are looking at this textbook and trying to pull out some useful information from the text. One thing that should be really easy to extract is the conceptual definitions of variables that will later be used in equations and code. You can see the example at the top of this thread.
Is there some resource for generally useful rules?
Besides the ODIN manual, there is also a repo with some small examples, the ODIN Wiki and the REACH reader which is a mature project that uses ODIN rules to extract information about biochemical reactions. Perhaps these might be useful to look at?
Adarsh, thanks for this information!
It looks like we have a draft of a rule that catches the variable definitions.
- name: simple-np-def
  label: Definition
  priority: 2
  type: token
  pattern: |
    @effect:Concept ("," @cause:Concept)
The next step is to define some more rules to catch more complicated definitions.
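Following the same shape as the draft above, here is one possible additional rule for copula-style definitions ("X is/denotes/represents Y"). This is a sketch only; the rule name, lemma list, and argument assignment are my guesses and would need to be checked against the Odin manual before use:

```yaml
# Sketch only: name, lemma alternation, and argument names are
# assumptions, not a tested grammar.
- name: copula-def
  label: Definition
  priority: 2
  type: token
  pattern: |
    @cause:Concept [lemma=/^(be|denote|represent)$/] @effect:Concept
```

The cause/effect argument names mirror the draft rule so the web app visualization would keep working.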
Yes, I created that draft rule taking into account that the web app hardcodes effect/cause arguments in order to visualize whatever Definitions are extracted. And yes, I want to look at @crherlihy's list of definitions and develop at least two new rules.
So if we run Odin (specifically the ExtractAndExport application) we get text mentions that look like this:
{
"documents": {
"877918435": {
"id": "input1.txt",
"text": "This is text.\nText in a file.\nFiles are cool.\n",
"sentences": [
{
"words": [
"This",
"is",
"text",
"."
],
"startOffsets": [
0,
5,
8,
12
],
"endOffsets": [
4,
7,
12,
13
],
"raw": [
"This",
"is",
"text",
"."
],
"tags": [
"DT",
"VBZ",
"NN",
"."
],
"lemmas": [
"this",
"be",
"text",
"."
],
"entities": [
"O",
"O",
"O",
"O"
],
"norms": [
"O",
"O",
"O",
"O"
],
"chunks": [
"B-NP",
"B-VP",
"B-NP",
"O"
],
"graphs": {
"universal-enhanced": {
"edges": [
{
"source": 2,
"destination": 0,
"relation": "nsubj"
},
{
"source": 2,
"destination": 1,
"relation": "cop"
},
{
"source": 2,
"destination": 3,
"relation": "punct"
}
],
"roots": [
2
]
},
"universal-basic": {
"edges": [
{
"source": 2,
"destination": 0,
"relation": "nsubj"
},
{
"source": 2,
"destination": 1,
"relation": "cop"
},
{
"source": 2,
"destination": 3,
"relation": "punct"
}
],
"roots": [
2
]
}
}
},
{
"words": [
"Text",
"in",
"a",
"file",
"."
],
"startOffsets": [
14,
19,
22,
24,
28
],
"endOffsets": [
18,
21,
23,
28,
29
],
"raw": [
"Text",
"in",
"a",
"file",
"."
],
"tags": [
"VB",
"IN",
"DT",
"NN",
"."
],
"lemmas": [
"text",
"in",
"a",
"file",
"."
],
"entities": [
"O",
"O",
"O",
"O",
"O"
],
"norms": [
"O",
"O",
"O",
"O",
"O"
],
"chunks": [
"B-VP",
"B-PP",
"B-NP",
"I-NP",
"O"
],
"graphs": {
"universal-enhanced": {
"edges": [
{
"source": 3,
"destination": 2,
"relation": "det"
},
{
"source": 3,
"destination": 1,
"relation": "case"
},
{
"source": 0,
"destination": 3,
"relation": "nmod_in"
},
{
"source": 0,
"destination": 4,
"relation": "punct"
}
],
"roots": [
0
]
},
"universal-basic": {
"edges": [
{
"source": 3,
"destination": 2,
"relation": "det"
},
{
"source": 3,
"destination": 1,
"relation": "case"
},
{
"source": 0,
"destination": 3,
"relation": "nmod"
},
{
"source": 0,
"destination": 4,
"relation": "punct"
}
],
"roots": [
0
]
}
}
},
{
"words": [
"Files",
"are",
"cool",
"."
],
"startOffsets": [
30,
36,
40,
44
],
"endOffsets": [
35,
39,
44,
45
],
"raw": [
"Files",
"are",
"cool",
"."
],
"tags": [
"NNS",
"VBP",
"JJ",
"."
],
"lemmas": [
"file",
"be",
"cool",
"."
],
"entities": [
"O",
"O",
"O",
"O"
],
"norms": [
"O",
"O",
"O",
"O"
],
"chunks": [
"B-NP",
"B-VP",
"B-ADJP",
"O"
],
"graphs": {
"universal-enhanced": {
"edges": [
{
"source": 2,
"destination": 0,
"relation": "nsubj"
},
{
"source": 2,
"destination": 1,
"relation": "cop"
},
{
"source": 2,
"destination": 3,
"relation": "punct"
}
],
"roots": [
2
]
},
"universal-basic": {
"edges": [
{
"source": 2,
"destination": 0,
"relation": "nsubj"
},
{
"source": 2,
"destination": 1,
"relation": "cop"
},
{
"source": 2,
"destination": 3,
"relation": "punct"
}
],
"roots": [
2
]
}
}
}
]
}
},
"mentions": [
{
"type": "TextBoundMention",
"id": "T:689263908",
"text": "This",
"labels": [
"Concept",
"Entity"
],
"tokenInterval": {
"start": 0,
"end": 1
},
"characterStartOffset": 0,
"characterEndOffset": 4,
"sentence": 0,
"document": "877918435",
"keep": true,
"foundBy": "simple-np"
},
{
"type": "TextBoundMention",
"id": "T:-1008405019",
"text": "is",
"labels": [
"Concept",
"Entity"
],
"tokenInterval": {
"start": 1,
"end": 2
},
"characterStartOffset": 5,
"characterEndOffset": 7,
"sentence": 0,
"document": "877918435",
"keep": true,
"foundBy": "simple-vp"
},
{
"type": "TextBoundMention",
"id": "T:2114757955",
"text": "Text in a file",
"labels": [
"Concept",
"Entity"
],
"tokenInterval": {
"start": 0,
"end": 4
},
"characterStartOffset": 14,
"characterEndOffset": 28,
"sentence": 1,
"document": "877918435",
"keep": true,
"foundBy": "simple-vp"
},
{
"type": "TextBoundMention",
"id": "T:174456323",
"text": "Files",
"labels": [
"Concept",
"Entity"
],
"tokenInterval": {
"start": 0,
"end": 1
},
"characterStartOffset": 30,
"characterEndOffset": 35,
"sentence": 2,
"document": "877918435",
"keep": true,
"foundBy": "simple-np"
},
{
"type": "TextBoundMention",
"id": "T:-1926935776",
"text": "are",
"labels": [
"Concept",
"Entity"
],
"tokenInterval": {
"start": 1,
"end": 2
},
"characterStartOffset": 36,
"characterEndOffset": 39,
"sentence": 2,
"document": "877918435",
"keep": true,
"foundBy": "simple-vp"
}
]
}
What is the ideal storage format for them?
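Whatever format we settle on, the mentions array is straightforward to post-process. A minimal Python sketch, with field names copied from the sample above (the helper name and the trimmed-down example document are mine):

```python
def summarize_mentions(export: dict) -> list:
    """Reduce an ExtractAndExport-style document to (text, foundBy)
    pairs, keeping only mentions flagged keep=true."""
    return [(m["text"], m["foundBy"])
            for m in export["mentions"]
            if m.get("keep", False)]

# A trimmed-down version of the export shown above.
export = {
    "mentions": [
        {"type": "TextBoundMention", "text": "This",
         "labels": ["Concept", "Entity"], "keep": True, "foundBy": "simple-np"},
        {"type": "TextBoundMention", "text": "is",
         "labels": ["Concept", "Entity"], "keep": True, "foundBy": "simple-vp"},
    ]
}
print(summarize_mentions(export))
# → [('This', 'simple-np'), ('is', 'simple-vp')]
```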
@scottagt Can you bring the new rules that you wrote into the new repo https://github.com/ml4ai/automates/tree/master/text_reading/src/main/resources/org/clulab/aske_automates/grammars and add them there? Then rerun it on the cookbook?
@jpfairbanks Sure thing. Do we have our own branch of that repo that we want to use as a staging area? Before making a pull request to their repo, I'd like to check with them about pushing our new rules file and where they want to put it.
Go ahead and make a fork of their master branch under your account, or push a branch to this fork that I made: https://github.com/jpfairbanks/automates
@jpfairbanks I got to investigate some new options and got more familiar with the new AutoMATES code base. I've ported the new rules into the system and made changes in a branch on our fork of their repo. I'm going to submit a pull request, and you can have a look. Attached are sample input and outputs from testing. The output is the JSON file (I had to zip it). "Definition" was found as expected. Going to close this one since it's beginning to meander. :)