jpfairbanks / SemanticModels.jl

A Julia package for representing and manipulating model semantics
MIT License
77 stars 17 forks

Run eidos to find definitions from cookbook #12

Closed jpfairbanks closed 5 years ago

jpfairbanks commented 5 years ago

Variable definitions from text

Input

Description

The susceptible-infected-recovered (SIR) model in a closed population was proposed by Kermack and McKendrick as a special case of a more general model, and forms the framework of many compartmental models. Susceptible individuals, S, are infected by infected individuals, I, at a per-capita rate βI, and infected individuals recover at a per-capita rate γ to become recovered individuals, R.
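The dynamics described above can be sketched numerically. A minimal forward-Euler illustration follows; the parameter values are assumptions chosen for illustration, not taken from the cookbook:

```python
# Minimal forward-Euler sketch of the SIR dynamics described above.
# The rates match the text: susceptibles are infected at per-capita
# rate beta*I, and infected individuals recover at per-capita rate gamma.
# Parameter values below are illustrative assumptions.

def sir_step(s, i, r, beta, gamma, dt):
    new_infections = beta * i * s * dt  # S -> I at per-capita rate beta*I
    new_recoveries = gamma * i * dt     # I -> R at per-capita rate gamma
    return s - new_infections, i + new_infections - new_recoveries, r + new_recoveries

def simulate(s=0.99, i=0.01, r=0.0, beta=0.5, gamma=0.1, dt=0.1, steps=2000):
    for _ in range(steps):
        s, i, r = sir_step(s, i, r, beta, gamma, dt)
    return s, i, r
```

Because the population is closed, S + I + R stays constant at every step.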

Output

Definitions

Properties of things

jpfairbanks commented 5 years ago

A per-markdown-file scope is good for avoiding collisions.

scottagt commented 5 years ago

I've collected all the markdown files from the chapters section of the epirecipe cookbook and run them through Eidos using the CoreNLPParser. The outputs (automatic annotations) for each corresponding file have been saved as JSON files. An example output is attached. I think the output is actually pretty encouraging.

Also, I want to note that there are separate tasks involved here: recognizing entities, recognizing what portion of the text represents the definition of an entity, and recognizing properties of those entities once they are matched. I don't think this is something we can do with Eidos within 8 hours.

From what I'm seeing of the Eidos framework, it loads supervised models when different processors are called.

I'm looking into writing rules based on the original formulation above though.

issue12_eidos_output
jpfairbanks commented 5 years ago

This looks great! I agree that the output is encouraging. I think the rule Christine had in mind was approximately

"concept", "concept" where the first concept is more than 4 characters and the second concept is at most one word.

I think that is a good first-order approximation of a "variable definition" rule.
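Before encoding it in Eidos, the heuristic can be prototyped as a plain predicate. A hedged Python sketch (the function name is hypothetical; the thresholds are just the rule stated above):

```python
def looks_like_definition(first_concept: str, second_concept: str) -> bool:
    """Heuristic from the proposed rule: the defining phrase is more than
    4 characters and the defined term is a single word (e.g. a variable name)."""
    return len(first_concept) > 4 and len(second_concept.split()) == 1
```

For the SIR text above, the pair ("Susceptible individuals", "S") would pass, while the reversed pair would not.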

From the description of eidos that I got from @adarshp, I understood that you could write rules that take the output of other rules to define higher level extractions.

I don't know how to translate

"concept", "concept" where the first concept is more than 4 characters and the second concept is at most one word.

into an Eidos rule, but it should be within the scope of what Eidos can detect.

Also, I think it is interesting that the concept and property rules treat "at a per-capita rate $\beta I$" differently from "at a per-capita rate $\gamma$" in that sentence.

scottagt commented 5 years ago

Thanks @jpfairbanks. I agree, it would be nice to be able to extract definitions like we were talking about, for filling in human-explainable generations.

So the rule-based portions of the system are defined with Odin grammars, and they link to this manual: https://arxiv.org/pdf/1509.07513.pdf. I need to read it in detail; I'm not sure yet whether Odin is being developed with definition extraction in mind, where a phrase is recognized and associated with some other phrase as being its definition in the text. It sounds like Odin could support it if we just find the right level of POS to use when defining the rules. Here's an example from the manual.

Hi @adarshp, can you point me to what James is referring to? I've looked at a few Odin-related papers, but I don't see any description of using Odin to extract definitions with rules per se, just events, relationships, etc. Would you be able to comment on whether or not a system like Odin is suited to do that? I would imagine something like term definition would be discussed somewhere among other concepts like events and relationships, given its generality.

scottagt commented 5 years ago

@jpfairbanks The attached image, page 16 of the manual, example 15, is pretty close to what we want, except we want to specify that some previously recognized entity (e.g. "Concept") be used as the trigger instead of the lemma. I've been looking over the manual, and I read at one point that a trigger needs to be a token sequence, e.g. a verb like "phosphorylation" (which could be defined as an event). If the trigger could be a previously recognized entity, then I think we could write definition extraction rules of different forms.

image
scottagt commented 5 years ago

@jpfairbanks Eidos has finished running on all the Epirecipe markdown documents, saved as structured JSON exports. Do you want to have those committed here or no?

jpfairbanks commented 5 years ago

Yes that example is very similar to what we want. The explanation I was referring to was a verbal explanation so I don't have a document for it unfortunately.

If they are small enough, attach them to the issue. Otherwise share them in box.

scottagt commented 5 years ago

@jpfairbanks Ah ok, got it. I'll share on box. Also, it's actually still running, but I'll post what has been annotated so far. For actually looking at examples, though, I recommend we use the web app annotation viewer; we can do that tomorrow for specific examples by putting the text into the web app.

jpfairbanks commented 5 years ago

Cool. I'll take a look and we can talk tomorrow.

adarshp commented 5 years ago

Hi @scottagt, I'm going to actually redirect your question to @marcovzla, who is the author of ODIN - I haven't personally written ODIN rules, but my sense is that a rule to extract definitions is very much in the scope of ODIN.

jpfairbanks commented 5 years ago

Thanks Adarsh.

@marcovzla, we are looking at this textbook and trying to pull out some useful information from the text. One thing that should be really easy to extract is the conceptual definitions of variables that will later be used in equations and code. You can see the example at the top of this thread.

Is there some resource for generally useful rules?

adarshp commented 5 years ago

Besides the ODIN manual, there is also a repo with some small examples, the ODIN Wiki, and the REACH reader, which is a mature project that uses ODIN rules to extract information about biochemical reactions. Perhaps these might be useful to look at?

scottagt commented 5 years ago

Adarsh, thanks for this information!



jpfairbanks commented 5 years ago

It looks like we have a draft of a rule that catches the variable definitions.

- name: simple-np-def
  label: Definition
  priority: 2
  type: token
  pattern: |
    @effect:Concept ("," @cause:Concept)

The next step is to define some more rules to catch more complicated definitions.

scottagt commented 5 years ago

Yes, I created that draft rule taking into account that the web app hardcodes the effect/cause arguments in order to visualize whatever Definitions are extracted. I want to look at @crherlihy's list of definitions and develop at least two new rules.

jpfairbanks commented 5 years ago

So if we run Odin (specifically the ExtractAndExport application) we get text mentions that look like this:

{
  "documents": {
    "877918435": {
      "id": "input1.txt",
      "text": "This is text.\nText in a file.\nFiles are cool.\n",
      "sentences": [
        {
          "words": [
            "This",
            "is",
            "text",
            "."
          ],
          "startOffsets": [
            0,
            5,
            8,
            12
          ],
          "endOffsets": [
            4,
            7,
            12,
            13
          ],
          "raw": [
            "This",
            "is",
            "text",
            "."
          ],
          "tags": [
            "DT",
            "VBZ",
            "NN",
            "."
          ],
          "lemmas": [
            "this",
            "be",
            "text",
            "."
          ],
          "entities": [
            "O",
            "O",
            "O",
            "O"
          ],
          "norms": [
            "O",
            "O",
            "O",
            "O"
          ],
          "chunks": [
            "B-NP",
            "B-VP",
            "B-NP",
            "O"
          ],
          "graphs": {
            "universal-enhanced": {
              "edges": [
                {
                  "source": 2,
                  "destination": 0,
                  "relation": "nsubj"
                },
                {
                  "source": 2,
                  "destination": 1,
                  "relation": "cop"
                },
                {
                  "source": 2,
                  "destination": 3,
                  "relation": "punct"
                }
              ],
              "roots": [
                2
              ]
            },
            "universal-basic": {
              "edges": [
                {
                  "source": 2,
                  "destination": 0,
                  "relation": "nsubj"
                },
                {
                  "source": 2,
                  "destination": 1,
                  "relation": "cop"
                },
                {
                  "source": 2,
                  "destination": 3,
                  "relation": "punct"
                }
              ],
              "roots": [
                2
              ]
            }
          }
        },
        {
          "words": [
            "Text",
            "in",
            "a",
            "file",
            "."
          ],
          "startOffsets": [
            14,
            19,
            22,
            24,
            28
          ],
          "endOffsets": [
            18,
            21,
            23,
            28,
            29
          ],
          "raw": [
            "Text",
            "in",
            "a",
            "file",
            "."
          ],
          "tags": [
            "VB",
            "IN",
            "DT",
            "NN",
            "."
          ],
          "lemmas": [
            "text",
            "in",
            "a",
            "file",
            "."
          ],
          "entities": [
            "O",
            "O",
            "O",
            "O",
            "O"
          ],
          "norms": [
            "O",
            "O",
            "O",
            "O",
            "O"
          ],
          "chunks": [
            "B-VP",
            "B-PP",
            "B-NP",
            "I-NP",
            "O"
          ],
          "graphs": {
            "universal-enhanced": {
              "edges": [
                {
                  "source": 3,
                  "destination": 2,
                  "relation": "det"
                },
                {
                  "source": 3,
                  "destination": 1,
                  "relation": "case"
                },
                {
                  "source": 0,
                  "destination": 3,
                  "relation": "nmod_in"
                },
                {
                  "source": 0,
                  "destination": 4,
                  "relation": "punct"
                }
              ],
              "roots": [
                0
              ]
            },
            "universal-basic": {
              "edges": [
                {
                  "source": 3,
                  "destination": 2,
                  "relation": "det"
                },
                {
                  "source": 3,
                  "destination": 1,
                  "relation": "case"
                },
                {
                  "source": 0,
                  "destination": 3,
                  "relation": "nmod"
                },
                {
                  "source": 0,
                  "destination": 4,
                  "relation": "punct"
                }
              ],
              "roots": [
                0
              ]
            }
          }
        },
        {
          "words": [
            "Files",
            "are",
            "cool",
            "."
          ],
          "startOffsets": [
            30,
            36,
            40,
            44
          ],
          "endOffsets": [
            35,
            39,
            44,
            45
          ],
          "raw": [
            "Files",
            "are",
            "cool",
            "."
          ],
          "tags": [
            "NNS",
            "VBP",
            "JJ",
            "."
          ],
          "lemmas": [
            "file",
            "be",
            "cool",
            "."
          ],
          "entities": [
            "O",
            "O",
            "O",
            "O"
          ],
          "norms": [
            "O",
            "O",
            "O",
            "O"
          ],
          "chunks": [
            "B-NP",
            "B-VP",
            "B-ADJP",
            "O"
          ],
          "graphs": {
            "universal-enhanced": {
              "edges": [
                {
                  "source": 2,
                  "destination": 0,
                  "relation": "nsubj"
                },
                {
                  "source": 2,
                  "destination": 1,
                  "relation": "cop"
                },
                {
                  "source": 2,
                  "destination": 3,
                  "relation": "punct"
                }
              ],
              "roots": [
                2
              ]
            },
            "universal-basic": {
              "edges": [
                {
                  "source": 2,
                  "destination": 0,
                  "relation": "nsubj"
                },
                {
                  "source": 2,
                  "destination": 1,
                  "relation": "cop"
                },
                {
                  "source": 2,
                  "destination": 3,
                  "relation": "punct"
                }
              ],
              "roots": [
                2
              ]
            }
          }
        }
      ]
    }
  },
  "mentions": [
    {
      "type": "TextBoundMention",
      "id": "T:689263908",
      "text": "This",
      "labels": [
        "Concept",
        "Entity"
      ],
      "tokenInterval": {
        "start": 0,
        "end": 1
      },
      "characterStartOffset": 0,
      "characterEndOffset": 4,
      "sentence": 0,
      "document": "877918435",
      "keep": true,
      "foundBy": "simple-np"
    },
    {
      "type": "TextBoundMention",
      "id": "T:-1008405019",
      "text": "is",
      "labels": [
        "Concept",
        "Entity"
      ],
      "tokenInterval": {
        "start": 1,
        "end": 2
      },
      "characterStartOffset": 5,
      "characterEndOffset": 7,
      "sentence": 0,
      "document": "877918435",
      "keep": true,
      "foundBy": "simple-vp"
    },
    {
      "type": "TextBoundMention",
      "id": "T:2114757955",
      "text": "Text in a file",
      "labels": [
        "Concept",
        "Entity"
      ],
      "tokenInterval": {
        "start": 0,
        "end": 4
      },
      "characterStartOffset": 14,
      "characterEndOffset": 28,
      "sentence": 1,
      "document": "877918435",
      "keep": true,
      "foundBy": "simple-vp"
    },
    {
      "type": "TextBoundMention",
      "id": "T:174456323",
      "text": "Files",
      "labels": [
        "Concept",
        "Entity"
      ],
      "tokenInterval": {
        "start": 0,
        "end": 1
      },
      "characterStartOffset": 30,
      "characterEndOffset": 35,
      "sentence": 2,
      "document": "877918435",
      "keep": true,
      "foundBy": "simple-np"
    },
    {
      "type": "TextBoundMention",
      "id": "T:-1926935776",
      "text": "are",
      "labels": [
        "Concept",
        "Entity"
      ],
      "tokenInterval": {
        "start": 1,
        "end": 2
      },
      "characterStartOffset": 36,
      "characterEndOffset": 39,
      "sentence": 2,
      "document": "877918435",
      "keep": true,
      "foundBy": "simple-vp"
    }
  ]
}

What is the ideal storage format for them?
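One lightweight option is to flatten the mentions into simple rows. A minimal Python sketch (the helper is hypothetical; the field names "mentions", "foundBy", "labels", and "text" are taken from the JSON above):

```python
import json

def mention_rows(doc_json: str):
    """Flatten Odin ExtractAndExport output into (rule, label, text) tuples,
    one per mention -- a tabular form suitable for CSV or a dataframe.
    Field names follow the JSON structure shown above."""
    data = json.loads(doc_json)
    return [(m["foundBy"], m["labels"][0], m["text"])
            for m in data.get("mentions", [])]
```

Each row records which rule fired, the top label, and the matched text, which is enough to review rule behavior across documents.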

jpfairbanks commented 5 years ago

@scottagt can you bring the new rules that you wrote into the new repo https://github.com/ml4ai/automates/tree/master/text_reading/src/main/resources/org/clulab/aske_automates/grammars and add them there? Then rerun it on the cookbook?

scottagt commented 5 years ago

@jpfairbanks Sure thing. Do we have our own branch of that repo that we want to use as a staging place? Before making a pull request to their repo, I'd like to check with them about pushing our new rules file and where they want to put it.

jpfairbanks commented 5 years ago

Go ahead and make a fork of their master branch under your account, or push a branch to this fork that I made: https://github.com/jpfairbanks/automates

scottagt commented 5 years ago

@jpfairbanks I got to investigate some new options and became more familiar with the new automates code base. I've ported the new rules into the system and made changes in a branch on our fork of their repo. I'm going to submit a pull request so you can have a look. Attached are sample input and outputs from testing; the output is the JSON file (had to zip it). "Definition" was found as expected. Going to close this one since it's beginning to meander. :)

input1.txt

input1.txt.json.zip