snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0

Reading Annotations from BioC XML #713

Closed MattMorgis closed 7 years ago

MattMorgis commented 7 years ago

Hi Snorkel Team,

This is certainly more of a question that I've been stuck on all day than a bug or an issue.

For what it's worth, we've been experimenting with Snorkel at Elsevier. I've been having a blast using it and following along with the progress the past few weeks, and looking forward to seeing it continue.

With that said, I've been picking apart the CDR demo and how the tagging is done for both Chemical and Disease.

I am going to run through my understanding of what is happening and what I am attempting, and ~hopefully~ maybe one of you can point me in the right direction or fill in the blanks.

Given the following document:

<document>
    <id>227508</id>
    <passage>
        <infon key="type">title</infon>
        <offset>0</offset>
        <text>Naloxone reverses the antihypertensive effect of clonidine.</text>
        <annotation id='0'>
            <infon key="type">Chemical</infon>
            <infon key="MESH">D009270</infon>
            <location offset='0' length='8' />
            <text>Naloxone</text>
        </annotation>
        <annotation id='1'>
            <infon key="type">Chemical</infon>
            <infon key="MESH">D003000</infon>
            <location offset='49' length='9' />
            <text>clonidine</text>
        </annotation>
    </passage>
</document>

It appears to me that in the CDR demo, Snorkel is not looking at the <annotation> tags in the XML. Instead, Snorkel gets the sentence start and end indexes and then looks those indexes up in the unary_tags.pkl.bz2 dictionary. Additionally, if that lookup fails, it appears to look up every word of the sentence in another Chemical and/or Disease dictionary.

Is this actually what is happening? Is it ignoring the <annotation> tag in the BioC XML? How did the unary_tags dictionary get built?

The reason I ask is that I was able to use DNorm to tag Diseases in a corpus of text, but am struggling with how to interpret those annotations and then how to add the entity tags to the database in the same manner that CDR is doing it.

DNorm output:

<document>
    <id>9819429</id>
    <passage>
        <infon key="type">title</infon>
        <offset>0</offset>
        <text>The leukemic protein core binding factor beta (CBFbeta)-smooth-muscle myosin heavy chain sequesters CBFalpha2 into cytoskeletal filaments and aggregates</text>
    </passage>
    <passage>
        <infon key="type">abstract</infon>
        <offset>153</offset>
        <text>The fusion gene CBFB-MYH11 is generated by the chromosome 16 inversion associated with acute myeloid leukemias. This gene encodes a chimeric protein involving the core binding factor beta (CBFbeta) and the smooth-muscle myosin heavy chain (SMMHC). Mouse model studies suggest that this chimeric protein CBFbeta-SMMHC dominantly suppresses the function of CBF, a heterodimeric transcription factor composed of DNA binding subunits (CBFalpha1 to 3) and a non-DNA binding subunit (CBFbeta). This dominant suppression results in the blockage of hematopoiesis in mice and presumably contributes to leukemogenesis. We used transient-transfection assays, in combination with immunofluorescence and green fluorescent protein-tagged proteins, to monitor subcellular localization of CBFbeta-SMMHC, CBFbeta, and CBFalpha2 (also known as AML1 or PEBP2alphaB). When expressed individually, CBFalpha2 was located in the nuclei of transfected cells, whereas CBFbeta was distributed throughout the cell. On the other hand, CBFbeta-SMMHC formed filament-like structures that colocalized with actin filaments. Upon cotransfection, CBFalpha2 was able to drive localization of CBFbeta into the nucleus in a dose-dependent manner. In contrast, CBFalpha2 colocalized with CBFbeta-SMMHC along the filaments instead of localizing to the nucleus. Deletion of the CBFalpha-interacting domain within CBFbeta-SMMHC abolished this CBFalpha2 sequestration, whereas truncation of the C-terminal-end SMMHC domain led to nuclear localization of CBFbeta-SMMHC when coexpressed with CBFalpha2. CBFalpha2 sequestration by CBFbeta-SMMHC was further confirmed in vivo in a knock-in mouse model. These observations suggest that CBFbeta-SMMHC plays a dominant negative role by sequestering CBFalpha2 into cytoskeletal filaments and aggregates, thereby disrupting CBFalpha2-mediated regulation of gene expression.</text>
        <annotation id="1">
            <infon key="MESH">D015470</infon>
            <infon key="type">Disease</infon>
            <location offset="240" length="23"/>
            <text>acute myeloid leukemias</text>
        </annotation>
        <annotation id="2">
            <infon key="MESH">C536227</infon>
            <infon key="type">Disease</infon>
            <location offset="694" length="13"/>
            <text>hematopoiesis</text>
        </annotation>
    </passage>
</document>

ajratner commented 7 years ago

Hi @MattMorgis ,

Will try to answer your question a bit later when I have time, but two things quickly now: (1) Thanks so much for trying out Snorkel and for all the great feedback!! We really appreciate it! (2) This preprocessing step of loading in the data (and entity tags in this case) is not part of the "core" Snorkel. The code you reference is mainly there to get the CDR demo working, so we'd encourage you to use your own preprocessing if it makes more sense!

Thanks, Alex

MattMorgis commented 7 years ago

Great, looking forward to seeing your more detailed response!

I understand that this isn't the "core" of Snorkel, but I am really struggling to connect the dots here.

The CDR tutorial appears to show how to use a custom entity tagger, but again, I'm just very confused at what it's actually doing under the hood.

Did TaggerOne add the <annotation> tags to the CDR.BioC.xml file?

I'm also then very confused by this line:

We discard all of the entity mention annotations and assume we have access to a state-of-the-art entity tagger (see Part I) to identify chemical and disease mentions, and link them to their canonical IDs.

Does that mean that the <annotation> tags in the XML are ignored? Again, were they put in there by TaggerOne? If not, how did they get added in? And if they are in fact ignored, why?

When it says "we assume we have access to a state-of-the-art entity tagger (see Part 1)" - should this demonstrate how to use TaggerOne? However, instead, it's basically faking it out with the aforementioned dictionaries?

Again, my goal is to be able to read in and use the annotations left in XML by DNorm and mark that word or words as a Disease entity in the database. Am I going about this wrong? Should DNorm not be run first? Should I be calling DNorm as it's preprocessing the documents and reading in every sentence?

MattMorgis commented 7 years ago

Pinging for a follow-up. I am still blocked by this and I have moved on to less-than-desirable tagging methods. If anyone has any time, can someone please offer some more insight into how the CDR corpus is being tagged and how that would map to actually using TaggerOne or another NER tagger?

henryre commented 7 years ago

Hi @MattMorgis. The <annotation> tags in CDR.BioC.xml are done by hand, but it's unrealistic to expect this type of human supervision for most real world tasks. So to mimic a realistic scenario, we ignore those and use TaggerOne's CDR annotations. These were preprocessed into a dictionary for easy lookup, which is stored in taggerone_unary_tags_cdr.pkl.bz2. There's also a fallback lookup to see if the exact word appears in MESH in case TaggerOne tagged it as a Chemical or Disease, but didn't tag it with an ID.
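The lookup described above can be sketched roughly as follows. The real layout of taggerone_unary_tags_cdr.pkl.bz2 is not shown in this thread, so the dictionary shapes and the `resolve_tag` helper below are assumptions made for illustration, not the CDR demo's actual code.

```python
from typing import Dict, Optional, Tuple

# ASSUMED shape: (doc_id, start, end) -> (entity_type, concept_id or None)
SpanTags = Dict[Tuple[str, int, int], Tuple[str, Optional[str]]]
# ASSUMED shape: entity_type -> {lowercased surface form -> MESH id}
MeshIds = Dict[str, Dict[str, str]]


def resolve_tag(span_tags: SpanTags, mesh_ids: MeshIds, doc_id: str,
                start: int, end: int, word: str):
    """Return (entity_type, concept_id) for a span, or None if untagged."""
    entry = span_tags.get((doc_id, start, end))
    if entry is None:
        return None  # the tagger produced nothing for this span
    entity_type, concept_id = entry
    if concept_id is None:
        # Fallback: the tagger gave a type but no ID, so try an exact
        # match of the word in the MESH dictionary for that type.
        concept_id = mesh_ids.get(entity_type, {}).get(word.lower())
    return entity_type, concept_id


# Toy data built from the example document in this thread.
span_tags = {
    ("227508", 0, 8): ("Chemical", None),         # typed, but no ID
    ("227508", 49, 58): ("Chemical", "D003000"),  # typed and linked
}
mesh_ids = {"Chemical": {"naloxone": "D009270"}}

first = resolve_tag(span_tags, mesh_ids, "227508", 0, 8, "Naloxone")
second = resolve_tag(span_tags, mesh_ids, "227508", 49, 58, "clonidine")
missing = resolve_tag(span_tags, mesh_ids, "227508", 9, 17, "reverses")
```

Keying on the document ID plus character span keeps the lookup independent of how the sentence was tokenized.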

As @ajratner was getting at, this just provides an example of how to incorporate external resources (like writing the CDRTagger and TaggerOneTagger classes). For a new corpus (where you presumably don't have human annotations), you can certainly run DNorm and/or TaggerOne first, then process their outputs so you can add entity tags in Snorkel.
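As a rough illustration of the "process their outputs" step, here is a minimal sketch that projects character-offset annotations onto tokens. It assumes each token comes with its document-level start offset; `tag_tokens` and the "O" default label are hypothetical names, not part of Snorkel.

```python
from typing import List, Sequence, Tuple


def tag_tokens(words: Sequence[str], char_offsets: Sequence[int],
               annotations: Sequence[Tuple[int, int, str]],
               default: str = "O") -> List[str]:
    """Label each token with the type of the first annotation it overlaps.

    annotations holds (start, end, entity_type) with an exclusive end,
    using the same character coordinates as char_offsets.
    """
    tags = [default] * len(words)
    for i, (word, start) in enumerate(zip(words, char_offsets)):
        end = start + len(word)
        for a_start, a_end, entity_type in annotations:
            # Two half-open intervals overlap when each starts before
            # the other one ends.
            if start < a_end and a_start < end:
                tags[i] = entity_type
                break
    return tags


words = ["Naloxone", "reverses", "the", "antihypertensive",
         "effect", "of", "clonidine", "."]
char_offsets = [0, 9, 18, 22, 39, 46, 49, 58]
annotations = [(0, 8, "Chemical"), (49, 58, "Chemical")]

token_tags = tag_tokens(words, char_offsets, annotations)
```

Overlap matching, rather than exact boundary matching, tolerates tokenizers that split a mention differently from the tagger.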

MattMorgis commented 7 years ago

Thanks for the reply @henryre! This clarifies what is happening in that TaggerOneTagger.tag function.

I think what I am mainly trying to get at is: was there a tool used to preprocess the unary_tags.pkl.bz2 dictionary? DNorm adds the same <annotation> tags that, as I understand it, were done by hand for CDR. I am still unclear how to read in these annotations from the XML.

Did you guys make the corresponding unary_tags.pkl.bz2 by hand too? Or was it done using a tool that I can use to process them as well, such as ddbiolib or snorkel-biocorpus? (I poked around these, but couldn't figure out exactly what they were doing.)

That is where I am still stuck. DNorm has added all of the <annotation> tags to my BioC.xml; however, when I dove into the TaggerOneTagger.tag function, I saw it use the unary_tags.pkl.bz2 and the fallback to the Chemical and Disease MESH dictionaries. I have my own .xml with the same annotation tags, just placed there by DNorm and not by hand, yet I am unsure how to translate that to what is happening in the TaggerOneTagger.tag function since I don't want to ignore them.

you can certainly run DNorm and/or TaggerOne first, then process their outputs so you can add entity tags in Snorkel.

Can someone please take that quote from the previous answer one step further and explain at a high level how the mechanics of that would work? Maybe an easy way to answer what I'm looking for: if you weren't going to ignore the <annotation> tags in CDR, how would you process them? And/or, how did you build the unary_tags.pkl.bz2 dictionary from the generated TaggerOne annotations?

henryre commented 7 years ago

There was no special tool used to do this. You can parse the XML document, look for annotation tags, then save the document id, as well as the annotation tag's location and infon data in a dictionary so that you can tag the appropriate tokens.
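A minimal sketch of that approach, using only Python's standard library. The `read_bioc_annotations` name and the output layout are illustrative assumptions, not the code used to build the CDR dictionary.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

SAMPLE = """<document>
    <id>227508</id>
    <passage>
        <infon key="type">title</infon>
        <offset>0</offset>
        <text>Naloxone reverses the antihypertensive effect of clonidine.</text>
        <annotation id='0'>
            <infon key="type">Chemical</infon>
            <infon key="MESH">D009270</infon>
            <location offset='0' length='8' />
            <text>Naloxone</text>
        </annotation>
        <annotation id='1'>
            <infon key="type">Chemical</infon>
            <infon key="MESH">D003000</infon>
            <location offset='49' length='9' />
            <text>clonidine</text>
        </annotation>
    </passage>
</document>"""


def read_bioc_annotations(xml_string):
    """Return {doc_id: [(start, end, entity_type, mesh_id, text), ...]}."""
    root = ET.fromstring(xml_string)
    tags = defaultdict(list)
    # iter() also yields the root itself, so this works whether
    # <document> is the root or is nested inside a <collection>.
    for doc in root.iter("document"):
        doc_id = doc.findtext("id")
        for ann in doc.iter("annotation"):
            infons = {i.get("key"): i.text for i in ann.findall("infon")}
            for loc in ann.findall("location"):
                start = int(loc.get("offset"))
                end = start + int(loc.get("length"))  # exclusive end
                tags[doc_id].append((start, end, infons.get("type"),
                                     infons.get("MESH"), ann.findtext("text")))
    return dict(tags)


doc_tags = read_bioc_annotations(SAMPLE)
```

In the DNorm output above, the location offsets are document-level (the abstract passage starts at offset 153 and "acute myeloid leukemias" sits at 240), so subtract the passage's <offset> before slicing that passage's text.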

MattMorgis commented 7 years ago

Okay! So my takeaway is that the CDR data was literally all processed manually and there is no out-of-the-box way to interact with or read in <annotation> tags in BioC-formatted documents. We'll have to write our own.

Is there a plan to incorporate this into Snorkel? Would it be worth submitting a PR should we go down this route or is this something you want to specifically leave to users of Snorkel to implement on a case-by-case basis?

MattMorgis commented 7 years ago

If this PR is something that you guys want to include in Snorkel, feel free to re-open and keep the conversation going - I'd love to work on it under some guidance from the team. Otherwise, thanks for taking the time to respond and give clarification!