Closed MattMorgis closed 7 years ago
Hi @MattMorgis ,
Will try to answer your question a bit later when I have time, but two things quickly now: (1) Thanks so much for trying out Snorkel and for all the great feedback!! We really appreciate it! (2) This preprocessing step of loading in the data (and entity tags in this case) is not part of the "core" Snorkel--mainly the code you reference is in there to get the CDR demo working--and so we'd encourage you to use your own preprocessing if it makes more sense!
Thanks, Alex
Great, looking forward to seeing your more detailed response!
I understand that this isn't the "core" of Snorkel, but I am really struggling to connect the dots here.
The CDR tutorial appears to show how to use a custom entity tagger, but again, I'm just very confused at what it's actually doing under the hood.
Did TaggerOne add the <annotation> tags to the CDR.BioC.xml file?
I'm also then very confused by this line:
We discard all of the entity mention annotations and assume we have access to a state-of-the-art entity tagger (see Part I) to identify chemical and disease mentions, and link them to their canonical IDs.
Does that mean that the <annotation> tags in the XML are ignored? Again, were they put in there by TaggerOne? If not, how did they get added in? And if they are in fact ignored, why?
When it says "we assume we have access to a state-of-the-art entity tagger (see Part 1)" - should this demonstrate how to use TaggerOne? However, instead, it's basically faking it out with the aforementioned dictionaries?
Again, my goal is to be able to read in and use the annotations left in XML by DNorm and mark that word or words as a Disease entity in the database. Am I going about this wrong? Should DNorm not be run first? Should I be calling DNorm as it's preprocessing the documents and reading in every sentence?
Pinging for a follow-up. I am still blocked by this and I have moved on to less-than-desirable tagging methods. If anyone has any time, can someone please offer some more insight into how the CDR corpus is being tagged and how that would map to actually using TaggerOne or another NER tagger?
Hi @MattMorgis. The <annotation> tags in CDR.BioC.xml are done by hand, but it's unrealistic to expect this type of human supervision for most real world tasks. So to mimic a realistic scenario, we ignore those and use TaggerOne's CDR annotations. These were preprocessed into a dictionary for easy lookup, which is stored in taggerone_unary_tags_cdr.pkl.bz2. There's also a fallback lookup to see if the exact word appears in MESH in case TaggerOne tagged it as a Chemical or Disease, but didn't tag it with an ID.
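To make the lookup-with-fallback idea concrete, here is a rough sketch. It is not the code in TaggerOneTagger: the tuple layout of tag_dict, the shape of mesh_dicts, the function name resolve_tags, and the IDs are all made up for illustration.

```python
# Hypothetical data shapes (assumptions, not Snorkel's actual internals):
#   tag_dict:   {doc_id: [(entity_type, char_start, char_end, canonical_id_or_None), ...]}
#   mesh_dicts: {entity_type: {lowercased mention text: canonical id}}

def resolve_tags(doc_id, text, tag_dict, mesh_dicts):
    """Return (entity_type, start, end, canonical_id) tuples for one document."""
    resolved = []
    for entity_type, start, end, cid in tag_dict.get(doc_id, []):
        if cid is None:
            # Fallback: the tagger gave a type but no ID, so try an exact
            # string match against the dictionary for that entity type.
            mention = text[start:end].lower()
            cid = mesh_dicts.get(entity_type, {}).get(mention)
        resolved.append((entity_type, start, end, cid))
    return resolved


text = "Aspirin can cause nausea."
# "EXAMPLE-CHEM" and "EXAMPLE-DIS" are placeholder IDs, not real MeSH IDs.
tag_dict = {"doc1": [("Chemical", 0, 7, None), ("Disease", 18, 24, "EXAMPLE-DIS")]}
mesh_dicts = {"Chemical": {"aspirin": "EXAMPLE-CHEM"}}

print(resolve_tags("doc1", text, tag_dict, mesh_dicts))
```

The first mention has no ID from the tagger and picks one up from the dictionary; the second keeps the ID it already had.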
As @ajratner was getting at, this just provides an example of how to incorporate external resources (like writing the CDRTagger and TaggerOneTagger classes). For a new corpus (where you presumably don't have human annotations), you can certainly run DNorm and/or TaggerOne first, then process their outputs so you can add entity tags in Snorkel.
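As a rough illustration of "process their outputs so you can add entity tags", the sketch below maps document-level character spans onto the tokens of one sentence. The function name tag_tokens, its arguments, and the use of "O" as the untagged placeholder are assumptions for this example, not Snorkel's API; a real tagger would adapt this to whatever the parser hands it.

```python
def tag_tokens(sent_start, char_offsets, words, annotations):
    """Assign an entity type and canonical id to each token of one sentence.

    sent_start   -- character offset of the sentence within the document
    char_offsets -- start offset of each token, relative to the sentence
    words        -- token strings, same length as char_offsets
    annotations  -- (entity_type, doc_start, doc_end, canonical_id) tuples,
                    with doc_start/doc_end as document-level character offsets
    """
    entity_types = ["O"] * len(words)
    entity_cids = ["O"] * len(words)
    for i, (offset, word) in enumerate(zip(char_offsets, words)):
        tok_start = sent_start + offset
        tok_end = tok_start + len(word)
        for entity_type, ann_start, ann_end, cid in annotations:
            # Tag a token only when it lies entirely inside the annotated span.
            if ann_start <= tok_start and tok_end <= ann_end:
                entity_types[i] = entity_type
                entity_cids[i] = cid or "O"
                break
    return entity_types, entity_cids


# Document: "He took aspirin. It caused chronic nausea."
# The second sentence starts at character 17.
words = ["It", "caused", "chronic", "nausea", "."]
char_offsets = [0, 3, 10, 18, 24]
# "chronic nausea" covers document characters 27..41; the ID is a placeholder.
annotations = [("Disease", 27, 41, "EXAMPLE-DIS")]

types, cids = tag_tokens(17, char_offsets, words, annotations)
print(list(zip(words, types, cids)))
```

Multi-word mentions work because every token inside the span gets the same type and ID; tokens outside any span keep the placeholder.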
Thanks for the reply @henryre! This clarifies what is happening in that TaggerOneTagger.tag function.
I think what I am mainly trying to get at is: was there a tool that was used to preprocess the unary_tags.pkl.bz2 dictionary? DNorm adds the same <annotation> tags that, as I understand it, were done by hand for CDR. I am still unclear how to read in these annotations from the XML.
Did you guys make the corresponding unary_tags.pkl.bz2 by hand too? Was it done using a tool that I can use to process them as well, such as ddbiolib or snorkel-biocorpus? (I poked around these, but couldn't figure out exactly what they were doing.)
That is where I am still stuck. DNorm has added all of the <annotation> tags to my BioC.xml; however, when I dove into the TaggerOne.tag function, I saw it use the unary_tags.pkl.bz2 and the fallback to the Chemical and Disease MESH dictionaries. I have my own .xml with the same annotation tags, just placed there by DNorm and not by hand, yet I am unsure how to translate that to what is happening in the TaggerOneTagger.tag function, since I don't want to ignore them.
you can certainly run DNorm and/or TaggerOne first, then process their outputs so you can add entity tags in Snorkel.
Can someone please take that quote from the previous answer one step further and explain at a high level how the mechanics of that would work? Maybe an easy way to answer what I'm looking for: if you weren't going to ignore the <annotations> in CDR, how would you process them? And/or, how did you build the unary_tags.pkl.bz2 dictionary from the generated TaggerOne annotations?
There was no special tool used to do this. You can parse the XML document, look for annotation tags, then save the document id, as well as the annotation tag's location and infon data in a dictionary so that you can tag the appropriate tokens.
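A minimal sketch of those steps, using only the Python standard library. It assumes the usual BioC layout (document > id, passage > annotation > infon / location with offset and length attributes) and that the type and identifier live under infon keys "type" and "MESH"; the key your tagger writes may differ, so id_key is a parameter. The function names and the sample document are hypothetical, and this is not how the shipped pickle was necessarily built.

```python
import bz2
import pickle
import xml.etree.ElementTree as ET
from collections import defaultdict


def parse_bioc_annotations(xml_string, type_key="type", id_key="MESH"):
    """Collect {doc_id: [(entity_type, start, end, canonical_id), ...]}."""
    tags = defaultdict(list)
    root = ET.fromstring(xml_string)
    for document in root.iter("document"):
        doc_id = document.findtext("id")
        for annotation in document.iter("annotation"):
            infons = {i.get("key"): i.text for i in annotation.findall("infon")}
            for location in annotation.findall("location"):
                # Assumes offsets are document-level character positions.
                start = int(location.get("offset"))
                end = start + int(location.get("length"))
                tags[doc_id].append(
                    (infons.get(type_key), start, end, infons.get(id_key))
                )
    return dict(tags)


def save_tags(tags, path):
    """Write the dictionary as a bz2-compressed pickle for later lookup."""
    with bz2.open(path, "wb") as f:
        pickle.dump(tags, f)


# Made-up sample document; "EXAMPLE-ID" is a placeholder, not a real MeSH ID.
SAMPLE_XML = """<collection>
  <document>
    <id>12345</id>
    <passage>
      <offset>0</offset>
      <text>Aspirin can cause nausea.</text>
      <annotation id="0">
        <infon key="type">Disease</infon>
        <infon key="MESH">EXAMPLE-ID</infon>
        <location offset="18" length="6"/>
        <text>nausea</text>
      </annotation>
    </passage>
  </document>
</collection>"""

print(parse_bioc_annotations(SAMPLE_XML))
```

Calling save_tags(tags, "my_tags.pkl.bz2") (a hypothetical file name) would then give you a lookup file that a custom tagger can load once and index by document id.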
Okay! So my takeaway is that the CDR data was literally all processed manually and there is no out-of-the-box way to interact with or read in <annotation> tags in BioC-formatted documents. We'll have to write our own.
Is there a plan to incorporate this into Snorkel? Would it be worth submitting a PR should we go down this route or is this something you want to specifically leave to users of Snorkel to implement on a case-by-case basis?
If this PR is something that you guys want to include in Snorkel, feel free to re-open and keep the conversation going - I'd love to work on it under some guidance from the team. Otherwise, thanks for taking the time to respond and give clarification!
Hi Snorkel Team,
This issue is certainly more of a question that I've been stuck on all day rather than a bug or an issue.
For what it's worth, we've been experimenting with Snorkel at Elsevier. I've been having a blast using it and following along with the progress the past few weeks, and looking forward to seeing it continue.
With that said, I've been picking apart the CDR demo and how the tagging is done for both Chemical and Disease. I am going to run through my understanding of what is happening and what I am attempting, and ~hopefully~ maybe one of you can point me in the right direction or fill in the blanks.
Given the following document:
It appears to me that in the CDR demo, Snorkel is not looking at the <annotation> tags in the XML? Instead, Snorkel will get the sentence start and ending indexes and then reference those indexes in the unary_tags.pkl.bz2 dictionary. Additionally, if those tags fail, it appears to look up every word of the sentence in another Chemical and/or Disease dictionary.
Is this actually what is happening? Is it ignoring the <annotation> tag in the BioC XML? How did the unary_tags dictionary get built?
The reason I ask is that I was able to use DNorm to tag Diseases in a corpus of text, but am struggling with how to interpret those annotations and then how to add the entity tags to the database in the same manner that CDR is doing it.
DNORM output: