Closed kuleshov closed 8 years ago
Hey Volodymyr, I'd be happy to take a look later today and help you debug. (I've got jury duty this morning. :/) But could you point me to a notebook where the issue is reproduced?
Okay, I figured out what was the bug. When you are calculating word offsets in Ngrams, you use len(context.words[...])
But words gets loaded from CoreNLP, which sets word to e.g. -LRB-. Its len is 5, but it should be 1 (cause it's just one parenthesis character).
The easiest fix that I just did is to put the original word (i.e. '(' ) into word, and keep -LRB- in the lemma field.
The cleanest way to fix it is to change the schema and add an extra field for the original word (or its length), but that would take a bit more effort.
Hmm, I think this is fixed in master: https://github.com/HazyResearch/snorkel/blob/master/snorkel/parser.py#L242
But maybe didn't make it to your branch?
I guess not, but all is good now.
Okay sorry about that, but glad it's fixed :)
On Mon, Sep 12, 2016 at 9:08 PM Volodymyr Kuleshov notifications@github.com wrote:
I guess not, but all is good now.
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/HazyResearch/snorkel/issues/424#issuecomment-246568173, or mute the thread https://github.com/notifications/unsubscribe-auth/ABgw_fm9QN3UlxU92DWWaPmJweYGF5Q-ks5qpiG2gaJpZM4J6HPY .
Yeah, tables_dev and master separated quite a while ago now. Fortunately, the new tables_merge branch should be folded into master by the end of the week (or early next week), and we'll be working with all the latest changes in master again.
Using the latest commit in table-dev, I want to parse this sentence:
Sentence(Document('17903294', Corpus (GWAS Text Corpus)), 14, u'Elevated circulating levels of hemostatic factors, such as fibrinogen [1-3], plasminogen activator inhibitor (PAI-1) [4,5], von Willebrand factor (vWF) [6], tissue plasminogen activator (tPA) [4,5,7], factor VII (FVII) [8], and D-dimer [9,10] are linked to the development of atherothrombosis and are risk markers for coronary heart disease (CHD), stroke and other cardiovascular disease (CVD) events.')
I want to match the following substring: factor VII (FVII)
For that, I first instantiate ngrams6 = Ngrams(n_max=6) and then
for ngram in ngrams6.apply(s): print ngram
which gives:
41 TemporarySpan("[4,5,7], factor VII", context=None, chars=[192,210], words=[40,45]) 42 TemporarySpan("4,5,7], factor VII (FVII", context=None, chars=[193,216], words=[41,47]) 43 TemporarySpan("], factor VII (FVII", context=None, chars=[198,216], words=[42,47]) 44 TemporarySpan(", factor VII (FVII) [8]", context=None, chars=[199,221], words=[43,51]) 45 TemporarySpan("factor VII (FVII) [8], ", context=None, chars=[201,223], words=[44,52]) 46 TemporarySpan("VII (FVII) [8", context=None, chars=[208,220], words=[45,50]) 47 TemporarySpan("(FVII) [8], an", context=None, chars=[212,225], words=[46,53]) 48 TemporarySpan("FVII) [8],", context=None, chars=[213,222], words=[47,52])
Notice that it never generates "factor VII (FVII)". It seems like it treats ") [8]" as a single entity.
So I think there is a bug. Can someone help me debug?