snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.82k stars 857 forks source link

Sentence parsed incorrectly in table-dev branch #424

Closed kuleshov closed 8 years ago

kuleshov commented 8 years ago

Using the latest commit in table-dev, I want to parse this sentence:

Sentence(Document('17903294', Corpus (GWAS Text Corpus)), 14, u'Elevated circulating levels of hemostatic factors, such as fibrinogen [1-3], plasminogen activator inhibitor (PAI-1) [4,5], von Willebrand factor (vWF) [6], tissue plasminogen activator (tPA) [4,5,7], factor VII (FVII) [8], and D-dimer [9,10] are linked to the development of atherothrombosis and are risk markers for coronary heart disease (CHD), stroke and other cardiovascular disease (CVD) events.')

I want to match the following substring: factor VII (FVII)

For that, I first instantiate ngrams6 = Ngrams(n_max=6) and then

for ngram in ngrams6.apply(s): print ngram

which gives:

41 TemporarySpan("[4,5,7], factor VII", context=None, chars=[192,210], words=[40,45]) 42 TemporarySpan("4,5,7], factor VII (FVII", context=None, chars=[193,216], words=[41,47]) 43 TemporarySpan("], factor VII (FVII", context=None, chars=[198,216], words=[42,47]) 44 TemporarySpan(", factor VII (FVII) [8]", context=None, chars=[199,221], words=[43,51]) 45 TemporarySpan("factor VII (FVII) [8], ", context=None, chars=[201,223], words=[44,52]) 46 TemporarySpan("VII (FVII) [8", context=None, chars=[208,220], words=[45,50]) 47 TemporarySpan("(FVII) [8], an", context=None, chars=[212,225], words=[46,53]) 48 TemporarySpan("FVII) [8],", context=None, chars=[213,222], words=[47,52])

Notice that it never generates "factor VII (FVII)". It seems like it treats ") [8]" as a single entity.

So I think there is a bug. Can someone help me debug?

bhancock8 commented 8 years ago

Hey Volodymyr, I'd be happy to take a look later today and help you debug. (I've got jury duty this morning. :/) But could you point me to a notebook where the issue is reproduced?

kuleshov commented 8 years ago

Okay, I figured out what was the bug. When you are calculating word offsets in Ngrams, you use len(context.words[...])

But words gets loaded from CoreNLP, which sets word to e.g. -LRB-. Its len is 5, but it should be 1 (cause it's just one parenthesis character).

The easiest fix that I just did is to put the original word (i.e. '(' ) into word, and keep -LRB- in the lemma field.

The cleanest way to fix it is to change the schema and add an extra field for the original word (or its length), but that would take a bit more effort.

ajratner commented 8 years ago

Hmm, I think this is fixed in master: https://github.com/HazyResearch/snorkel/blob/master/snorkel/parser.py#L242

But maybe didn't make it to your branch?

kuleshov commented 8 years ago

I guess not, but all is good now.

ajratner commented 8 years ago

Okay sorry about that, but glad it's fixed :)

On Mon, Sep 12, 2016 at 9:08 PM Volodymyr Kuleshov notifications@github.com wrote:

I guess not, but all is good now.

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/HazyResearch/snorkel/issues/424#issuecomment-246568173, or mute the thread https://github.com/notifications/unsubscribe-auth/ABgw_fm9QN3UlxU92DWWaPmJweYGF5Q-ks5qpiG2gaJpZM4J6HPY .

bhancock8 commented 8 years ago

Yeah, tables_dev and master separated quite a while ago now. Fortunately, the new tables_merge branch should be folded into master by the end of the week (or early next week), and we'll be working with all the latest changes in master again.