snorkel-team / snorkel

A system for quickly generating training data with weak supervision
https://snorkel.org
Apache License 2.0
5.81k stars, 857 forks

Matcher functions problems #739

Closed: gengkunling closed this issue 6 years ago

gengkunling commented 7 years ago

I was trying to build an organization matcher using "snorkel.matchers.OrganizationMatcher". However, the matcher seems to work only with Stanford CoreNLP, not with spaCy.

I looked into the code and found that:

kwargs['rgx'] = 'ORGANIZATION' 

This does not work with spaCy because spaCy's NER tags organizations as "ORG" rather than "ORGANIZATION".

The other matchers have similar issues. Is there a simple way to fix this?

ajratner commented 7 years ago

Hi @gengkunling - I pushed a fix, will merge in as soon as tests pass.

More broadly, these matchers are all subclasses of snorkel.matchers.RegexMatchEach, which matches a specified attribute (e.g. the NER tags) of each token against a supplied regex. So in your case, for example, you could instead just use:

from snorkel.matchers import RegexMatchEach

org_matcher = RegexMatchEach(rgx='ORG', attrib='ner_tags')
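To illustrate why the tag name matters, here is a minimal standalone sketch of the "match each token's attribute against a regex" idea. It is not Snorkel's implementation: the helper `matches_each`, the sample tag lists, and the assumption that each tag must match the regex in full are all hypothetical and used only for illustration.

```python
import re


def matches_each(tags, rgx):
    """Return True if the span is non-empty and every tag fully matches rgx.

    Hypothetical helper; full-string matching is an assumption made here,
    not a statement about how Snorkel compiles its regexes.
    """
    pattern = re.compile(rgx)
    return bool(tags) and all(pattern.fullmatch(tag) is not None for tag in tags)


# Made-up NER tags for a two-token span such as "Stanford University".
corenlp_tags = ["ORGANIZATION", "ORGANIZATION"]  # CoreNLP-style label
spacy_tags = ["ORG", "ORG"]                      # spaCy-style label

# The hard-coded 'ORGANIZATION' pattern accepts CoreNLP tags only.
corenlp_pattern_on_spacy = matches_each(spacy_tags, "ORGANIZATION")

# The 'ORG' pattern accepts spaCy tags only (under full matching).
spacy_pattern_on_spacy = matches_each(spacy_tags, "ORG")
spacy_pattern_on_corenlp = matches_each(corenlp_tags, "ORG")

# One pattern that accepts either label.
either = "ORG(ANIZATION)?"
either_on_both = (matches_each(spacy_tags, either)
                  and matches_each(corenlp_tags, either))
```

Under these assumptions, a pattern such as 'ORG(ANIZATION)?' would accept either parser's label, which may be convenient if the same matcher has to run against both CoreNLP and spaCy output.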

Hope this helps! Alex

ajratner commented 6 years ago

This should be fixed now. If not, please re-open.