pul-kit123 / spacy

MIT License

[CLOSED] Question: NER with Spacy #3

Closed pul-kit123 closed 8 months ago

pul-kit123 commented 8 months ago

Issue by viksit Saturday Feb 28, 2015 at 21:12 GMT Originally opened as https://github.com/explosion/spaCy/issues/30


Are there any plans to add NER capabilities to spaCy soon? If not, any recommendations on the most modern techniques for doing so (e.g., perhaps using the word vector representation)?

pul-kit123 commented 8 months ago

Comment by syllog1sm Wednesday Mar 11, 2015 at 04:09 GMT


I'm not really happy with any of the existing algorithms, so I've been working on a novel shift-reduce approach. Briefly, where previous work usually encodes the structure into sequence tags, so that a finite-state machine can be used, I think it makes more sense to use a push-down automaton, now that work in parsing with shift-reduce grammars is so well understood.
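To make the contrast concrete, here is a minimal sketch of the two encodings. It is an illustration only, not the implementation discussed in this thread: the action names (`OPEN-<label>`, `SHIFT`, `CLOSE`), the `State` class, and the helper functions are all hypothetical. The first function flattens entity spans into per-token BIO tags, which a finite-state tagger can predict. The rest models a stack-based transition system, where entities are opened and closed by explicit actions, so nested spans fall out naturally.

```python
from dataclasses import dataclass, field


def to_bio(n_tokens, spans):
    """Sequence-tag encoding: one BIO tag per token (flat spans only)."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for j in range(start + 1, end):
            tags[j] = f"I-{label}"
    return tags


@dataclass
class State:
    """Hypothetical transition state: buffer position, open entities, finished spans."""
    n_tokens: int
    i: int = 0                                   # index of the next unread token
    stack: list = field(default_factory=list)    # open entities as (start, label)
    spans: list = field(default_factory=list)    # finished (start, end, label)


def step(state, action):
    """Apply one action in place; raise ValueError if it is invalid here."""
    name, _, label = action.partition("-")
    if name == "OPEN":
        if state.i >= state.n_tokens:
            raise ValueError("cannot open an entity at end of input")
        state.stack.append((state.i, label))
    elif name == "SHIFT":
        if state.i >= state.n_tokens:
            raise ValueError("no tokens left to shift")
        state.i += 1
    elif name == "CLOSE":
        if not state.stack:
            raise ValueError("no open entity to close")
        start, open_label = state.stack.pop()
        if start == state.i:
            raise ValueError("cannot close an empty entity")
        state.spans.append((start, state.i, open_label))
    else:
        raise ValueError(f"unknown action: {action}")


def oracle(n_tokens, spans):
    """Derive the action sequence for gold spans (assumed properly nested)."""
    actions = []
    for i in range(n_tokens + 1):
        # Close spans ending here, innermost (latest start) first.
        ending = sorted((s for s in spans if s[1] == i), key=lambda s: -s[0])
        actions.extend("CLOSE" for _ in ending)
        if i == n_tokens:
            break
        # Open spans starting here, outermost (latest end) first.
        starting = sorted((s for s in spans if s[0] == i), key=lambda s: -s[1])
        actions.extend(f"OPEN-{label}" for _, _, label in starting)
        actions.append("SHIFT")
    return actions


def run(n_tokens, actions):
    """Execute an action sequence and return the recovered spans, sorted."""
    state = State(n_tokens)
    for action in actions:
        step(state, action)
    if state.i != n_tokens or state.stack:
        raise ValueError("action sequence did not consume the whole input")
    return sorted(state.spans)


tokens = ["Mr.", "Stone", "told", "his", "story", "in", "New", "York"]
flat = [(0, 2, "PERSON"), (6, 8, "GPE")]
print(to_bio(len(tokens), flat))
print(oracle(len(tokens), flat))

# Nested spans have no direct BIO encoding, but the stack handles them.
nested = [(0, 3, "ORG"), (2, 3, "GPE")]   # e.g. "Bank of America"
print(run(3, oracle(3, nested)))
```

In a real system a statistical model would score the valid actions at each state; here the oracle simply replays the gold spans to show that the action sequence and the span set carry the same information.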

I'm just starting to get results for this. Currently accuracy is only 77% on OntoNotes, where the Stanford NER system reportedly gets around 84%. I still need to do a lot of bug-fixing and tuning, and I'm not using gazetteers or any semi-supervised learning at the moment.

So, in short: yes, NER is planned, and the bulk of the work is done. It remains to be seen whether my approach will hit comparable accuracy to previous work, but imo it should. Once the accuracy is good, I then need to design and implement the Python API, and write the testing and deployment code. Probably about 1 month all up, given other things I'm working on.

pul-kit123 commented 8 months ago

Comment by viksit Wednesday Mar 11, 2015 at 06:06 GMT


Thanks, that's an interesting approach. Are there any specific papers you recommend for PDA based NER?

Also, are you inviting code/collaboration on this yet?

pul-kit123 commented 8 months ago

Comment by syllog1sm Wednesday Mar 11, 2015 at 06:30 GMT


As far as I know PDA for NER is a new idea, since most of the previous work uses HMMs and CRFs. If it works, I'll write it up.

I need to set up the contributor agreement before I can accept contributions. That said, I think it's easiest if I do the research parts myself. Collaborating on that gets complicated.

If you want to weigh in on what sort of API you'd like to see though, that would be very welcome.

pul-kit123 commented 8 months ago

Comment by viksit Wednesday Mar 11, 2015 at 21:28 GMT


Ah, I didn't realize it hadn't been tried before. I remember coming across a Chinese NER system that used PDAs, but I can't find that paper. Would you be interested in sharing some high-level thoughts on the PDA/NER approach that you're taking?

Re: collaborating on the research parts, just an idea: it might be interesting to have a shared ipynb or some such, on one of the GitHub-style research collaboration platforms.

Definitely, let me think about the APIs. I've always thought that the GATE and UIMA style, and even the Stanford NER APIs, are super heavyweight.

It would be good to have a visual representation of the parse tree, like NLTK's, as this progresses:

(S
  Over/IN
  (NP a/DT cup/NN)
  of/IN
  (NP coffee/NN)
  ,/,
  (NP Mr./NNP Stone/NNP)
  told/VBD
  (NP his/PRP$ story/NN)
  ./.)

pul-kit123 commented 8 months ago

Comment by syllog1sm Thursday Mar 26, 2015 at 14:13 GMT


Quick update:

This is progressing well: I'm now getting 81% on the OntoNotes WSJ corpus. I expect gazetteers from Wikidata will bring this in line with the current state-of-the-art.

It's hard to say, but this might be ready by the end of April.

pul-kit123 commented 8 months ago

Comment by honnibal Monday Apr 13, 2015 at 21:28 GMT


NER now included, although the model still needs accuracy improvements. Currently it's getting 82% F on OntoNotes, and 86% on CoNLL '03. State-of-the-art is around 85% and 90% on these benchmarks. Improvements are in the works.
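For context on the "F" figures, here is a minimal sketch of entity-level precision, recall, and F1. It assumes exact-match scoring, where a predicted entity counts as correct only if its start, end, and label all match a gold entity. The function name `prf` and the example spans are hypothetical, and this is not the evaluation script behind the numbers quoted above.

```python
def prf(gold, pred):
    """Return (precision, recall, F1) over exact-match (start, end, label) spans."""
    gold, pred = set(gold), set(pred)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 2, "PERSON"), (6, 8, "GPE"), (10, 11, "ORG")]
# The second prediction has the right boundaries but the wrong label,
# so under exact matching it counts as both a false positive and a miss.
pred = [(0, 2, "PERSON"), (6, 8, "ORG"), (10, 11, "ORG")]
precision, recall, f1 = prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F={f1:.3f}")
```

Because a single boundary or label error costs both precision and recall, entity-level F is a stricter measure than per-token accuracy.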

pul-kit123 commented 8 months ago

Comment by viksit Monday Apr 13, 2015 at 21:28 GMT


Sweet - is there a pointer on usage?

pul-kit123 commented 8 months ago

Comment by lock[bot] Wednesday May 09, 2018 at 18:31 GMT


This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.