sloria / textblob-aptagger

*Deprecated* A fast and accurate part-of-speech tagger for TextBlob.
MIT License

Bug fix: Start symbols #1

Open syllog1sm opened 11 years ago

syllog1sm commented 11 years ago

Found a bug in my implementation.

On line 72 of taggers.py we have this:

```python
prev, prev2 = self.START
```

This initialises the tag-history features to the dummy START symbols, but they actually need to be reinitialised at the start of every sentence.

This bug is unlikely to significantly affect accuracy. Currently, the "previous tag" feature at the start of each sentence will almost always be set to ".", because the previous sentence probably ended with a period. The correct value is the dummy start symbol, but the two features will usually carry the same information.
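
For concreteness, a minimal sketch of the fix, assuming a `tag()` method shaped like the one in the blog post's tagger (the sentence-splitting and normalisation helper names here are illustrative, not necessarily the exact names in taggers.py):

```python
def tag(self, corpus):
    tokens = []
    for words in self._split_sentences(corpus):  # hypothetical sentence splitter
        # The fix: reinitialise the tag-history features for EVERY sentence,
        # so sentence-initial words see the dummy START symbols rather than
        # the final tags of the previous sentence (usually '.').
        prev, prev2 = self.START
        context = self.START + [self._normalize(w) for w in words] + self.END
        for i, word in enumerate(words):
            tag = self.tagdict.get(word)
            if not tag:
                features = self._get_features(i, word, context, prev, prev2)
                tag = self.model.predict(features)
            tokens.append((word, tag))
            prev2, prev = prev, tag
    return tokens
```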

If you fix the bug, you'll need to regenerate the model. This is the weakness of having a binary model committed to the repo... The old model is in the history now. Once we've made three or four updates, the repository will be quite large.

sloria commented 11 years ago

Ok, I've made the change. Thanks for catching it. I haven't re-trained the model because I wasn't sure what corpus you originally used. Could you help me with this? BTW, I've added you as a collaborator so you have direct commit access. If you'd like to be maintainer of this, I will gladly pass over ownership of the repo to you.

If the size of the git file tree becomes too large, we can stop committing the model to the repo and only publish it with the package on PyPI. For the short term, though, I think committing it will be fine so long as we limit the number of re-trainings.
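
For reference, shipping the trained model as package data rather than a tracked file could look something like this (a sketch only; the package layout and model filename are assumptions, not the repo's actual artifacts):

```python
# setup.py sketch: bundle the pickled model into the sdist/wheel so it ships
# with the PyPI release without living in the git history.
from setuptools import setup

setup(
    name="textblob-aptagger",
    packages=["textblob_aptagger"],
    # The model filename below is an assumption for illustration.
    package_data={"textblob_aptagger": ["trontagger.pickle"]},
)
```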

syllog1sm commented 11 years ago

Ah, don't make the change before we update the model! If the run-time feature extraction doesn't match the feature extraction used during training, accuracy often goes down substantially.
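
To see why: the averaged perceptron's weights are keyed on literal feature strings, so features the model never saw at training time contribute nothing. A toy illustration (the feature-name format is illustrative):

```python
# Weights learned with one feature extractor are keyed on exact strings.
# If training emitted 'i-1 tag -START-' for sentence-initial words but the
# runtime extractor emits 'i-1 tag .' (or vice versa), the learned weights
# for those features are never looked up, and the sentence-boundary signal
# is silently lost.
trained_weights = {"i-1 tag -START-": {"NNP": 2.3, "DT": 1.1}}

runtime_feature = "i-1 tag ."                # mismatched runtime extractor
print(trained_weights.get(runtime_feature))  # None -> no learned signal applies
```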


sloria commented 11 years ago

I haven't published the update to PyPI; I've only committed it to the dev branch. Let's make all the necessary changes before committing the updated model file to the repo.

sloria commented 10 years ago

@syllog1sm Just to follow up: could I get your assistance in retraining the model? Also, how do you feel about transferring ownership of this repo to you? This is your hard work, after all. =)

syllog1sm commented 10 years ago

Hi Steven,

Actually could you email me? honnibal@gmail.com

It's worth having a chat about this stuff. The thing is, the code for that project isn't that valuable; what makes it valuable is the data, which the LDC keeps gated and which is quite expensive. We may be able to distribute trained models, but if we're doing that, maybe we don't want to use the demo code I wrote for the blog post.

It's also worth chatting about the follow-up post: http://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/ . I don't know whether you saw it!
