hipster-philology / pandora

A Tagger-Lemmatizer for Natural Languages
MIT License

Different embedding matrices for token and lemma characters #42

Open emanjavacas opened 7 years ago

emanjavacas commented 7 years ago

Currently, token characters and lemma characters each have their own embedding matrix. This does not add many model parameters, since the character vocabulary is typically very small, but it is probably wasteful in terms of updates, because the same character has to be learned twice. I can imagine that using the same embedding space for both could help...
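To put a rough number on the parameter argument, here is a minimal sketch in plain Python. It is not pandora's code: the helper names (`build_char_vocab`, `embedding_params`), the toy word lists and the embedding size of 50 are all assumptions chosen for illustration.

```python
# Illustrative sketch only; none of these names come from pandora.

def build_char_vocab(words):
    """Map each distinct character to an integer index (sorted for determinism)."""
    return {ch: i for i, ch in enumerate(sorted(set("".join(words))))}


def embedding_params(vocab, emb_dim):
    """Number of weights in an embedding matrix of shape (len(vocab), emb_dim)."""
    return len(vocab) * emb_dim


# Toy data (assumed): surface tokens and their lemmas.
tokens = ["cats", "sat", "running"]
lemmas = ["cat", "sit", "run"]
EMB_DIM = 50  # assumed embedding size

token_vocab = build_char_vocab(tokens)
lemma_vocab = build_char_vocab(lemmas)
shared_vocab = build_char_vocab(tokens + lemmas)

# Two matrices: one per side. One matrix: a single lookup over the union.
separate_params = (embedding_params(token_vocab, EMB_DIM)
                   + embedding_params(lemma_vocab, EMB_DIM))
shared_params = embedding_params(shared_vocab, EMB_DIM)

print("separate:", separate_params, "shared:", shared_params)
```

With a realistic alphabet the absolute saving stays small either way. The more relevant effect of sharing is that every occurrence of a character, on the token side or the lemma side, updates the same row.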

mikekestemont commented 7 years ago

I see why this could speed things up. However, I also believe that this should be something we can switch off, because we cannot rule out the situation where tokens and lemmas use different alphabets, right?
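One way such a switch could look is sketched below in plain Python. Everything here is hypothetical: the `share_embeddings` flag, the `build_char_vocabs` and `alphabet_overlap` helpers, and the Greek/Latin toy pair are assumptions for illustration, not existing pandora options.

```python
# Hypothetical sketch of an on/off switch; not an existing pandora setting.

def _index(words):
    """Map each distinct character to an integer index."""
    return {ch: i for i, ch in enumerate(sorted(set("".join(words))))}


def build_char_vocabs(tokens, lemmas, share_embeddings=True):
    """Return (token_vocab, lemma_vocab).

    With share_embeddings=True both names point to one vocabulary built over
    the union of characters, so one embedding matrix can serve both sides.
    Otherwise each side gets its own vocabulary (and its own matrix).
    """
    if share_embeddings:
        shared = _index(tokens + lemmas)
        return shared, shared
    return _index(tokens), _index(lemmas)


def alphabet_overlap(tokens, lemmas):
    """Jaccard overlap between the two character sets (0.0 means disjoint)."""
    token_chars = set("".join(tokens))
    lemma_chars = set("".join(lemmas))
    union = token_chars | lemma_chars
    if not union:
        return 0.0
    return len(token_chars & lemma_chars) / len(union)


# Assumed edge case: tokens in Greek script, lemmas in Latin transliteration.
greek_tokens = ["λόγος"]
latin_lemmas = ["logos"]

overlap = alphabet_overlap(greek_tokens, latin_lemmas)
# Sharing buys nothing when the alphabets are disjoint, so switch it off.
tok_vocab, lem_vocab = build_char_vocabs(
    greek_tokens, latin_lemmas, share_embeddings=overlap > 0.0
)
```

Deriving the flag from the overlap is only one possible default. Exposing it as a plain user setting would serve the same purpose.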

emanjavacas commented 7 years ago

Let's make this low priority, then.
