hipster-philology / pandora

A Tagger-Lemmatizer for Natural Languages
MIT License

Adding ability to pretrain embeddings on a different corpus #83

Closed Jean-Baptiste-Camps closed 6 years ago

Jean-Baptiste-Camps commented 6 years ago

Attempt to implement an option to pretrain embeddings from a different corpus, as in #80, with a CLI option. (Not done yet: documenting it in the README.)

Jean-Baptiste-Camps commented 6 years ago

I'm not sure Thibault will be able to review anytime soon, so I'm adding the two of you, if you can have a look.

emanjavacas commented 6 years ago

I am not sure about the use of the pretrainer there. I think the idea is to have embeddings already pretrained in some format, and load them directly into the model.
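The approach sketched here would be loading vectors that were pretrained elsewhere straight into the model. A minimal sketch of what that could look like (the function name, file format, and the way pandora builds its vocabulary are all assumptions, not the project's actual API): read a word2vec-style text file and build an embedding matrix aligned with the model's vocabulary, falling back to small random vectors for out-of-vocabulary words.

```python
import numpy as np

def load_pretrained_embeddings(path, vocab, dim, seed=0):
    """Build an embedding matrix aligned with `vocab` from a
    word2vec-style text file ("word v1 v2 ..." per line).
    Words missing from the file keep small random vectors."""
    rng = np.random.RandomState(seed)
    pretrained = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            if len(parts) != dim + 1:
                continue  # skip header or malformed lines
            pretrained[parts[0]] = np.array(parts[1:], dtype='float32')
    # random init for the whole vocab, then overwrite known words
    matrix = rng.uniform(-0.05, 0.05, (len(vocab), dim)).astype('float32')
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
    return matrix
```

The resulting matrix could then be handed to a Keras `Embedding` layer via its `weights` argument, which avoids running any pretraining step inside pandora itself.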

Jean-Baptiste-Camps commented 6 years ago

We could do it that way too, in which case we would need a separate function and CLI command, I guess. I coded it this way to keep the modification minimal, just adding an argument, and to avoid having to do two things instead of one, but the other approach might be more efficient in the long run.

emanjavacas commented 6 years ago

I haven't looked too much into the entire loading logic, to be honest. But if you ask me, I'd rewrite multiple parts to make it more efficient and concise. There seems to be quite a lot of code duplication because of the way we are dealing with multiple scenarios (do you do dev test, do you include pos, lemma, etc.). Also, it's storing all data in memory, although we could use generators to avoid that.
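The generator idea could look something like the sketch below (the file layout and column order are hypothetical, not pandora's actual format): instead of reading every annotated file into one big in-memory list, yield one token triple at a time so the corpus is streamed.

```python
import os

def iter_tokens(directory):
    """Yield (token, lemma, pos) triples one line at a time rather than
    materialising the whole corpus in memory. Assumes tab-separated
    files with one token per line; the column layout is hypothetical."""
    for fname in sorted(os.listdir(directory)):
        if not fname.endswith('.tsv'):
            continue
        with open(os.path.join(directory, fname), encoding='utf-8') as f:
            for line in f:
                cols = line.rstrip('\n').split('\t')
                if len(cols) == 3:
                    yield tuple(cols)
```

A consumer can then iterate over `iter_tokens(path)` directly, or batch it, without the memory footprint ever exceeding one line plus the batch.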


mikekestemont commented 6 years ago

I agree that this is a huge issue, especially for larger corpora, resulting from the fact that I originally developed this with a much less generic goal in mind. I am willing to put some work into this, but then we should immediately discuss the whole architecture too. Most importantly, we were considering applying a BiLSTM at the verse or sentence level for pos-tagging (which would be a huge change, but manageable if we rewrite the entire core anyway).

emanjavacas commented 6 years ago

Yes, instead of only using a context window, you should be able to use the entire sequence. The only thing to keep in mind is that you don't recompute the RNN over the sequence for each of the predictions in the sequence. I guess it would be hard to implement that setup with Keras.
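The point about not recomputing the RNN per prediction can be illustrated with a toy, numpy-only sketch (a deliberately simplified recurrence, not a real LSTM and not pandora's code): run a forward and a backward pass over the sequence once, then read one feature row per token from the concatenated states, rather than re-running the network on a context window around every token.

```python
import numpy as np

def bilstm_style_features(embeddings, W_f, W_b):
    """Toy whole-sequence tagger features: one forward and one backward
    recurrence over the sequence, computed ONCE. Each token then gets
    its prediction features from the concatenated hidden states.
    (Simplified recurrence with identity state feedback, for clarity.)"""
    T, _ = embeddings.shape
    h = W_f.shape[0]
    fwd = np.zeros((T, h))
    bwd = np.zeros((T, h))
    state = np.zeros(h)
    for t in range(T):                      # single left-to-right pass
        state = np.tanh(W_f @ embeddings[t] + state)
        fwd[t] = state
    state = np.zeros(h)
    for t in reversed(range(T)):            # single right-to-left pass
        state = np.tanh(W_b @ embeddings[t] + state)
        bwd[t] = state
    # one feature row per token, reused for every per-token prediction
    return np.concatenate([fwd, bwd], axis=1)
```

In Keras terms this would correspond to a `Bidirectional` LSTM with `return_sequences=True` feeding a `TimeDistributed` softmax, so the whole sequence is tagged in one pass.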

PonteIneptique commented 6 years ago

I am reacting to one point from @mikekestemont: I would absolutely try to avoid building in expectations such as sentences (if this relies on punctuation) or verses (which rely on the text's topography). This would narrow down the ability of the tagger to be as language-independent as it can be right now, and would mean even more overfitting to the data: sentence punctuation is not a universal method shared across editions and editors, and it would most probably lead to datasets being incompatible to some degree. On my end, when I have to deal with classical or medieval texts, I remove punctuation. I believe it is far too editor-specific.

Jean-Baptiste-Camps commented 6 years ago

Not sure we are on the right thread for this conversation, but I agree on the punctuation point. It is a bit different for verses, though, which are consubstantial with the text itself. That said, you can have sentences tagged independently of punctuation (<s> tags and the like), but that too is interpretative and editor-specific.

mikekestemont commented 6 years ago

This new functionality now raises an error when you don't want to load embeddings (which should be the default). It also seems that the pushed code is incomplete, because load_annotated_dir isn't able to load embeddings?

Jean-Baptiste-Camps commented 6 years ago

OK, apparently the bug appeared afterwards. Actually, load_annotated_dir is just there to load the data that is then used to pretrain embeddings (if you use another corpus to pretrain embeddings).