kolloldas / torchnlp

Easy to use NLP library built on PyTorch and TorchText
Apache License 2.0

Using only encoder part for word accentation #7

Open aleksas opened 5 years ago

aleksas commented 5 years ago

Should it be possible to use only the Transformer's encoder part to train word accentuation for the Lithuanian language? In Lithuanian, stress is somewhat tricky, as it can vary depending on context along with word meaning (e.g. grammatical case). You mentioned in your post using only the encoder part for one-to-one mapping. For Lithuanian accentuation, there are three types of accent, and the position of the accent within the word varies a lot. A word can also have no accent at all. Any suggestions?

kolloldas commented 5 years ago

So is the accent on particular characters? You could define tags at the character level and basically work with a character-level Transformer encoder.
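A minimal sketch of what such a character-level encoder tagger could look like. It uses PyTorch's built-in `torch.nn.TransformerEncoder` rather than this library's own Transformer module, and the class name `CharTagger`, the hyperparameters, and the convention that id 0 means padding are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class CharTagger(nn.Module):
    """Encoder-only Transformer that predicts one tag per input character."""

    def __init__(self, vocab_size, num_tags, d_model=64, nhead=4,
                 num_layers=2, max_len=512):
        super().__init__()
        # Assumption: character id 0 is reserved for padding.
        self.char_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        # Learned position embeddings, chosen here for brevity.
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead,
            dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.out = nn.Linear(d_model, num_tags)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer tensor of character ids.
        positions = torch.arange(char_ids.size(1), device=char_ids.device)
        x = self.char_emb(char_ids) + self.pos_emb(positions)
        pad_mask = char_ids.eq(0)  # True where attention should ignore input
        hidden = self.encoder(x, src_key_padding_mask=pad_mask)
        return self.out(hidden)  # (batch, seq_len, num_tags) logits
```

Training would then typically minimise a per-character cross-entropy loss over the tag logits, skipping padded positions.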

aleksas commented 5 years ago

Yes, the accent is on a specific letter. Does the Transformer need a dictionary for character-level tagging? What should my next steps be in order to train a Transformer for accentuation of Lithuanian? I have a dataset of ~13K sentences with accentuation. I suspect it may not be enough to train a Transformer, but I'm very keen to try...

kolloldas commented 5 years ago

I think you can map the input directly to the Unicode character values. The infrastructure around the Tagger classes currently works at a word (+char) level. We'll have to make it more generic to handle character-only input (an incentive for me to work on this!).
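As a rough illustration of mapping characters to Unicode values: `ord` gives the code point directly, but since code points range up to 0x10FFFF, it may be more practical to remap the ones that actually occur in the data to a compact index. The helper names and the reserved `PAD`/`UNK` ids below are hypothetical, not part of this library.

```python
PAD, UNK = 0, 1  # hypothetical reserved ids for padding and unseen characters


def build_index(sentences):
    """Give every code point seen in the data a small consecutive id."""
    code_points = sorted({ord(ch) for s in sentences for ch in s})
    return {cp: i + 2 for i, cp in enumerate(code_points)}


def encode(text, index):
    """Turn a string into a list of ids, one per character."""
    return [index.get(ord(ch), UNK) for ch in text]


index = build_index(["ab", "ba c"])
print(encode("cab z", index))  # [5, 3, 4, 2, 1]
```

No separate dictionary file is needed with this approach, because the index is derived from the training sentences themselves.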

But the Transformer module is independent of the input (check this file).

13K sentences should be more than enough if you're working at a character level. Do you have tags for each character (including none)?

aleksas commented 5 years ago

I do have tags for each char.
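For reference, per-character tags of this kind could also be derived from accented text with the standard `unicodedata` module. This is only a sketch: it assumes the three accent types are written as combining grave, acute, and tilde marks, and the tag names and the `split_accents` helper are made up for illustration.

```python
import unicodedata

# Assumed tag set: three stress marks plus "no accent".
ACCENT_TAGS = {
    "\u0300": "GRAVE",
    "\u0301": "ACUTE",
    "\u0303": "TILDE",
}
NO_ACCENT = "NONE"


def split_accents(word):
    """Return (plain_text, tags) with one tag per character of plain_text."""
    chars, tags = [], []
    for ch in unicodedata.normalize("NFD", word):
        if ch in ACCENT_TAGS and chars:
            # A stress mark becomes the tag of the preceding letter.
            tags[-1] = ACCENT_TAGS[ch]
        elif unicodedata.combining(ch) and chars:
            # Other diacritics (ogonek, caron, dot above) stay on the letter.
            chars[-1] = unicodedata.normalize("NFC", chars[-1] + ch)
        else:
            chars.append(ch)
            tags.append(NO_ACCENT)
    return "".join(chars), tags


print(split_accents("la\u0303bas"))
# ('labas', ['NONE', 'TILDE', 'NONE', 'NONE', 'NONE'])
```

The plain text serves as the model input and the tag list as the target, so both sequences have the same length.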