Open aleksas opened 5 years ago
So is the accent on particular characters? You could define tags at the character level and basically work with a character-level Transformer encoder.
Yes, the accent is on a specific letter. Does the Transformer need a dictionary for character-level tagging? What should my next steps be to train a Transformer for accentuation of Lithuanian? I have a dataset of ~13K sentences with accentuation. I suspect that may not be enough to train a Transformer, but I'm very keen to try...
I think you can map the input directly to the Unicode character values. The infrastructure around the Tagger classes currently works at a word (+char) level. We'll have to make it more generic to handle character-only input (an incentive for me to work on this!).
But the Transformer module is independent of the input (check this file).
13K sentences should be more than enough if you're working at a character level. Do you have tags for each character (including none)?
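To make the "map the input directly to the Unicode character values" idea concrete, here is a minimal sketch in plain Python. It is not the library's input pipeline. The names `build_vocab` and `encode`, and the reserved ids `PAD` and `UNK`, are assumptions made up for illustration. Raw code points are remapped to a compact id range so that an embedding table would only need as many rows as there are distinct characters in the data.

```python
PAD, UNK = 0, 1  # reserved ids (an assumption of this sketch)


def build_vocab(sentences):
    """Assign a compact id to every Unicode code point seen in the data."""
    code_points = sorted({ord(ch) for s in sentences for ch in s})
    # Ids start at 2 because 0 and 1 are reserved for padding and unknowns.
    return {cp: i + 2 for i, cp in enumerate(code_points)}


def encode(sentence, vocab, max_len):
    """Turn a sentence into a fixed-length list of character ids."""
    ids = [vocab.get(ord(ch), UNK) for ch in sentence[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


vocab = build_vocab(["labas rytas", "ačiū"])
print(encode("labas", vocab, 8))
```

Since the ids come straight from the characters in the training data, no hand-made dictionary should be needed; characters never seen in training fall back to `UNK`.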
I do have tags for each char.
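For anyone whose data is accented text rather than ready-made tags, here is a hedged sketch of deriving one tag per character. It assumes the accents are written with the combining grave, acute and tilde marks (or their precomposed forms), which may not match how a given dataset is stored. `ACCENT_TAGS` and `split_accents` are hypothetical names, not part of any library. Other diacritics that belong to the letter itself (ogonek, caron, dot above, macron) are kept on the letter.

```python
import unicodedata

# Assumption: accents are encoded as combining grave, acute or tilde marks.
ACCENT_TAGS = {"\u0300": "GRAVE", "\u0301": "ACUTE", "\u0303": "TILDE"}


def split_accents(text):
    """Return (plain_chars, tags) with exactly one tag per plain character."""
    chars, tags = [], []
    for ch in unicodedata.normalize("NFD", text):
        if ch in ACCENT_TAGS and chars:
            # The accent mark belongs to the letter just before it.
            tags[-1] = ACCENT_TAGS[ch]
        elif unicodedata.combining(ch) and chars:
            # Non-accent diacritic: re-attach it to the letter.
            chars[-1] = unicodedata.normalize("NFC", chars[-1] + ch)
        else:
            chars.append(ch)
            tags.append("NONE")
    return chars, tags


chars, tags = split_accents("la\u0303bas")
print(list(zip(chars, tags)))
```

Every character gets a tag, including `NONE`, so the input and the tag sequence always have the same length.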
Would it be possible to use only the Transformer's encoder part to train word accentuation for Lithuanian? In Lithuanian, stress is somewhat tricky, as it can vary depending on context and on word meaning (e.g. grammatical case). You mentioned in your post using only the encoder part for one-to-one mapping. For Lithuanian accentuation, there are three types of accent, and the position of the accent within the word varies a lot. A word can also have no accent at all. Any suggestions?
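One way to read the one-to-one mapping here: each input character gets exactly one of four labels (none, or one of the three accent types), so the accent position is simply whichever character receives a non-`NONE` label. The sketch below only shows the decoding side, turning per-character labels back into accented text. It assumes the three accents are rendered as combining grave, acute and tilde marks. `TAG_TO_MARK` and `apply_tags` are hypothetical names, and the tags are hard-coded stand-ins for model predictions.

```python
import unicodedata

# Assumption: the four labels and the combining marks they stand for.
TAG_TO_MARK = {"NONE": "", "GRAVE": "\u0300", "ACUTE": "\u0301", "TILDE": "\u0303"}


def apply_tags(chars, tags):
    """Rebuild accented text from characters and one predicted tag each."""
    if len(chars) != len(tags):
        raise ValueError("need exactly one tag per character")
    accented = "".join(ch + TAG_TO_MARK[tag] for ch, tag in zip(chars, tags))
    # Compose letter + mark into a single code point where one exists.
    return unicodedata.normalize("NFC", accented)


# Stand-in for model output: one label per character of "labas".
predicted = ["NONE", "TILDE", "NONE", "NONE", "NONE"]
print(apply_tags("labas", predicted))
```

A word with no accent is just a run of `NONE` labels, so it needs no special handling.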