dstoekl closed this issue 5 months ago
When training from scratch, it advances by one codepoint.
On 24/06/17 01:43PM, Daniel Stoekl wrote:
> on training from scratch it advances by one codepoint
How are they encoded in the XML? As normal XML entities?
I can't reproduce it. When I train on other data containing > it works as expected, and nothing treats those entities differently, in particular not the codec. If you give me access to your data I can take a look.
Thinking about it, this is probably caused by Unicode mirroring: > is a mirrored character. The codec is constructed while the text is still in logical order, but the actual encoding happens on lines already transformed into display order, which maps 0x003E to 0x003C. That code point isn't in the codec because you don't have any 0x003C (<) in your data.
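To make the mismatch concrete, here is a minimal sketch in plain Python. It is not kraken's code: `build_codec` and `to_display_order` are hypothetical helpers, the mirror table is a small excerpt, and the reordering is a toy stand-in that only holds for a purely right-to-left line, not the full Unicode Bidirectional Algorithm.

```python
import unicodedata

# Illustrative excerpt of Unicode mirrored pairs, not the full table.
MIRROR_PAIRS = {"<": ">", ">": "<", "(": ")", ")": "("}


def build_codec(lines):
    """Assign a 1-based integer label to every character seen in the lines."""
    alphabet = sorted(set("".join(lines)))
    return {char: label for label, char in enumerate(alphabet, start=1)}


def to_display_order(line):
    """Toy reordering for a purely right-to-left line: reverse, then mirror."""
    return "".join(MIRROR_PAIRS.get(char, char) for char in reversed(line))


# Hebrew letters around a ">" sign, stored in logical order.
logical_line = "\u05d0\u05d1 > \u05d2\u05d3"

codec = build_codec([logical_line])            # built from logical order
display_line = to_display_order(logical_line)  # what the encoder is given

# Characters the encoder meets that were never added to the codec.
unencodable = sorted(set(display_line) - set(codec))

print(unicodedata.mirrored(">"))  # 1: ">" has the Bidi_Mirrored property
print(unencodable)                # ['<']
```

The codec only ever saw ">", but the display-order line contains "<", which is the situation described above.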
As a hotfix, you can manually supply a codec that includes < when training from scratch. I need to think a bit about whether just creating the codec from display-order lines is going to break anything else.
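One way to prepare such a codec is to collect the alphabet from the training text and add the mirrored counterpart of every mirrored character. The sketch below only computes that alphabet; `codec_alphabet` and `MIRROR_PAIRS` are hypothetical names, the table is an excerpt, and turning the result into a codec definition for the trainer is left out because it depends on the format the trainer expects.

```python
# Illustrative excerpt of Unicode mirrored pairs, not the full table.
MIRROR_PAIRS = {"<": ">", ">": "<", "(": ")", ")": "("}


def codec_alphabet(lines):
    """Return the sorted characters in the lines plus their mirrored partners."""
    chars = set("".join(lines))
    # Add the counterpart of each mirrored character so that both the
    # logical-order and the display-order form can be encoded.
    chars |= {MIRROR_PAIRS[char] for char in chars if char in MIRROR_PAIRS}
    return sorted(chars)


alphabet = codec_alphabet(["a > b"])
print(alphabet)  # [' ', '<', '>', 'a', 'b']
```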
You can also disable the mirroring by wrapping the > in Left-to-Right markers, which makes it compatible with fine-tuning as well.
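A minimal sketch of that preprocessing step, assuming "Left-to-Right marker" means U+200E LEFT-TO-RIGHT MARK and that the transcription text can be rewritten before training. `protect_angle_brackets` is a hypothetical helper, not part of any library.

```python
import re
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK (assumed to be the marker meant above)


def protect_angle_brackets(text):
    """Wrap every < and > in left-to-right marks so they resolve as LTR."""
    return re.sub(r"[<>]", lambda match: f"{LRM}{match.group(0)}{LRM}", text)


protected = protect_angle_brackets("\u05d0\u05d1 > \u05d2\u05d3")

print(unicodedata.bidirectional(LRM))  # L: a strong left-to-right character
print(protected.count(LRM))            # 2
```

The function is not idempotent: running it twice on the same text wraps each bracket twice, so apply it once to the raw transcription.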
It is unclear to users and will crash in eScr
Fine-tuning on ALTO files containing > or < fails with the following error: