Open cblanken opened 2 months ago
I think " should be a separate word so that it does not interfere with actual words. I'll look into this. Could you please copy-paste a sentence that I can test this with?
This is the source paragraph from the example above.
Es ist auch möglich, eine beim Händler gekaufte Disk-Version von Civilization VI mit einem Steam-Konto zu verknüpfen. Dazu klickt man zunächst in der Bibliothek auf die Schaltfläche "Ein Produkt auf Steam aktivieren ..." und stimmt den Nutzungsbedingungen zu. Dann gibt man den Produktschlüssel ein und klickt auf "weiter". Danach lässt sich Civilization VI herunterladen und starten, als hätte man es direkt bei Steam gekauft.
The page rollover in my book happens right on the ellipsis.
I looked at the code a bit, and I think this is probably caused by the tokenization. Once the text is tokenized, the information about the spaces between words is lost. I'm assuming that's why a hardcoded "space" (it's actually some right padding) is added after every punctuation mark, even where it doesn't make sense, such as for grouped punctuation like quotes and parentheses.
Yes, that is correct.
I'd like to try making a PR for this. Am I right in thinking it will need to be handled at the tokenization level? I'm thinking an extra property will need to be added to all the punctuation tokens to indicate whether they bind to the left or the right.
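To make the idea concrete, here is a minimal sketch of what such a binding property could look like. It is written in Python rather than the project's PHP, and every name in it (Bind, Token, render) is hypothetical and not taken from the codebase. It only shows how a left/right flag could decide where spacing goes.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum
from typing import List


class Bind(Enum):
    NONE = "none"    # ordinary word: spaced on both sides
    LEFT = "left"    # attaches to the previous token, e.g. . , ) or a closing quote
    RIGHT = "right"  # attaches to the next token, e.g. ( or an opening quote


@dataclass
class Token:
    text: str
    bind: Bind = Bind.NONE


def render(tokens: List[Token]) -> str:
    """Join tokens, adding a space only where neither neighbour binds across the gap."""
    out = []
    for i, tok in enumerate(tokens):
        if i > 0:
            prev = tokens[i - 1]
            # No space after a right-binding token or before a left-binding one.
            if prev.bind is not Bind.RIGHT and tok.bind is not Bind.LEFT:
                out.append(" ")
        out.append(tok.text)
    return "".join(out)


# Tokens from the end of the sample sentence, with bindings assigned by hand.
tokens = [
    Token("klickt"), Token("auf"),
    Token('"', Bind.RIGHT), Token("weiter"), Token('"', Bind.LEFT),
    Token(".", Bind.LEFT), Token("Danach"),
]
print(render(tokens))  # klickt auf "weiter". Danach
```

The sketch assumes the binding is already known for each token. Working out whether a straight quote opens or closes a quotation is the hard part and is not solved here.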
I think there is a file named textblock that handles everything related to tokenization on the PHP side. It has a post-processing function.
I think we should not attach symbols to words; each symbol should be its own separate word.
Probably out of scope for this issue, but I think it would make a lot of sense to lift all the conditional, per-language processing out of the processTokenizedWords method. It is likely to get even more unwieldy as post-processing is added for more languages.
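One possible shape for that refactor is a registry of per-language post-processing steps, so the main method only looks up and runs whatever is registered. This is a rough Python sketch, not the project's PHP. The registry, the decorator, and the French rule are all made up for illustration.

```python
from typing import Callable, Dict, List

Tokens = List[str]
PostProcessor = Callable[[Tokens], Tokens]

# Hypothetical registry: language code -> ordered list of post-processing steps.
POST_PROCESSORS: Dict[str, List[PostProcessor]] = {}


def register(lang: str) -> Callable[[PostProcessor], PostProcessor]:
    """Decorator that files a post-processing step under a language code."""
    def decorator(fn: PostProcessor) -> PostProcessor:
        POST_PROCESSORS.setdefault(lang, []).append(fn)
        return fn
    return decorator


@register("fr")
def attach_elision_apostrophe(tokens: Tokens) -> Tokens:
    """Illustrative rule only: glue an apostrophe onto the preceding token."""
    out: Tokens = []
    for tok in tokens:
        if tok == "'" and out:
            out[-1] += tok
        else:
            out.append(tok)
    return out


def post_process(lang: str, tokens: Tokens) -> Tokens:
    """Run every step registered for lang; languages without steps pass through."""
    for step in POST_PROCESSORS.get(lang, []):
        tokens = step(tokens)
    return tokens


print(post_process("fr", ["l", "'", "ami"]))
print(post_process("de", ["auf", "weiter"]))
```

With this layout, adding a language means registering a new function instead of adding another branch to one large method.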
When a quotation rolls over to a new page, the quote marks may be bound to the wrong words. I assume this is because the rendering treats the first quote (") found as the beginning of a new quotation when, in fact, it marks the end of a quotation from the previous page.
Example
You can see in the above snippet that one of the quote marks is bound to auf when it should be bound to weiter instead.
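The suspected mechanism can be reproduced with a small sketch. It assumes, as guessed above, that quote direction is decided by toggling an open/closed state that restarts on every page. This is Python and not the project's actual code; classify_quotes and the page token lists are hypothetical, with the page break placed on the ellipsis as in the report.

```python
from typing import List, Optional, Tuple


def classify_quotes(
    tokens: List[str], inside_quote: bool = False
) -> Tuple[List[Optional[str]], bool]:
    """Label each straight quote 'open' or 'close' by toggling a state flag.

    Returns the labels (None for non-quote tokens) and the final state,
    so the caller can carry it over to the next page.
    """
    labels: List[Optional[str]] = []
    for tok in tokens:
        if tok == '"':
            labels.append("close" if inside_quote else "open")
            inside_quote = not inside_quote
        else:
            labels.append(None)
    return labels, inside_quote


# Shortened from the sample paragraph; the page break falls on the ellipsis.
page1 = ['"', "Ein", "Produkt", "auf", "Steam", "aktivieren", "..."]
page2 = ['"', "und", "klickt", "auf", '"', "weiter", '"', "."]

_, state_after_page1 = classify_quotes(page1)

# Restarting the state on page 2 labels the quote before "weiter" as closing,
# which would bind it to "auf".
naive, _ = classify_quotes(page2)
print([label for label in naive if label])

# Carrying the state over from page 1 gives the expected pairing.
fixed, _ = classify_quotes(page2, inside_quote=state_after_page1)
print([label for label in fixed if label])
```

Carrying the state across pages only works if the quotes in the text are balanced, so it is a possible direction, not a complete fix.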