simjanos-dev / LinguaCafe

LinguaCafe is self-hosted software that helps language learners read foreign languages.
https://simjanos-dev.github.io/LinguaCafeHome/
GNU General Public License v3.0

Quotation marks bound to wrong word #328

Open cblanken opened 2 months ago

cblanken commented 2 months ago

When a quotation spans a page break, the quote marks may be bound to the wrong words. I assume this is because the rendering treats the first quote mark (") found on a page as the beginning of a new quotation when it is in fact closing a quotation that began on the previous page.
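To illustrate the suspected cause, here is a minimal Python sketch of parity-based quote pairing. It is not LinguaCafe's actual rendering code; `classify_quotes`, the `starts_inside_quote` parameter, and the exact position of the page break are assumptions made for illustration.

```python
def classify_quotes(page_text, starts_inside_quote=False):
    """Label each straight double quote on a page as 'open' or 'close'.

    Hypothetical helper: quotes are paired purely by alternation, which is
    the behaviour the issue assumes the renderer has.
    """
    inside = starts_inside_quote
    labels = []
    for ch in page_text:
        if ch == '"':
            labels.append("close" if inside else "open")
            inside = not inside
    return labels, inside


# Assumed page split: the break falls just before the ellipsis.
page1 = 'auf die Schaltfläche "Ein Produkt auf Steam aktivieren '
page2 = '..." und stimmt den Nutzungsbedingungen zu. Dann klickt man auf "weiter".'

# Each page handled in isolation: the first quote on page 2 is taken as an opener.
naive, _ = classify_quotes(page2)

# Carrying the open/closed state across the page break fixes the labels.
_, carry = classify_quotes(page1)
fixed, still_open = classify_quotes(page2, starts_inside_quote=carry)
```

With the state reset on each page, the quote before weiter is labelled as a closing quote and so attaches to auf; carrying the state over from the previous page labels it as an opening quote.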

Example

image

In the snippet above, one of the quote marks is bound to auf when it should be bound to weiter.

simjanos-dev commented 2 months ago

I think " should be a separate word so it does not mess up actual words. I'll check this out. Can you please copy-paste a sentence that I can test this with?

cblanken commented 2 months ago

This is the source paragraph from the example above.

Es ist auch möglich, eine beim Händler gekaufte Disk-Version von Civilization VI mit einem Steam-Konto zu verknüpfen. Dazu klickt man zunächst in der Bibliothek auf die Schaltfläche "Ein Produkt auf Steam aktivieren ..." und stimmt den Nutzungsbedingungen zu. Dann gibt man den Produktschlüssel ein und klickt auf "weiter". Danach lässt sich Civilization VI herunterladen und starten, als hätte man es direkt bei Steam gekauft.

The page rollover in my book happens right on the ellipsis.

cblanken commented 1 month ago

I looked at the code a bit, and I think this is probably caused by the tokenization. Once the text is tokenized, the information about the spaces between words is lost. I assume that is why a hardcoded "space" (actually some right padding) is added after every punctuation mark, even where it doesn't make sense, as with grouped punctuation like quotes and parentheses.
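One way to avoid losing the spacing is to record it on each token while the raw text is still available. The following is a sketch only, written in Python with a simple regex tokenizer; it does not reflect LinguaCafe's real tokenizer, and `tokenize_with_spacing`, `render`, and the `space_after` field are hypothetical names.

```python
import re

# Assumed token pattern: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_with_spacing(text):
    """Split text into tokens and remember whether whitespace followed each one."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        end = match.end()
        space_after = end < len(text) and text[end].isspace()
        tokens.append({"text": match.group(), "space_after": space_after})
    return tokens


def render(tokens):
    """Rebuild the text using the recorded spacing instead of fixed padding."""
    return "".join(t["text"] + (" " if t["space_after"] else "") for t in tokens)


sample = 'klickt auf "weiter". Danach'
tokens = tokenize_with_spacing(sample)
```

With the flag stored per token, the renderer no longer has to guess: for text separated by single spaces, `render(tokens)` reproduces the original string, and the opening quote carries `space_after` set to `False`.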

simjanos-dev commented 1 month ago

Yes, that is correct.

cblanken commented 1 month ago

I'd like to try making a PR for this. Am I right in thinking it will need to be handled at the tokenization level? I'm thinking an extra property will need to be added to all the punctuation tokens to indicate whether they bind to the left or the right.
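A rough sketch of what such a property could look like, in Python for illustration. The `bind` field, the `tokenize_with_binding` name, and the whitespace-based rule are all assumptions, and the sketch assumes the binding is computed on the whole text before it is split into pages.

```python
import re

# Assumed token pattern: a run of word characters, or one punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_with_binding(text):
    """Tag each punctuation token with the side it should attach to.

    Heuristic (an assumption, not the project's rule): punctuation that has
    whitespace before it and none after it binds to the right; all other
    punctuation binds to the left. Words get bind=None.
    """
    tokens = []
    for match in TOKEN_RE.finditer(text):
        piece = match.group()
        bind = None
        if not re.fullmatch(r"\w+", piece):
            start, end = match.start(), match.end()
            space_before = start == 0 or text[start - 1].isspace()
            space_after = end == len(text) or text[end].isspace()
            bind = "right" if space_before and not space_after else "left"
        tokens.append({"text": piece, "bind": bind})
    return tokens


tagged = tokenize_with_binding('klickt auf "weiter". Danach')
pairs = [(t["text"], t["bind"]) for t in tagged]
```

Here the quote before weiter is tagged `right` and the quote after it `left`, so neither can end up attached to auf. Sequences such as an ellipsis preceded by a space would need additional rules beyond this heuristic.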

simjanos-dev commented 1 month ago

I think there is a file named textblock that handles everything related to tokenization on the PHP side. It has a post-processing function.

I think we should not attach symbols to words; they should be separate words of their own.

cblanken commented 1 month ago

Probably out of scope for this issue, but I think it would make a lot of sense for all the conditional, per-language processing to be lifted out of the processTokenizedWords method. It's likely to get even more unwieldy as post-processing is added for more languages.
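As a sketch of the idea, the per-language branches could move into a registry of handlers keyed by language. The example is in Python for brevity, while the real method lives on the PHP side; every name other than the processTokenizedWords method it mirrors (`POST_PROCESSORS`, `register`, `process_german`) is hypothetical, and the German rule is a placeholder.

```python
from typing import Callable, Dict, List

Token = dict
PostProcessor = Callable[[List[Token]], List[Token]]

# Hypothetical registry: one post-processing function per language.
POST_PROCESSORS: Dict[str, PostProcessor] = {}


def register(language: str):
    """Decorator that registers a post-processor for the given language."""
    def decorator(fn: PostProcessor) -> PostProcessor:
        POST_PROCESSORS[language] = fn
        return fn
    return decorator


@register("german")
def process_german(tokens: List[Token]) -> List[Token]:
    # Placeholder rule for illustration only: mark each token as processed.
    return [dict(t, language="german") for t in tokens]


def process_tokenized_words(tokens: List[Token], language: str) -> List[Token]:
    """Dispatch to the language's handler; pass tokens through if none exists."""
    handler = POST_PROCESSORS.get(language)
    return handler(tokens) if handler else tokens


german = process_tokenized_words([{"text": "weiter"}], "german")
other = process_tokenized_words([{"text": "hello"}], "english")
```

Adding a language then means registering one more function rather than adding another conditional to a shared method.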