simjanos-dev / LinguaCafe

LinguaCafe is a self-hosted software that helps language learners read foreign languages.
https://simjanos-dev.github.io/LinguaCafeHome/
GNU General Public License v3.0
888 stars 32 forks source link

fix: bind punctuation to following word when needed #361

Open cblanken opened 2 months ago

cblanken commented 2 months ago

This PR fixes issue #328.

The existing punctuation spacing implementation uses a spaceAfter attribute to force a space (right padding) after punctuation from the tokens_with_no_space_before config array. However, for many punctuation marks this reduces readability since punctuation (like quotation marks) may appear to be bound to the previous word when it should be bound to the next word.

For example:

Tommy said, "Hello there". would be incorrectly rendered like this Tommy said," Hello there".

This fix propagates the is_punct and whitespace_ attributes from the spacy tokenization and a boolean (space_before) for tokens with preceding whitespace. They're all used to more accurately arrange the space surrounding punctuation.

@simjanos-dev I made some progress on this, but I've only tested it with German so far. It seems to work well, but I was wondering if you had a particular procedure or some sample texts you used to test? I want to make sure my changes didn't break rendering for the other languages.

New rendering

image

Previous rendering

image

simjanos-dev commented 1 month ago

Hi!

Sorry for not responding for this long, did not have energy to check this out yet, then kind of forgot.

I made some progress on this, but I've only tested it with German so far. It seems to work well, but I was wondering if you had a particular procedure or some sample texts you used to test? I want to make sure my changes didn't break rendering for the other languages.

I ususally using lingua dot com website for languages that it has. I use nhk news easy for Japanese (the button next to the sound button turns off hiragana above the words, then you can copy articles). For other languages I just google language + simple story.

If you want to test it, the current unique languages are Japanese, Chinese and Thai. The rest work mostly the same, although tokenization part is always dependent on the spacy model.

Are " characters in your improved rendering separate words?

cblanken commented 1 month ago

Thanks, I'll do some more testing on the other languages when I get a chance.

Are " characters in your improved rendering separate words?

Yeah, they're still separate words. I didn't change the tokens just some of the metadata for rendering.

image

Here's an example with them not marked as known.

cblanken commented 1 month ago

Okay, I tested a few other languages (Chinese, Japanese, and Russian) and they seem to render fine, but Thai texts are failing to import. Will have to look into it some more.

simjanos-dev commented 1 month ago

Thank you, It looks great! I'll test and merge it when I'll have time.