Open cblanken opened 2 months ago
Hi!
Sorry for not responding for this long, did not have energy to check this out yet, then kind of forgot.
I made some progress on this, but I've only tested it with German so far. It seems to work well, but I was wondering if you had a particular procedure or some sample texts you used to test? I want to make sure my changes didn't break rendering for the other languages.
I ususally using lingua dot com website for languages that it has. I use nhk news easy for Japanese (the button next to the sound button turns off hiragana above the words, then you can copy articles). For other languages I just google language + simple story.
If you want to test it, the current unique languages are Japanese, Chinese and Thai. The rest work mostly the same, although tokenization part is always dependent on the spacy model.
Are "
characters in your improved rendering separate words?
Thanks, I'll do some more testing on the other languages when I get a chance.
Are " characters in your improved rendering separate words?
Yeah, they're still separate words. I didn't change the tokens just some of the metadata for rendering.
Here's an example with them not marked as known.
Okay, I tested a few other languages (Chinese, Japanese, and Russian) and they seem to render fine, but Thai texts are failing to import. Will have to look into it some more.
Thank you, It looks great! I'll test and merge it when I'll have time.
This PR fixes issue #328.
The existing punctuation spacing implementation uses a
spaceAfter
attribute to force a space (right padding) after punctuation from thetokens_with_no_space_before
config array. However, for many punctuation marks this reduces readability since punctuation (like quotation marks) may appear to be bound to the previous word when it should be bound to the next word.For example:
This fix propagates the
is_punct
andwhitespace_
attributes from the spacy tokenization and a boolean (space_before
) for tokens with preceding whitespace. They're all used to more accurately arrange the space surrounding punctuation.@simjanos-dev I made some progress on this, but I've only tested it with German so far. It seems to work well, but I was wondering if you had a particular procedure or some sample texts you used to test? I want to make sure my changes didn't break rendering for the other languages.
New rendering
Previous rendering