google / budoux

https://google.github.io/budoux/
Apache License 2.0
1.44k stars 32 forks

JA parser returns not valid json #739

Closed Yuri3K closed 1 month ago

Yuri3K commented 1 month ago

[Screenshots: Screenshot_131, Screenshot_132]

If I run the JSON file through the parser (shown in the screenshot), I get back invalid JSON. The error occurs when a string begins with the character "始", and an error appears in the console (shown in the screenshot). For now, to avoid this error, I add a zero-width space before the "始" character directly in the JSON file. With that change, the parser returns valid JSON.

PS. Thank you very much for your great library!

tushuhei commented 1 month ago

Hi, thanks for reporting! What character are you using to join the returned value of this.parserJA.parse? What happens if you use an empty character instead, i.e. ''? If the issue persists, could you share a minimal data example that reproduces the error? The screenshot below is my attempt to reproduce the error, but it worked fine. [image]

Yuri3K commented 1 month ago

@tushuhei, hello

What character are you using to join the returned value of this.parserJA.parse?
Answer: The character is a zero-width space.

What happens if you use an empty character instead, i.e. ''?
Answer: If I use an empty character, the app starts working, but the library does not insert the zero-width space character between words.

I've attached a zip file with a small Angular app where you can explore the issue: budoux-ja-app-master.zip. You can also find it at https://github.com/Yuri3K/budoux-ja-app

The attached app works because I added a space before the "始" character in the ja.json file. [Screenshot_144]

If you remove the space before the "始" character, you will get an error in the console. [Screenshot_145]

The parser is located in the app.component.ts file. In this file you can also find the zero-width space character that I used. [Screenshot_146]

Thank you for your feedback!

tushuhei commented 1 month ago

Thanks for sharing the context. The root cause is that BudouX may insert a ZWSP at a position that breaks the JSON syntax.
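
As a rough illustration of why the position matters: JSON only permits space, tab, line feed and carriage return between tokens, so a ZWSP (U+200B) is valid inside a string value but not outside one. The sketch below uses made-up sample strings and a hypothetical helper `isValidJson`; it does not reproduce the exact position BudouX chose in the reported file.

```typescript
// U+200B ZERO WIDTH SPACE, the separator used in this thread.
const ZWSP = "\u200b";

// Hypothetical helper: true when `text` parses as JSON, false when JSON.parse throws.
function isValidJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// ZWSP inside a string value is ordinary string content.
const insideValue = `{"greeting": "今日は${ZWSP}天気です"}`;

// ZWSP between tokens is not JSON whitespace, so the parser rejects it.
const betweenTokens = `{"greeting":${ZWSP} "今日は天気です"}`;

console.log(isValidJson(insideValue));   // true
console.log(isValidJson(betweenTokens)); // false
```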

I recommend applying BudouX only to the object's values, not to the entire serialized JSON string. The code would look like:

// Segment each value separately; keys and JSON syntax stay untouched.
const result = Object.fromEntries(
  Object.entries(data).map(([key, value]) => [
    key,
    parser.parse(value).join(String.fromCharCode(0x200b)),
  ])
);
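
The snippet above handles a flat object whose values are all strings. If the translation file is nested, the same idea can be applied recursively. The following is only a minimal sketch under that assumption: `mapStringValues`, the `Json` type, `toySegment` and the sample data are hypothetical names introduced for illustration, and the toy segmenter stands in for the real parser so the example runs on its own.

```typescript
// U+200B ZERO WIDTH SPACE.
const ZWSP = "\u200b";

type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical helper: applies `segment` to every string value and joins the
// pieces with ZWSP. Keys, numbers, booleans and null are left untouched.
function mapStringValues(data: Json, segment: (text: string) => string[]): Json {
  if (typeof data === "string") {
    return segment(data).join(ZWSP);
  }
  if (Array.isArray(data)) {
    return data.map((item) => mapStringValues(item, segment));
  }
  if (data !== null && typeof data === "object") {
    return Object.fromEntries(
      Object.entries(data).map(([key, value]) => [key, mapStringValues(value, segment)]),
    );
  }
  return data;
}

// Stand-in segmenter for illustration only: splits after each "、".
const toySegment = (text: string): string[] => text.split(/(?<=、)/);

const translations: Json = { home: { title: "始めに、読んでください" }, count: 3 };
const segmented = mapStringValues(translations, toySegment);

// Serializing afterwards keeps the JSON valid, because ZWSP only ever
// ends up inside string values.
console.log(JSON.stringify(segmented));
```

In a real app, `segment` would presumably be something like `(text) => parser.parse(text)`, using a parser instance such as the `this.parserJA` mentioned earlier in this thread.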

Yuri3K commented 1 month ago

Thank you very much for the solution!