miqwit / dedex

A generic, efficient DDEX parser. Seamlessly parse the complex DDEX format and transform it into classes that are easily usable in your PHP project. Supports several versions (3.8.2, 4.1, 4.1.1) and is listed on the official DDEX page: https://kb.ddex.net/display/HBK/Open+Source+Software
MIT License

xml parse multibyte UTF8 string with invalid result #30

Closed ignacioalles closed 5 months ago

ignacioalles commented 6 months ago

An invalid value is produced when a multibyte character is present in the XML file. Although it seems to be handled in: https://github.com/miqwit/dedex/blob/76e150b35adb7652eddcb95c261b9b920866c095/src/Controller/ErnParserController.php#L450-L454

there is still an issue, because the preceding value, to which the new value is concatenated, was trimmed here: https://github.com/miqwit/dedex/blob/76e150b35adb7652eddcb95c261b9b920866c095/src/Controller/ErnParserController.php#L606

thus removing any whitespace that might be between them.

The current code works fine if the special character is not preceded by whitespace (e.g. Juan García) but produces a wrong value if it is (e.g. Juan Ávila results in JuanÁvila).
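To make the failure mode concrete, here is a minimal sketch in Python rather than PHP, independent of any real XML parser. It assumes the text node reaches the character-data callback in two pieces, split just before the accented character; the exact split points are an assumption for illustration. The function names are hypothetical and are not part of dedex, and the second function is only one possible way to avoid the problem, not necessarily what the library does.

```python
def append_trimming_each_chunk(chunks):
    """Mimics the reported behaviour: the stored value is trimmed after
    every callback, so a trailing space at a chunk boundary is lost."""
    value = ""
    for chunk in chunks:
        value = (value + chunk).strip()
    return value


def trim_after_joining(chunks):
    """Hypothetical alternative: keep the chunks untouched and trim only
    once, after the whole text node has been collected."""
    return "".join(chunks).strip()


# Assumed split points: the parser cuts the text just before the
# multibyte character.
avila = ["Juan ", "Ávila"]      # the boundary falls right after the space
garcia = ["Juan Garc", "ía"]    # the boundary falls inside a word

print(append_trimming_each_chunk(avila))   # JuanÁvila
print(append_trimming_each_chunk(garcia))  # Juan García
print(trim_after_joining(avila))           # Juan Ávila
```

The second case survives only because there is no whitespace at the chunk boundary, which is why most names are parsed correctly.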

miqwit commented 6 months ago

Thank you for raising this issue. It's quite specific. I don't really understand your example with Juan García and Juan Ávila, can you be more explicit? I understand the code is removing important spaces, but I don't get the cases where it works fine, then (I suppose it works fine most of the time for artist names).

Thanks for your additional help.

ignacioalles commented 6 months ago

I've made a branch with a sample (and updated test case) at: https://github.com/miqwit/dedex/compare/master...ignacioalles:dedex:utf8

I can open a pull request if you want, but I don't have a fix for it yet.

ignacioalles commented 6 months ago

Whether the bug appears depends on whether the character that causes xml_parser to split the callbacks is preceded by whitespace. The examples in my previous message were meant to illustrate this: the accented letter Á is the first letter of the second word, so it directly follows a space, while the accented letter í is in the middle of García.

ignacioalles commented 5 months ago

I'm closing the issue, and I can confirm that the released version 2.0.7 solves my case.