"Basic sanitation" removes all sorts of valid UTF8 characters

php-edifact / edifact

Tools to process EDI messages in UN/EDIFACT format

GNU Lesser General Public License v3.0

272 stars 87 forks source link

"Basic sanitation" removes all sorts of valid UTF8 characters #20

Closed notchriss closed 8 years ago

notchriss commented 8 years ago

Letters like Š are removed by the line $line = preg_replace('/[\x00-\x1F\x80-\xFF]/', '', $line); //basic sanitization, remove non printable chars

sabas commented 8 years ago

Hi @notchriss , these letters are not valid per the official syntax rules (see http://goo.gl/cSFrnv section 4 and 5), although you could use it when you specify it with whom you send/receive the message. Sometimes I saw messages with letters with diacritics (in italian we use àòùìèé and so on) get rejected by my carrier..

We could implement an exception somehow? How? I wouldn't remove the line, at the very least instead of replacing we could transliterate with iconv... (edit: actually iconv requires to know the charset of origin...)

notchriss commented 8 years ago

Hi @sabas, I do not think exception is a good idea, if it's a deviation from the standard.

Although it's hard to believe EDIFACT has no way to send such symbols, seeing how it's a relatively popular standard in the aviation industry, and many people's (i.e. passengers) names from different European countries do contain them. Maybe it's possible to use a different encoding agreed upon by the sender and the receiver? In this particular case I happened to be both, so I can just convert the string before sending and then convert it back to UTF8 after receiving.

sabas commented 8 years ago

If you act as both you can use the standards you want :-) Simply remove the line from the library (or you're using it with composer?)

notchriss commented 8 years ago

I did remove it for now to get it working, but it's not a good long term practice. Nobody knows how long I'll maintain both or either of the servers, or where this code is going to be reused later. Standards are created for a reason, everyone and their aunties having their personal standards is not OK :) Anyway this issue can probably be closed, since apparently it's not an issue with your implementation but with EDIFACT standard itself

yoselibero commented 8 years ago

Hi, i have the same problem. I understand the standard, but in reality, not a single EDI supplier in my project (dozen of them) is encoding the data in accordance with it.. so i get all the non ASCII characters (ěščřžýáí, german umlauts etc.).

What i would like to suggest is to have the possibility to choose if this validation and character removing gets triggered.. (some boolean, setter, default can be as it is).. Because when your parser is deleting the characters, i can't use it without overriding it.. (running from Composer of course).

Again, i understand the background of this, but in my reality, there is not a single one company respecting it. So i would need to override the parse method; and check your updates for the rest of my life :/

Thanks for thinking about it.. If you decide to implement some option, please, make that warning triggering based on that option as well.

Be well. Josef

sabas commented 8 years ago

Hi @jiff88 could you try with the latest version (0.0.5), I added a feature to relax this requirement in Parser.

$p = new Parser();
$p->setStripRegex("/[\x01-\x1F\x80-\xFF]/");
$p->loadString(SOMEEDIFACTSTRING);

where "/[\x01-\x1F\x80-\xFF]/" is the default regex. You can change as you wish. (I moved the NULL char stripping to the first regex)

Let me know if it works! :-)

/cc @notchriss if he's still interested

yoselibero commented 8 years ago

Oh my.. my bad! I saw that thing on 0.0.5., but i thought i should specify the VALID ranges, not the ranges that will be cut away.. and it's actually inverse, so great! I couldn't imagine i write regexp that will allow all the ranges i want, that would be huge nonsense. Thank you.