php-edifact / edifact

Tools to process EDI messages in UN/EDIFACT format
GNU Lesser General Public License v3.0

Describe validation regex for standard character sets #25

sabas closed this issue 7 years ago

sabas commented 8 years ago

https://www.stylusstudio.com/edifact/40003/0001.htm

For EDIFACT documents of syntax version UNOA, characters A-Z, 0-9, blank and . , ( ) / - = are allowed. For syntax version UNOB, characters a-z, A-Z, 0-9, blank and . , ( ) / - = : + ` ? are allowed. All other EDIFACT syntaxes are linked to the standard ISO character sets. [https://msdn.microsoft.com/en-us/library/aa559562(v=bts.20).aspx]
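As a rough illustration of what a validation regex for these two sets could look like, here is a minimal sketch. The function name `isValidForSyntax` is hypothetical, the character classes are transcribed from the description quoted above (which may be narrower than the full ISO 9735 lists), and the backtick in the UNOB list is assumed to stand for the apostrophe.

```php
<?php
// Hypothetical helper, not part of php-edifact.
// Character classes follow the lists quoted in this thread; the backtick in
// the UNOB list is ASSUMED to be a garbled apostrophe.
function isValidForSyntax(string $value, string $syntaxId): bool
{
    $patterns = [
        'UNOA' => '~\A[A-Z0-9 .,()/=-]*\z~',
        'UNOB' => '~\A[a-zA-Z0-9 .,()/=:+\'?-]*\z~',
    ];
    if (!isset($patterns[$syntaxId])) {
        throw new InvalidArgumentException("No pattern defined for $syntaxId");
    }
    // preg_match returns 1 when the whole value consists of allowed characters.
    return preg_match($patterns[$syntaxId], $value) === 1;
}

var_dump(isValidForSyntax('HELLO WORLD 42', 'UNOA')); // upper case only
var_dump(isValidForSyntax('Hello', 'UNOA'));          // contains lower case
var_dump(isValidForSyntax('Hello', 'UNOB'));          // lower case allowed
```

Note that this checks which characters are allowed; the default strip regex discussed later in the thread works the other way round and lists the bytes to remove.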

UNOB can use these separators. The Information Separator control characters are used as follows:

IS 4, hex value '1C': segment terminator
IS 3, hex value '1D': data element separator
IS 1, hex value '1F': component data element separator

UNOC to UNOK use ISO-8859-*
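For reference, the mapping from syntax identifier to ISO 8859 part can be written down as a lookup table. The sketch below is an illustration only: the constant and the `toUtf8` helper are hypothetical names, and it assumes PHP's `iconv` extension is available.

```php
<?php
// Syntax identifier => character set, following the UN/EDIFACT syntax
// identifier code list (data element 0001).
const EDIFACT_ISO_8859 = [
    'UNOC' => 'ISO-8859-1',
    'UNOD' => 'ISO-8859-2',
    'UNOE' => 'ISO-8859-5',
    'UNOF' => 'ISO-8859-7',
    'UNOG' => 'ISO-8859-3',
    'UNOH' => 'ISO-8859-4',
    'UNOI' => 'ISO-8859-6',
    'UNOJ' => 'ISO-8859-8',
    'UNOK' => 'ISO-8859-9',
];

// Hypothetical helper: convert message bytes to UTF-8 for further processing.
function toUtf8(string $data, string $syntaxId): string
{
    if (!isset(EDIFACT_ISO_8859[$syntaxId])) {
        throw new InvalidArgumentException("No ISO 8859 mapping for $syntaxId");
    }
    $converted = iconv(EDIFACT_ISO_8859[$syntaxId], 'UTF-8', $data);
    if ($converted === false) {
        throw new RuntimeException("Conversion from $syntaxId failed");
    }
    return $converted;
}
```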

homer8173 commented 7 years ago

Hello, I just tried to send a UNOC message with accented characters ('éà') and the Parser deletes them. Am I doing something wrong, or do I need to improve your Parser?

sabas commented 7 years ago

You need to override the stripping regex like this.

$p = new Parser();
$p->setStripRegex("/[\x01-\x1F\x80-\xFF]/");
$p->loadString(SOMEEDIFACTSTRING);

(This regex removes all characters within those hex ranges. If you find a regex that suits UNOC, please share :-) )
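To make the effect of the strip regex concrete, here is a minimal sketch that uses plain `preg_replace` instead of the Parser. It assumes the input is ISO-8859-1 bytes (as in UNOC). The second pattern is only an illustrative guess at a UNOC-friendly variant, not something the library ships.

```php
<?php
// "café à Paris" as ISO-8859-1 bytes: é = 0xE9, à = 0xE0.
$segment = "FTX+AAI+++caf\xE9 \xE0 Paris'";

// Pattern from the comment above: strips control bytes and every byte >= 0x80.
$strictPattern = '/[\x01-\x1F\x80-\xFF]/';

// ASSUMED UNOC-friendly variant: strips only the C0 and C1 control ranges,
// keeping the printable ISO-8859-1 characters 0xA0-0xFF.
$unocPattern = '/[\x01-\x1F\x80-\x9F]/';

$strict = preg_replace($strictPattern, '', $segment);
$unoc   = preg_replace($unocPattern, '', $segment);

var_dump(strlen($segment) - strlen($strict)); // number of bytes removed
var_dump($unoc === $segment);                 // accents survive
```

Both patterns also remove the IS 1, IS 3 and IS 4 separators mentioned earlier, so a message that actually uses them would need a narrower range.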

homer8173 commented 7 years ago

Thanks for your answer. I discovered that I have to use setStripRegex(). I was wondering whether it would be a good idea to read the UNB segment, get the UNOx identifier, and adapt the regexp automatically, keeping the current default for the character sets we don't know.

Do you mind if I modify your Parser this way?
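A minimal sketch of that idea, written as standalone functions rather than as a change to the Parser: both function names are hypothetical, the detection assumes the default `+` and `:` separators, and the UNOC pattern is an assumption rather than an agreed-upon regex.

```php
<?php
// Hypothetical helper: extract the syntax identifier (UNOA, UNOB, ...) from
// the UNB segment. ASSUMES default separators ('+' and ':').
function detectSyntaxId(string $message): ?string
{
    // UNB is either the first segment or follows the terminator of UNA.
    if (preg_match('/(?:^|\')\s*UNB\+(UNO[A-Z])[:+]/', $message, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Hypothetical helper: pick a strip regex, falling back to the current
// default for identifiers we don't know.
function stripRegexFor(?string $syntaxId): string
{
    $default = '/[\x01-\x1F\x80-\xFF]/';
    $known = [
        'UNOC' => '/[\x01-\x1F\x80-\x9F]/', // assumption: keep 0xA0-0xFF
    ];
    if ($syntaxId !== null && isset($known[$syntaxId])) {
        return $known[$syntaxId];
    }
    return $default;
}

$message = "UNA:+.? 'UNB+UNOC:3+SENDER+RECEIVER+170101:1200+1'";
var_dump(detectSyntaxId($message));
var_dump(stripRegexFor(detectSyntaxId($message)));
```

The result of `stripRegexFor()` could then be passed to `setStripRegex()` before loading the message.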

sabas commented 7 years ago

Please go ahead! By default it should go with UNOA; if the message declares a different syntax, it should override the default by calling setStripRegex.

homer8173 commented 7 years ago

Thanks. My only concern is that the Parser will have to read the content, and we have a Reader made for that. I've seen that your Parser already adapts to the UNA segment, so I will use a similar method for UNB.

About this UNA segment: you've chosen to delete it from $parsedfile. So if you parse a file and re-encode it with the Encoder, you end up modifying the original EDI file. That was one of my first tests of your solution, and it failed.

sabas commented 7 years ago

The UNA part was added by @Azzurvif

Perhaps we could save the UNA array (if existing) in a private variable and add a public get method, so when encoding back you can reuse it?
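A minimal sketch of what that could look like, using a standalone class so it does not depend on the Parser internals. The class and method names are hypothetical; the only fact relied on is that a UNA segment is the string `UNA` followed by six service characters.

```php
<?php
// Hypothetical class for illustration; these names are not php-edifact's API.
final class UnaHolder
{
    /** @var array<string,string>|null Service characters, or null if no UNA. */
    private $una = null;

    public function load(string $message): void
    {
        $this->una = null;
        // UNA is "UNA" plus exactly six service characters.
        if (strncmp($message, 'UNA', 3) === 0 && strlen($message) >= 9) {
            $this->una = [
                'component_separator'    => $message[3],
                'data_element_separator' => $message[4],
                'decimal_mark'           => $message[5],
                'release_character'      => $message[6],
                'reserved'               => $message[7],
                'segment_terminator'     => $message[8],
            ];
        }
    }

    public function getUna(): ?array
    {
        return $this->una;
    }

    // Rebuild the original UNA segment so it can be written back when encoding.
    public function getUnaSegment(): ?string
    {
        return $this->una === null ? null : 'UNA' . implode('', $this->una);
    }
}

$holder = new UnaHolder();
$holder->load("UNA:+.? 'UNB+UNOA:1+SENDER+RECEIVER'");
var_dump($holder->getUnaSegment());
```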

homer8173 commented 7 years ago

For UNA, that could be a solution, but in my project the Parser and the Encoder are far away from each other.

Maybe it's the Encoder's job to set its default delimiter characters and to add a UNA segment according to its settings. That way it mirrors the Parser. For backward compatibility, the Encoder may not add UNA by default.

homer8173 commented 7 years ago

According to this spec https://sandroaspbiztalkblog.wordpress.com/2009/08/15/edifact-encoding-edi-character-set-support/ UNOA encoding is more restrictive than what you set by default. Your regexp \x01-\x1F\x80-\xFF appears to correspond to UNOB rather than UNOA.

I will set UNOB as the default mode for backward compatibility. However, for those who receive a UNOA message containing lowercase characters, enforcing strict UNOA would be a breaking change, since lowercase characters are not UNOA compliant. What do you want to do? Do you want to stick to the standard or preserve the current behaviour for users?

one more info about UNOA : http://myedinotes.blogspot.fr/2012/05/unoa-character-set.html

sabas commented 7 years ago

Currently the Encoder simply encodes with the standard delimiters (they are hardcoded). It would need the same refactoring as the Parser, so that if one wants to encode with non-standard characters the code would be:

$c = new Encoder();
$c->setUNA(XXXXX);
$c->encode($array, $wrap);
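To illustrate what encoding with configurable delimiters involves, here is a minimal sketch of a single-segment encoder. The function `encodeSegment` is hypothetical and not the library's Encoder. It assumes a segment is an array whose items are either strings or arrays of component strings, and that the delimiters are passed as the six UNA service characters.

```php
<?php
// Hypothetical helper, not php-edifact's Encoder.
// $una holds the six UNA service characters: component separator, data
// element separator, decimal mark, release character, reserved, terminator.
function encodeSegment(array $segment, string $una = ":+.? '"): string
{
    [$component, $element, , $release, , $terminator] = str_split($una);

    // Prefix every service character inside a value with the release character.
    $class = preg_quote($release . $component . $element . $terminator, '/');
    $escape = static function (string $value) use ($class, $release): string {
        return preg_replace_callback(
            '/[' . $class . ']/',
            static function (array $m) use ($release): string {
                return $release . $m[0];
            },
            $value
        );
    };

    $parts = [];
    foreach ($segment as $item) {
        $parts[] = is_array($item)
            ? implode($component, array_map($escape, $item))
            : $escape((string) $item);
    }
    return implode($element, $parts) . $terminator;
}

var_dump(encodeSegment(['UNB', ['UNOC', '3'], 'SENDER']));
var_dump(encodeSegment(['NAD', ['A', 'B']], "|#.? ~"));
```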

I agree to standardize on UNOB as the default, so as not to break compatibility.

homer8173 commented 7 years ago

I'll let you check, but I think I have implemented everything we talked about. Have a nice week.