Allow EDI file to have something before UNA or UNB. It can happens with files generated by the software SAGE COALA.

pulse-mind commented 1 month ago

fix #75

pulse-mind commented 1 month ago

Thank you for the time spent on this project. @nerdoc Is that possible to accept it ? If yes when do you plan to do it, merge and release ?

nerdoc commented 1 month ago

Ok, this is interesting, and I think it would be definitely a possibility to merge such a PR/idea. But I think there has to be done some "thinkwork" before... I can thin of 2 ways of handling that prepending header:

This header sage coala prepends, what is it? Should the data in it be parsed too in any way? Does it contain relevant data for the message, or is it just a container that "contains" edifact data? To be honest, maybe then it would be out of scope of pydifact to read that data, and a "sage coala" parser must be used, and pydifact could then parse the output of that parser - the edifact payload.
The header can be completely ignored, and sage coala is one of many companies that use a header for their purposes and just transport edifact data as payload. It could be accepted that pydifact loads these formats.

I could think about accepting it because of use case 2, as I myself maybe would have use for that too.

There is one thing I can see which could lead to a problem. If you use message.find("UNA"), this correctly finds 56xyz%<GARBAGE,+Foo'Bla++UNA:+,? 'UNB.... But it also would happily find UNA here: Lorem ipsum dolor sit amet, consetetur sadipscing elitr, UNA sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. and use " sed " as control characters, which is not desireable.

ATM pydifact "identifies" a EDIFACT file using the UNA header, if present, and if not, it uses the default control characters and starts parsing. There is no error handling if you throw a "Lorem ipsum" file onto pydifact.

Let me think about it a few days, but I think I can accept that PR. Thanks for adding a test to it - maybe you could shorten the sample file a bit to speed up the testing, if that is ok for you - I think it is not necessary to use the whole file.

nerdoc commented 1 month ago

If yes when do you plan to do it, merge and release ? Oh, a release could be done shortly after the merge. This shouldn't be the problem.

pulse-mind commented 1 month ago

You're right, UNA can be found in the string where it is not a "UNA". So may be we can look for a regexp that match a normal UNA !? What is the normal format, I can build the regexp... And for the testfile, I can try to reduce it...

nerdoc commented 1 month ago

It's not easy, and maybe impossible. As it would be perfectly valid EDIFACT to use "ABCDEF" as control characters, a file could start with UNAABCDEF. "F" would be the segment separator then, and every segment would be terminatef with "F". So no regexp would help much, the only pattern that comes to my mind: the control characters must not repeat themselves. So UNA123456 is valid, UNA122456 is invalid. This could be done additionally as small check if the UNA Advise Segment is a valid one.

pulse-mind commented 1 month ago

Ok so let's think differently. We want to improve the current behavior.

Before, the code checks if it starts by "UNA". So the goal is to add feature that does not break the actual behavior.

If I look at https://unece.org/fileadmin/DAM/trade/edifact/untdid/d422_s.htm. There is always a ' (segment terminator) at the end. So we can think to two use case :

file is starting by UNA
file is not starting by UNA so we should look for 'UNA And later we can improve that (start by step).

I can improve the code in this way ?

pulse-mind commented 1 month ago

I made another branch with this suggestion that sounds better, this is the commit : https://github.com/pulse-mind/pydifact/commit/303cfd52a8de2ccb5abc3e993c685f293f56e00d

nerdoc commented 1 month ago

That looks better. And It is even better than the current solution. I think this is a useful addition and will maybe solve other "non-strictly" compatibility problems - maybe there are other forms of EDIFACT file implementations that start with another header.

And it doesn't hurt current behaviour.

So yes, please implement this change in this PR, and I will merge it!

nerdoc commented 1 month ago

Just make sure you run "black" at the end, as the current patch is not accepted by the black format CI pre-commit hook! Thanks!

pulse-mind commented 1 month ago

cool @nerdoc . I closed this one and open a new pull request. I run black on the other branch, I hope I did it well. I am not used with this even if I should :D.

(pydifact) ➜  pydifact git:(compatibility_edi_generated_by_sage_coala_v2) python -m black ./
All done! ✨ 🍰 ✨
33 files left unchanged.

nerdocs / pydifact

Allow EDI file to have something before UNA or UNB. It can happens with files generated by the software SAGE COALA. #76