Closed pulse-mind closed 1 month ago
Thank you for the time spent on this project. @nerdoc Is that possible to accept it ? If yes when do you plan to do it, merge and release ?
Ok, this is interesting, and I think it would be definitely a possibility to merge such a PR/idea. But I think there has to be done some "thinkwork" before... I can thin of 2 ways of handling that prepending header:
I could think about accepting it because of use case 2, as I myself maybe would have use for that too.
There is one thing I can see which could lead to a problem. If you use message.find("UNA"), this correctly finds 56xyz%<GARBAGE,+Foo'Bla++UNA:+,? 'UNB...
.
But it also would happily find UNA here:
Lorem ipsum dolor sit amet, consetetur sadipscing elitr, UNA sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.
and use " sed " as control characters, which is not desireable.
ATM pydifact "identifies" a EDIFACT file using the UNA header, if present, and if not, it uses the default control characters and starts parsing. There is no error handling if you throw a "Lorem ipsum" file onto pydifact.
Let me think about it a few days, but I think I can accept that PR. Thanks for adding a test to it - maybe you could shorten the sample file a bit to speed up the testing, if that is ok for you - I think it is not necessary to use the whole file.
If yes when do you plan to do it, merge and release ? Oh, a release could be done shortly after the merge. This shouldn't be the problem.
You're right, UNA can be found in the string where it is not a "UNA". So may be we can look for a regexp that match a normal UNA !? What is the normal format, I can build the regexp... And for the testfile, I can try to reduce it...
It's not easy, and maybe impossible. As it would be perfectly valid EDIFACT to use "ABCDEF" as control characters, a file could start with UNAABCDEF
. "F" would be the segment separator then, and every segment would be terminatef with "F". So no regexp would help much, the only pattern that comes to my mind: the control characters must not repeat themselves. So UNA123456
is valid, UNA122456
is invalid.
This could be done additionally as small check if the UNA Advise Segment is a valid one.
Ok so let's think differently. We want to improve the current behavior.
Before, the code checks if it starts by "UNA". So the goal is to add feature that does not break the actual behavior.
If I look at https://unece.org/fileadmin/DAM/trade/edifact/untdid/d422_s.htm. There is always a ' (segment terminator) at the end. So we can think to two use case :
I can improve the code in this way ?
I made another branch with this suggestion that sounds better, this is the commit : https://github.com/pulse-mind/pydifact/commit/303cfd52a8de2ccb5abc3e993c685f293f56e00d
That looks better. And It is even better than the current solution. I think this is a useful addition and will maybe solve other "non-strictly" compatibility problems - maybe there are other forms of EDIFACT file implementations that start with another header.
And it doesn't hurt current behaviour.
So yes, please implement this change in this PR, and I will merge it!
Just make sure you run "black" at the end, as the current patch is not accepted by the black format CI pre-commit hook! Thanks!
cool @nerdoc . I closed this one and open a new pull request. I run black on the other branch, I hope I did it well. I am not used with this even if I should :D.
(pydifact) ➜ pydifact git:(compatibility_edi_generated_by_sage_coala_v2) python -m black ./
All done! ✨ 🍰 ✨
33 files left unchanged.
fix #75