jf-tech / omniparser

omniparser: a native Golang ETL streaming parser and transform library for CSV, JSON, XML, EDI, text, etc.
MIT License

Introducing `repetition_delimiter` to EDI schema. #215

Closed jf-tech closed 1 year ago

jf-tech commented 1 year ago

Issue: https://github.com/jf-tech/omniparser/issues/212

repetition_delimiter: the delimiter that separates multiple data instances within an element. For example, if ^ is the repetition delimiter for the segment DMG*D8*19690815*M**A^B^C^D~, then the last element has 4 pieces of data: A, B, C, and D. Any element in which the repetition_delimiter does not appear has exactly one piece of data. Similarly, if ^ is the repetition delimiter for the segment CLM*A37YH556*500***11:B:1^12:B:2~, the last element has 2 pieces of data: 11:B:1 and 12:B:2, each of which is further split by the component_delimiter :. Note that since repetition_delimiter creates multiple pieces of data under the same element name in the schema, in most cases the suitable construct type in transform_declarations is array.
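To make the two levels of splitting concrete, here is a minimal Go sketch. The helper name splitElement is hypothetical and not part of omniparser; it assumes non-empty delimiters and ignores release (escape) characters.

```go
package main

import (
	"fmt"
	"strings"
)

// splitElement breaks one raw element value into its repetitions, and each
// repetition into its components. Hypothetical helper for illustration only:
// it assumes non-empty delimiters and does not handle release characters.
func splitElement(elem, repDelim, compDelim string) [][]string {
	var reps [][]string
	for _, rep := range strings.Split(elem, repDelim) {
		reps = append(reps, strings.Split(rep, compDelim))
	}
	return reps
}

func main() {
	// Last element of DMG*D8*19690815*M**A^B^C^D~
	fmt.Println(splitElement("A^B^C^D", "^", ":")) // [[A] [B] [C] [D]]
	// Last element of CLM*A37YH556*500***11:B:1^12:B:2~
	fmt.Println(splitElement("11:B:1^12:B:2", "^", ":")) // [[11 B 1] [12 B 2]]
}
```

An element without the repetition delimiter comes back as a single repetition, which matches the "one piece of data" case described above.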

Currently NonValidatingReader reads all the elements and their components serially into a slice, []RawSegElem; each entry contains the element value, the element index, and the component index if there is more than one component. With repetition_delimiter added, we follow the same pattern: NonValidatingReader still reads everything into the slice, except that now multiple RawSegElem entries can share the same ElemIndex and CompIndex.

Using the example above: ^ is the rep delim and seg is CLM*A37YH556*500***11:B:1^12:B:2~. After NonValidatingReader.Read() is done, we'll have the following []RawSegElem (simplified):

{
   {'CLM', ElemIndex: 0, CompIndex: 1},
   {'A37YH556', ElemIndex: 1, CompIndex: 1},
   {'500', ElemIndex: 2, CompIndex: 1},
   {'', ElemIndex: 3, CompIndex: 1},
   {'', ElemIndex: 4, CompIndex: 1},
   {'11', ElemIndex: 5, CompIndex: 1},
   {'B', ElemIndex: 5, CompIndex: 2},
   {'1', ElemIndex: 5, CompIndex: 3},
   {'12', ElemIndex: 5, CompIndex: 1},
   {'B', ElemIndex: 5, CompIndex: 2},
   {'2', ElemIndex: 5, CompIndex: 3},
}

Note the last 3 elements have the same ElemIndex and CompIndex as the previous 3 elements. This behavior is new and introduced in this PR.
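The reading pattern can be sketched in Go as follows. This is a simplified illustration, not the actual NonValidatingReader: the rawSegElem type and readSegment function are stand-ins, the segment terminator is assumed to be already stripped, and release characters are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// rawSegElem mirrors the simplified shape used in this description. The field
// names follow the PR text; the type is a stand-in, not omniparser's RawSegElem.
type rawSegElem struct {
	Data      string
	ElemIndex int
	CompIndex int
}

// readSegment is a hypothetical, simplified reader. It walks elements, then
// repetitions, then components, appending one entry per component.
func readSegment(seg, elemDelim, compDelim, repDelim string) []rawSegElem {
	var out []rawSegElem
	for ei, elem := range strings.Split(seg, elemDelim) {
		for _, rep := range strings.Split(elem, repDelim) {
			for ci, comp := range strings.Split(rep, compDelim) {
				// CompIndex is 1-based and restarts at 1 for every repetition,
				// which is why repeated data shares ElemIndex and CompIndex.
				out = append(out, rawSegElem{Data: comp, ElemIndex: ei, CompIndex: ci + 1})
			}
		}
	}
	return out
}

func main() {
	for _, e := range readSegment("CLM*A37YH556*500***11:B:1^12:B:2", "*", ":", "^") {
		fmt.Printf("{%q, ElemIndex: %d, CompIndex: %d}\n", e.Data, e.ElemIndex, e.CompIndex)
	}
}
```

For the CLM segment, the two repetitions of element 5 come out as consecutive runs of entries with ElemIndex 5 and CompIndex 1 through 3.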

Now on the EDI reader side (reader.go): previously, when matching an element decl against the raw element slice, we only did a one-way scan, because ElemIndex and CompIndex were always increasing and we never needed to back-scan. With the introduction of potentially duplicate ElemIndex and CompIndex values, we now simply do a full []RawSegElem scan for each element decl. Yes, it is a bit more expensive, but given that the total number of elements and components in a seg is usually very small (around 20), we feel this trade-off is acceptable and avoids making the already-complex code even more so.
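A rough Go sketch of the full-scan matching idea is shown below. It is not the code in reader.go: elemDecl, rawSegElem, and matchAll are hypothetical stand-ins that only illustrate why a forward-only scan no longer suffices once indexes can repeat.

```go
package main

import "fmt"

// rawSegElem is a stand-in for the raw element entries produced by the reader.
type rawSegElem struct {
	Data      string
	ElemIndex int
	CompIndex int
}

// elemDecl is a hypothetical stand-in for an element declaration in the schema.
type elemDecl struct {
	Name      string
	Index     int
	CompIndex int
}

// matchAll scans the entire slice for one declaration and collects every entry
// whose indexes match. With repetitions, matches for different declarations
// interleave, so a scan that only moves forward across declarations would
// miss the later repetitions.
func matchAll(decl elemDecl, elems []rawSegElem) []string {
	var vals []string
	for _, e := range elems {
		if e.ElemIndex == decl.Index && e.CompIndex == decl.CompIndex {
			vals = append(vals, e.Data)
		}
	}
	return vals
}

func main() {
	// Raw entries for element 5 of CLM*A37YH556*500***11:B:1^12:B:2~
	elems := []rawSegElem{
		{Data: "11", ElemIndex: 5, CompIndex: 1},
		{Data: "B", ElemIndex: 5, CompIndex: 2},
		{Data: "1", ElemIndex: 5, CompIndex: 3},
		{Data: "12", ElemIndex: 5, CompIndex: 1},
		{Data: "B", ElemIndex: 5, CompIndex: 2},
		{Data: "2", ElemIndex: 5, CompIndex: 3},
	}
	fmt.Println(matchAll(elemDecl{Name: "code", Index: 5, CompIndex: 1}, elems))  // [11 12]
	fmt.Println(matchAll(elemDecl{Name: "value", Index: 5, CompIndex: 3}, elems)) // [1 2]
}
```

The cost is one pass over the slice per declaration, which stays cheap for the segment sizes mentioned above.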

With this reader change, the IDR produced can contain child element nodes with the same element name. Thus, when writing a schema, users of the repetition_delimiter feature will in practice need to use the array type in transform_declarations.

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (9e0c8da) 100.00% compared to head (80b2ff2) 100.00%.

Additional details and impacted files

```diff
@@            Coverage Diff            @@
##            master      #215   +/-   ##
=========================================
  Coverage   100.00%   100.00%
=========================================
  Files           53        53
  Lines         3027      3041   +14
=========================================
+ Hits          3027      3041   +14
```

| [Files Changed](https://app.codecov.io/gh/jf-tech/omniparser/pull/215?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JF+Technology) | Coverage Δ | |
|---|---|---|
| [idr/marshal2.go](https://app.codecov.io/gh/jf-tech/omniparser/pull/215?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JF+Technology#diff-aWRyL21hcnNoYWwyLmdv) | `100.00% <ø> (ø)` | |
| [extensions/omniv21/fileformat/edi/reader.go](https://app.codecov.io/gh/jf-tech/omniparser/pull/215?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JF+Technology#diff-ZXh0ZW5zaW9ucy9vbW5pdjIxL2ZpbGVmb3JtYXQvZWRpL3JlYWRlci5nbw==) | `100.00% <100.00%> (ø)` | |
| [extensions/omniv21/fileformat/edi/reader2.go](https://app.codecov.io/gh/jf-tech/omniparser/pull/215?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JF+Technology#diff-ZXh0ZW5zaW9ucy9vbW5pdjIxL2ZpbGVmb3JtYXQvZWRpL3JlYWRlcjIuZ28=) | `100.00% <100.00%> (ø)` | |
| [idr/marshal1.go](https://app.codecov.io/gh/jf-tech/omniparser/pull/215?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=JF+Technology#diff-aWRyL21hcnNoYWwxLmdv) | `100.00% <100.00%> (ø)` | |
