imazen / folioxml

Folio Flat File to XML/HTML/Lucene conversion framework
Apache License 2.0

Overlapping Fields #10

Open Darryl-CogniLore opened 9 years ago

Darryl-CogniLore commented 9 years ago

Hi, thanks for making this tool available! We have been testing this out and came across the following.

I have overlapping fields in my source flat file. Using the folioxml tool, the output is well-formed, but the applied field boundaries are not the same as they were in the source.

Thanks for your time.

Darryl

example source fff:

  <RD>
  <FD:"field 1"><FD:"field 2">field 1 and field 2 </FD:"field 1"> field 1</FD:"field 2">
  <FD:"field 1"> field 1 <FD:"field 2"> field 1 and field 2</FD:"field 1"> field 2</FD:"field 2">

output (xml):

  <record>
  <span class="field1" type="field 1"><span class="field2" type="field 2">field 1 and field 2 field 1</span>
  <span class="field1" type="field 1"> field 1 <span class="field2" type="field 2">field 1 and field 2 field 2</span>
  </span> </span>
  </record>
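
For comparison, here is a minimal sketch of the boundaries one might expect when overlapping fields are turned into well-formed nested spans. It is not folioxml's implementation and uses none of its classes; `OverlapSketch` and `toNestedSpans` are hypothetical names, only `<FD:"...">` tags are recognised, and the `class` attribute is left out. The idea is to close any inner spans when an outer field ends, then reopen them so their text keeps its field.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical helper, not part of folioxml.
final class OverlapSketch {
    // Matches <FD:"name"> and </FD:"name">; everything else is copied as text.
    private static final Pattern FIELD_TAG = Pattern.compile("<(/?)FD:\"([^\"]+)\">");

    public static String toNestedSpans(String fff) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // innermost open field is at the head
        Matcher m = FIELD_TAG.matcher(fff);
        int pos = 0;
        while (m.find()) {
            out.append(fff, pos, m.start()); // text before this tag
            pos = m.end();
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(name);
                out.append(openTag(name));
            } else if (open.contains(name)) {
                // Close every field opened after 'name', innermost first.
                List<String> reopen = new ArrayList<>();
                while (!open.peek().equals(name)) {
                    reopen.add(open.pop());
                    out.append("</span>");
                }
                open.pop();
                out.append("</span>");
                // Reopen the interrupted fields in their original order.
                for (int i = reopen.size() - 1; i >= 0; i--) {
                    open.push(reopen.get(i));
                    out.append(openTag(reopen.get(i)));
                }
            }
            // A closer with no matching opener is silently skipped in this sketch.
        }
        out.append(fff, pos, fff.length());
        return out.toString();
    }

    private static String openTag(String name) {
        return "<span type=\"" + name + "\">";
    }

    public static void main(String[] args) {
        // First line of the example source, written with ' and converted to ".
        String fff = ("<FD:'field 1'><FD:'field 2'>field 1 and field 2 "
                + "</FD:'field 1'> field 1</FD:'field 2'>").replace('\'', '"');
        System.out.println(toNestedSpans(fff));
    }
}
```

For the first source line this prints `<span type="field 1"><span type="field 2">field 1 and field 2 </span></span><span type="field 2"> field 1</span>`, so the trailing text stays inside "field 2" only, which is the boundary the output above loses.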

lilith commented 9 years ago

Hi Darryl,

This is very interesting - a case our unit tests must have missed. I'd also be interested in the SLX output to see where the error is happening.

Is this a problem your company would need a particularly prompt resolution for? We do offer paid services.

Darryl-CogniLore commented 9 years ago

Nathanael,

Here is what is in the .slx. (note that I have been adding whitespace for easier reading)

  <record>
  <p>
  <span class="field1" type="field 1"><span class="field2" type="field 2">field 1 and field 2 field 1</span type="field 2">
  <span class="field1" type="field 1">field 1 <span class="field2" type="field 2">field 1 and field 2 field 2</span type="field 2">
  </p>
  </span type="field 1"></span type="field 1">
  </record>

And these 2 log entries were seen during the conversion process:

  Dropping orphaned closing ghost tag { </span type="field 1"> : </FD:"field 1"> Line x col y in file }
  Dropping orphaned closing ghost tag { </span type="field 1"> : </FD:"field 1"> Line x col y in file }
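
One way to see where the pairing goes wrong in the SLX above is to walk its tags with a stack and report any closer that arrives while something else is still innermost. The sketch below is a standalone diagnostic, not folioxml code; `SlxCrossingCheck` and `findCrossings` are hypothetical names, tags are keyed by element name plus the `type` attribute when present, and self-closing tags are not handled.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a hypothetical diagnostic, not part of folioxml.
final class SlxCrossingCheck {
    private static final Pattern TAG = Pattern.compile("<(/?)(\\w+)([^>]*)>");
    private static final Pattern TYPE = Pattern.compile("type=\"([^\"]*)\"");

    public static List<String> findCrossings(String slx) {
        List<String> problems = new ArrayList<>();
        Deque<String> open = new ArrayDeque<>(); // innermost open tag is at the head
        Matcher m = TAG.matcher(slx);
        while (m.find()) {
            // Key is "span[field 1]" when a type attribute exists, else just "p".
            Matcher t = TYPE.matcher(m.group(3));
            String key = t.find() ? m.group(2) + "[" + t.group(1) + "]" : m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(key);
            } else if (key.equals(open.peek())) {
                open.pop(); // properly nested closer
            } else if (open.removeFirstOccurrence(key)) {
                problems.add("</" + key + "> closes while " + open.peek() + " is still open");
            } else {
                problems.add("</" + key + "> has no matching opener");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // The SLX from this comment, written with ' and converted to ".
        String slx = ("<record><p>"
                + "<span class='field1' type='field 1'><span class='field2' type='field 2'>"
                + "field 1 and field 2 field 1</span type='field 2'>"
                + "<span class='field1' type='field 1'>field 1 <span class='field2' type='field 2'>"
                + "field 1 and field 2 field 2</span type='field 2'>"
                + "</p></span type='field 1'></span type='field 1'></record>").replace('\'', '"');
        findCrossings(slx).forEach(System.out::println);
    }
}
```

On the SLX above this reports a single problem, `</p> closes while span[field 1] is still open`: both "field 1" closers sit after `</p>`, which would be consistent with the two dropped "field 1" ghost tags in the log.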

Darryl-CogniLore commented 9 years ago

Nathanael, this isn't an urgent issue at this time; however, if we are to continue using this tool, we would need a resolution. When we get to that point, we will consider taking a crack at it ourselves in case you haven't already.

Cheers

lilith commented 9 years ago

Quite strange. This shouldn't happen unless field2 is somehow classified as a 'context' element. This is the full XML, no tags or elements removed? If you can set up a unit test and send a PR that reproduces this, I can try to take a look at it the following week.

lilith commented 9 years ago

One note, for clarity, the last word should be "field 2", not "field 1".

  <FD:"field 1"><FD:"field 2">field 1 and field 2 </FD:"field 1"> field 1</FD:"field 2">

lilith commented 9 years ago

I've tried this exact set of fields and can't reproduce the problem on the develop branch. Which commit did you use to create this result?

https://github.com/imazen/folioxml/commit/e5e33e6a9358f886c13641e156d605e3f5e59266

Darryl-CogniLore commented 9 years ago

Sorry Nathanael,

Regarding your message from 3 days ago: yes, you are correct, the last word should be "field 2". What did the .slx look like from your testing?

We used the latest on the master branch at the time, early May 2015. I apologize, the following information would probably have been helpful earlier: we used the IKVM compiler to build a .NET assembly from the Java bytecode and now run the .NET assembly to transform FFF. We did not include the overlapping field scenario in our initial testing with the Java source and jars.