org-tigris-jsapar / jsapar

JSaPar is a Java library providing a schema based parser and composer of almost all sorts of delimited (CSV) and fixed width files.
Apache License 2.0

Parse result is wrong when there is a comma or quote inside double quotes #8

Closed trietbui85 closed 5 years ago

trietbui85 commented 5 years ago

I noticed that when a field enclosed in double quotes (" ") contains a comma (,), that field is parsed incorrectly.

Here is the schema:

<?xml version="1.0" encoding="UTF-8"?>
<schema xmlns="http://jsapar.tigris.org/JSaParSchema/2.0">
  <csvschema lineseparator="\n">
    <line occurs="*" linetype="Person" cellseparator="," firstlineasschema="true">
      <cell name="Manufacturer" quotebehavior="ALWAYS" />
      <cell name="Model Name" />
      <cell name="Model Code" />
      <cell name="RAM (TotalMem)" />
      <cell name="Form Factor" />
      <cell name="System on Chip" />
      <cell name="Screen Sizes" />
      <cell name="Screen Densities" />
      <cell name="ABIs" />
      <cell name="Android SDK Versions" />
      <cell name="OpenGL ES Versions" />
    </line>
  </csvschema>
</schema>

And CSV input data:

Manufacturer,Model Name,Model Code,RAM (TotalMem),Form Factor,System on Chip,Screen Sizes,Screen Densities,ABIs,Android SDK Versions,OpenGL ES Versions
"Hon Hai Precision Industry Co., Ltd.",Germany,S9714,973MB,Tablet,NVidia Tegra 3 T30,1280x752,160,armeabi-v7a;armeabi,15;16,2.0
10.or,10or_G2,G2,3749MB,Phone,Qualcomm SDM636,1080x2246,480,arm64-v8a;armeabi-v7a;armeabi,27,3.2
zyrex,ZT 216,ZT_216, ,Phone,Spreadtrum SC9832A,600x1024,213,armeabi-v7a;armeabi,24,2.0
Acer,A3-A40,acer_jetfirefhd,1900-1941MB,Tablet,Mediatek MT8163,1920x1200,240,arm64-v8a;armeabi-v7a;armeabi,23,3.0

In the sample data above, "Hon Hai Precision Industry Co., Ltd." is parsed into 2 fields: Hon Hai Precision Industry Co and Ltd.". Is there anything wrong with my schema, and how can I fix it?

stenix71 commented 5 years ago

You need to define the quote character to use for the line type, like this:

quotechar="&quot;"

So the whole line declaration should be like this:

<line occurs="*" linetype="Person" cellseparator="," firstlineasschema="true" quotechar="&quot;">

Please let me know if this solved your issue.

The quote behavior that you specified on cell level only controls the behavior while composing. It does not define which quote character to use while parsing.

Another thing: when you use firstlineasschema, you don't necessarily need to specify each cell that may occur if they should only be parsed as strings. Doing so does no harm, but the cell names and order will be fetched from the first line anyway. Only cells that have specific characteristics (such as a different data type or quote behavior) need to be specified for the line type.

trietbui85 commented 5 years ago

Thanks @stenix71, it really works. If you don't mind, I have another issue with double quotes that I need your help with. With the schema above, I try to parse the following data:

Manufacturer,Model Name,Model Code,RAM (TotalMem),Form Factor,System on Chip,Screen Sizes,Screen Densities,ABIs,Android SDK Versions,OpenGL ES Versions
"Hon Hai Precision Industry Co., Ltd.",Germany,S9714,973MB,Tablet,NVidia Tegra 3 T30,1280x752,160,armeabi-v7a;armeabi,15;16,2.0
10.or,10or_G2,G2,3749MB,Phone,Qualcomm SDM636,1080x2246,480,arm64-v8a;armeabi-v7a;armeabi,27,3.2
zyrex,ZT 216,ZT_216, ,Phone,Spreadtrum SC9832A,600x1024,213,armeabi-v7a;armeabi,24,2.0
Acer,A3-A40,acer_jetfirefhd,1900-1941MB,Tablet,Mediatek MT8163,1920x1200,240,arm64-v8a;armeabi-v7a;armeabi,23,3.0
Asus,"Commercial tablet 8"" (M800M)",P00A_M,1954MB,Tablet,Mediatek MT8163,800x1280,213,arm64-v8a;armeabi-v7a;armeabi,23,3.0
Insignia,"Flex 8""",ns_15at08,1024MB,Tablet,Rockchip RK3188T,1024x768,160,armeabi-v7a;armeabi,19,2.0
Samsung,"Galaxy Tab A (8.0"", 2019)",gto,1846MB,Tablet,Qualcomm SDM429,800x1280,213,arm64-v8a;armeabi-v7a;armeabi,28,3.2

According to Guide > Basic Schema > Parsing quoted values:

As long as you have activated quoting as described above, the parser will automatically detect if a cell is quoted or not. Not all cells needs quoting. A cell is considered to be quoted if and only if the first and the last character of the cell is the quote character. The quote characters will always be removed from the parsed value ... "aaa","b""bb","ccc"... will be parsed to b""bb

But in my result, the second field is Galaxy Tab A (8.0" and the third field is 2019, while the expected value is Galaxy Tab A (8.0", 2019). The same issue happens with Commercial tablet 8" (M800M) and Flex 8".

Did I misconfigure something?

stenix71 commented 5 years ago

This one is trickier though. Since there is a comma directly after the second quote, the parser will consider this the end of the cell, even though that quote is the second one of a doubled pair. I can look into whether this can be solved in a future release, but the current version does not allow this. There are so many variants of CSV files and of how to parse quotes.
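To make the behavior described above concrete, here is a minimal Java sketch of a splitter that closes a quoted cell at the first quote directly followed by the separator (or by the end of the line). It is not JSaPar's actual code; the class name `NaiveQuoteSplit` and its `split` method are hypothetical and only illustrate the rule as described in this thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not taken from the JSaPar sources.
class NaiveQuoteSplit {

    // Splits one line. A cell that starts with the quote character ends at the
    // first quote that is directly followed by the separator or the end of line.
    static List<String> split(String line, char sep, char quote) {
        List<String> cells = new ArrayList<>();
        int pos = 0;
        while (pos <= line.length()) {
            if (pos < line.length() && line.charAt(pos) == quote) {
                int close = -1;
                for (int i = pos + 1; i < line.length(); i++) {
                    boolean endsCell = i + 1 == line.length() || line.charAt(i + 1) == sep;
                    if (line.charAt(i) == quote && endsCell) {
                        close = i;
                        break;
                    }
                }
                if (close >= 0) {
                    // Strip the enclosing quotes, keep the content unchanged.
                    cells.add(line.substring(pos + 1, close));
                    pos = close + 2;
                    continue;
                }
                // No closing quote found: fall through and treat as unquoted.
            }
            int end = line.indexOf(sep, pos);
            if (end < 0) {
                end = line.length();
            }
            cells.add(line.substring(pos, end));
            pos = end + 1;
        }
        return cells;
    }

    public static void main(String[] args) {
        String line = "Samsung,\"Galaxy Tab A (8.0\"\", 2019)\",gto";
        for (String cell : split(line, ',', '"')) {
            System.out.println("[" + cell + "]");
        }
    }
}
```

With the Samsung line from the sample data, this sketch yields four cells instead of three: the quoted cell is cut off after `8.0"` because the second quote of the doubled pair is followed by a comma.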

stenix71 commented 5 years ago

I have implemented parsing that can handle RFC 4180 style quoting, like in your example above. If you are eager to try it out, you can use the version published as 2.1.0-SNAPSHOT, or you can clone the 2.1-develop branch and build it yourself. If you use the Maven snapshot version, you need to have snapshot versions enabled while searching the Maven repository.

In order to change the quote syntax, you will need to set the quotesyntax attribute on the csvschema element like this:

<csvschema lineseparator="\n" quotesyntax="RFC4180">

I also changed the default quote character to be the double quote character, as you expected, so quoting is now enabled by default. The reason it was not the default in the first place is that in an older version of the library, parsing quoted sources was quite inefficient. This has been solved now though. Parsing quoted and non-quoted sources takes roughly the same time now.

I will need to test the 2.1.0 version a bit more before I release it officially.
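For readers unfamiliar with RFC 4180 style quoting, the rule is that inside a quoted cell two consecutive quote characters stand for one literal quote, and only a single quote closes the cell. The following is a minimal Java sketch of that rule for a single line. It is an independent illustration, not the JSaPar 2.1 implementation; the class name `Rfc4180LineSplit` is hypothetical, and line breaks embedded in quoted cells are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not taken from the JSaPar sources.
class Rfc4180LineSplit {

    // Splits one line, treating a doubled quote inside a quoted cell
    // as a single literal quote character.
    static List<String> split(String line, char sep, char quote) {
        List<String> cells = new ArrayList<>();
        StringBuilder cell = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c != quote) {
                    cell.append(c);
                } else if (i + 1 < line.length() && line.charAt(i + 1) == quote) {
                    cell.append(quote); // escaped quote: keep one, skip the other
                    i++;
                } else {
                    inQuotes = false;   // single quote closes the cell
                }
            } else if (c == quote && cell.length() == 0) {
                inQuotes = true;        // opening quote at the start of a cell
            } else if (c == sep) {
                cells.add(cell.toString());
                cell.setLength(0);
            } else {
                cell.append(c);
            }
        }
        cells.add(cell.toString());
        return cells;
    }

    public static void main(String[] args) {
        String line = "Samsung,\"Galaxy Tab A (8.0\"\", 2019)\",gto";
        for (String cell : split(line, ',', '"')) {
            System.out.println("[" + cell + "]");
        }
    }
}
```

Applied to the Samsung line from the sample data, this sketch returns three cells, with the second one being `Galaxy Tab A (8.0", 2019)`.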

trietbui85 commented 5 years ago

I tried to enable snapshots in IntelliJ, but it could not find org.tigris.jsapar:jsapar:2.1.0-SNAPSHOT, so I built it myself. After upgrading to 2.1.0-SNAPSHOT, I use the following schema (as you suggested):

<schema xmlns="http://jsapar.tigris.org/JSaParSchema/2.0">
    <csvschema lineseparator="\n" quotesyntax="RFC4180">
        <line occurs="*" linetype="Person" cellseparator="," firstlineasschema="true" quotechar="&quot;">
        </line>
    </csvschema>
</schema>

And the result looks good: text like "Galaxy Tab A (8.0"", 2019)" is now parsed correctly to "Galaxy Tab A (8.0\", 2019)". That's what I expected. Thank you for your hard work.