stant / mdcsvimporter2015

GNU Lesser General Public License v3.0
Literal quotes not recognized #11

Open jaraco opened 3 years ago

jaraco commented 3 years ago

Consider a CSV with this line:

000-0000000-0000000,"LIHAO Snaps and Snap Pliers Set, 375 Sets T5 Plastic Buttons for Sewing and Crafting; Bright Creations Wool Filling for Pillows, Toys, Crafts (Natural White, 1 LB); Quality Park #10 Self-Seal Security Envelopes, Security Tint and Pattern, Redi-Strip Closure, 24-lb White Wove, 4-1/8"" x 9-1/2"", 100/Box (QUA69117); ",Recipient,2020-09-19,64.63,0,,3.66,,"Mastercard ending in 0000: September 19, 2020: $64.63; "

This file is generated from an export of Amazon transactions.

I've attempted to parse the file with this Reader:

image

When I parse with that reader, the line is split at each "":

image

In standard CSV, a doubled quote ("") inside a quoted field is an escape for a single literal quote ("). This CSV opens fine in Excel and Numbers, but when it is processed by the Importer, the "" after 4-1/8 and 9-1/2 are treated as field separators.
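For comparison, here is a minimal sketch of the behavior I'd expect, using Python's standard csv module rather than the Importer. The row is a shortened, made-up version of the line above; only the doubled quotes matter.

```python
import csv
import io

# Shortened stand-in for the sample row; the second field contains
# doubled quotes after 4-1/8 and 9-1/2.
line = '000-0000000-0000000,"Envelopes, 4-1/8"" x 9-1/2"", 100/Box; ",Recipient,2020-09-19,64.63\n'

# csv.reader defaults to doublequote=True, so "" inside a quoted field
# is read as one literal quote character, not as the end of the field.
row = next(csv.reader(io.StringIO(line)))

print(len(row))  # number of fields found
print(row[1])    # the quoted field, with outer quotes removed
```

The second field comes back as a single value with the inch marks as plain quote characters, which matches what Excel and Numbers show.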

I've attempted to create a regex-based parser to capture this second field, but I haven't managed to capture the field without including the quotes in the value named group. For example, I tried this expression: (?<value>((?:").*?(?:"))|(.*?))(?:[,]|\Z)(?<rest>(.*|\Z)). It still has the original issue, and it also includes the quote characters at the start and end of value. When I tried to specify (?<value>) twice, the parsing failed to produce any output at all, so I'm guessing that's invalid syntax.

This regex correctly matches across the doubled quotes: (?<value>((?:")([^"]|"")*?(?:"))|(.*?))(?:[,]|\Z)(?<rest>(.*|\Z)). It still has two problems: the quotes at the start and end are included in value, and the doubled quotes stay doubled instead of collapsing to a single quote.
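To make the desired result concrete, here is a rough sketch in Python, outside the Importer. It rests on assumptions: it uses two separate groups (the hypothetical names quoted and bare) plus a post-processing step, and I don't know whether the Importer's regex reader, which seems to expect a single value group, can do either. Python also spells named groups (?P<name>...) rather than (?<name>...).

```python
import re

# One field per match: a quoted field (outer quotes excluded from the
# capture) or a bare field, then a separator, then the rest of the line.
FIELD = re.compile(r'''
    (?:"(?P<quoted>(?:[^"]|"")*)"   # quoted body; "" is allowed inside
      |(?P<bare>[^,]*))             # or an unquoted field
    (?P<sep>,|\Z)                   # comma, or end of input
    (?P<rest>.*)                    # remainder to parse next
''', re.VERBOSE | re.DOTALL)


def split_fields(line):
    """Split one CSV line into fields, unescaping doubled quotes."""
    fields = []
    rest = line
    while True:
        m = FIELD.match(rest)
        quoted = m.group("quoted")
        if quoted is not None:
            # Collapse each "" back to a single literal quote.
            fields.append(quoted.replace('""', '"'))
        else:
            fields.append(m.group("bare"))
        if m.group("sep") == "":
            # The separator matched end of input, so this was the last field.
            return fields
        rest = m.group("rest")


sample = 'id,"4-1/8"" x 9-1/2"", 100/Box",Recipient,,3.66'
print(split_fields(sample))
```

The unescaping happens after the match because a regex capture can only return a substring of the input, so the doubled quotes can't be collapsed by the pattern alone.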

Is there any way to configure the importer to read a line like the one above the way Numbers or Excel do?

image