logstash-plugins / logstash-filter-csv

Apache License 2.0

Logstash CSV Filter - Quote character parse failure #64

Open NerdSec opened 6 years ago

NerdSec commented 6 years ago

Hi, I have a CSV file and the format is something like this:

"102","60","Open","I hope this works out for \"random.guy@gmail.com\""

When I parse this using the CSV filter I get the following error:

[2018-01-23T13:11:58,523][WARN ][logstash.filters.csv ] Error parsing csv {:field=>"message", :source=>"\"102\",\"60\",\"Open\",\"I hope this works out for \\\"random.guy@gmail.com\\\"\"", :exception=>#<CSV::MalformedCSVError: Missing or stray quote in line 1>}

The quote characters seem to be malformed in the error message. I have currently worked around this issue by using gsub before passing the data to the csv filter. Is this a known bug with the csv filter?

https://discuss.elastic.co/t/csv-filter-quote-character-parse-failure/116611
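For what it's worth, a strict RFC 4180 parser rejects that line for the same reason. Here is a quick sketch using Python's stdlib csv module in strict mode (not the Ruby CSV parser Logstash uses internally, just an analogue to illustrate why the line is considered malformed):

```python
import csv
import io

# The problematic line from the report: quotes escaped with backslashes
line = '"102","60","Open","I hope this works out for \\"random.guy@gmail.com\\""'

# A strict parser expects a doubled quote ("") or a delimiter after a
# quote character inside a quoted field, so the backslash-escaped quote
# is rejected, much like Ruby's CSV::MalformedCSVError above.
try:
    next(csv.reader(io.StringIO(line), strict=True))
    print("parsed")
except csv.Error as e:
    print("parse error:", e)
```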

chnsh commented 6 years ago

I too have the same issue!

SHSauler commented 6 years ago

Could you please show how you work around this issue with gsub? I tried removing quote characters and whitespace (as per issue #44), but it still leads to _csvparsefailure.

NerdSec commented 6 years ago

Hi,

My issue was that I had extra double quotes inside the field. The issue you referenced has spaces between two fields. I had a similar issue once with a CSV and ended up using the pandas Python library to parse and index that data.

You could do the following and check if it works:

mutate {
    gsub => [
      'fieldname', '\"', '',
      'fieldname', ',\s+', ','
    ]
  }
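If you'd rather keep the embedded quotes instead of stripping them, the backslash escaping can also be parsed directly as a pre-processing step outside Logstash (as far as I can tell, logstash-filter-csv itself doesn't expose an escape-character option). A sketch using Python's stdlib csv module, the same idea as the pandas workaround mentioned above but without the pandas dependency:

```python
import csv
import io

line = '"102","60","Open","I hope this works out for \\"random.guy@gmail.com\\""'

# doublequote=False + escapechar='\\' tells the parser that quotes inside
# a field are escaped with a backslash rather than by doubling them.
row = next(csv.reader(io.StringIO(line), doublequote=False, escapechar="\\"))
print(row)
# → ['102', '60', 'Open', 'I hope this works out for "random.guy@gmail.com"']
```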

Open a post on the forum (discuss.elastic.co) and ping me. We can continue over there if needed.

jsvd commented 6 years ago

WRT the initial issue, the example doesn't seem to be well-formed CSV: https://csvlint.io/validation/5ae2c74704a9ea0004000048

Also, a csv linter in go only accepts the file with a flag to "try to parse improperly escaped quotes":

% cat txt.csv
"102","60","Open","I hope this works out for \"random.guy@gmail.com\""
% ./go/bin/csvlint txt.csv
Record #0 has error: extraneous or missing " in quoted-field

unable to parse any further
% ./go/bin/csvlint -lazyquotes txt.csv
Warning: not using defaults, may not validate CSV to RFC 4180
file is valid

droberts195 commented 5 years ago

The RFC for CSV says "a double-quote appearing inside a field must be escaped by preceding it with another double quote" (it's rule 7 in section 2).

A non-standard alternative is of course to escape quotes using some other character, usually a backslash.
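To illustrate the two styles side by side, here is a sketch with Python's stdlib csv module (again, not the parser Logstash uses; purely to show the rule): the standard-compliant doubled quote parses with default settings, while the backslash variant needs a non-standard option.

```python
import csv
import io

# RFC 4180 style (rule 7): a quote inside a quoted field is doubled
rfc = '"Open","I hope this works out for ""random.guy@gmail.com"""'
print(next(csv.reader(io.StringIO(rfc))))
# → ['Open', 'I hope this works out for "random.guy@gmail.com"']

# Non-standard backslash style: needs doublequote=False + an escapechar
bs = '"Open","I hope this works out for \\"random.guy@gmail.com\\""'
print(next(csv.reader(io.StringIO(bs), doublequote=False, escapechar="\\")))
# → same result as above
```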

Other CSV parsers have also had the dilemma over whether to support these non-standard CSV formats. For example, SuperCSV agonised over it for a while in super-csv/super-csv#14 before eventually adding an option to support it in super-csv/super-csv#103.

I've added this comment to (a) make sure there's a clear statement of the dilemma and (b) subscribe to the issue, because other parts of the Elastic Stack also use CSV now and it would be nice if there were consistency about which escaping options are supported. Currently the find_file_structure endpoint added in ML 6.5.0 doesn't support non-standard escaping of quotes, but this could be added to find_file_structure in a future release if Logstash and/or Filebeat ever support it.