logstash-plugins / logstash-filter-csv

Apache License 2.0

Introduce option to auto-convert unquoted field values to float or integer #54

Open GrahamHannington opened 7 years ago

GrahamHannington commented 7 years ago

Background to this suggestion

I help to develop a product that forwards logs to Logstash over TCP as either CSV or JSON Lines.

The product can forward hundreds of types of logs. The original logs are in proprietary binary formats. The product parses the original logs and converts them before forwarding.

With JSON Lines, a single Logstash config works for all of these log types.

However, with CSV, the Logstash configs are not only specific to each log type, but are also specific to the particular set of fields (columns) being forwarded from that log type.

With CSV, each Logstash config contains:

a columns option listing the field (column) names for that log type
a convert option mapping each numeric field to float or integer

Without that convert option, Logstash forwards all CSV field values as strings.

For my personal use cases, I'm reasonably happy forwarding JSON Lines, so, for me, using CSV is more-or-less unnecessary. But when I think about users who—for whatever reason—want to use CSV, I look at those columns and convert options in the CSV flavor of the configs, and I wonder: can we avoid those?

I understand that I can avoid columns by using autodetect_column_names, but what about convert?

Note:

My suggestion

Introduce a new option to the CSV filter that enables it to determine data types from the first row of input CSV data: if a field value in that row is enclosed in quotes, treat the field as a string; otherwise, treat it as numeric, with float vs. integer determined by the presence or absence of a decimal point.

I understand that this introduces specific expectations about the formatting of the CSV input data. I'm comfortable with that because I have the luxury of being able to control the format of that data. I understand that this suggestion won't apply to all CSV-format data.

A possible variation: instead of using field values to distinguish string from numeric, use column names. That is, if a column name (in the header row) is enclosed in quotes, it's a string field; otherwise, it's numeric, with float vs. integer determined by the presence or absence of a decimal point in the first data row (after the header). Only the header would then need quotes. I realize that the first option I described explicitly uses the first row of input data, so, strictly speaking, subsequent rows would not need quotes either; but I'm slightly more "squeamish" about applying such formatting to a single data row than I am about applying it to the header row.

(The product already—deliberately, by design—outputs all values of float fields with a decimal point; if the value of a float field happens to be an integer, then that value is output with a trailing decimal point followed by a single zero. I get that JSON—what Logstash forwards to Elasticsearch—distinguishes between strings and numbers, but not between integers and floating-point numbers.)

Thoughts? (Here's one: forget it, CSV was never supposed to specify data types; don't try to make it.)

GrahamHannington commented 7 years ago

Some notes on my suggestion:

GrahamHannington commented 6 years ago

A simple example, with a header followed by two data rows:

time,trancode,userid,resptime,dbcalls
"2017-09-21T11:35:40.123456Z","DPST","usra",1.264,3
"2017-09-21T11:35:41.654321Z","WTHD","usrb",1.003,2

A variation, with quotes in the header row only:

"time","trancode","userid",resptime,dbcalls
2017-09-21T11:35:40.123456Z,DPST,usra,1.264,3
2017-09-21T11:35:41.654321Z,WTHD,usrb,1.003,2
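As a rough sketch (not actual filter code), the header-quoting variation could be implemented along these lines, assuming no column names contain embedded commas or quotes:

```ruby
require 'csv'

# Hypothetical sketch of the header-quoting variation: a quoted header
# name marks a string column; an unquoted one marks a numeric column,
# refined to :float or :integer by checking the first data row for a
# decimal point. The naive split on "," is why embedded commas in
# column names are assumed away.
def infer_types(header_line, first_data_line)
  names  = CSV.parse_line(header_line)
  quoted = header_line.split(',').map { |raw| raw.strip.start_with?('"') }
  first  = CSV.parse_line(first_data_line)
  names.each_with_index.map do |name, i|
    type =
      if quoted[i]
        :string
      elsif first[i].include?('.')
        :float
      else
        :integer
      end
    [name, type]
  end.to_h
end

types = infer_types('"time","trancode","userid",resptime,dbcalls',
                    '2017-09-21T11:35:40.123456Z,DPST,usra,1.264,3')
# types maps "time"/"trancode"/"userid" to :string,
# "resptime" to :float, and "dbcalls" to :integer
```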

jsvd commented 6 years ago

It should be easy to leverage the converters feature in Ruby's CSV library. Changing this line:

values = CSV.parse_line(source, :col_sep => @separator, :quote_char => @quote_char)

to:

values = CSV.parse_line(source, :converters => :all, :col_sep => @separator, :quote_char => @quote_char)

Makes processing go from:

{
       "column1" => "2017-09-21T11:35:41.654321Z",
    "@timestamp" => 2017-11-10T17:10:48.358Z,
       "column5" => "2",
       "column4" => "1.003",
       "column3" => "usrb",
       "column2" => "WTHD",
      "@version" => "1",
          "host" => "Joaos-MacBook-Pro-5.local",
       "message" => "\"2017-09-21T11:35:41.654321Z\",\"WTHD\",\"usrb\",1.003,2"
}

to:

{
       "column1" => "2017-09-21T11:35:41.654321Z",
    "@timestamp" => 2017-11-10T17:14:27.742Z,
       "column5" => 2,
       "column4" => 1.003,
       "column3" => "usrb",
       "column2" => "WTHD",
      "@version" => "1",
          "host" => "Joaos-MacBook-Pro-5.local",
       "message" => "\"2017-09-21T11:35:41.654321Z\",\"WTHD\",\"usrb\",1.003,2"
}

If there are any takers, it's just a matter of exposing a new setting that enables auto-detection and sets this parameter in the parse_line call.
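The change above can be checked in a few lines of plain Ruby, using the sample line from this thread:

```ruby
require 'csv'

line = '"2017-09-21T11:35:41.654321Z","WTHD","usrb",1.003,2'

# Default behaviour: every field comes back as a String.
plain = CSV.parse_line(line, :col_sep => ',', :quote_char => '"')

# With :converters => :all, numeric-looking fields become Float/Integer,
# while non-numeric fields remain Strings.
converted = CSV.parse_line(line, :col_sep => ',', :quote_char => '"',
                           :converters => :all)
```

One caveat worth noting: as far as I can tell, Ruby's built-in converters look only at the field's text, not at whether it was quoted, so a quoted numeric value such as "2" would be converted too; honoring a strict quoted-means-string convention would need a custom converter.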