logstash-plugins / logstash-filter-csv

Apache License 2.0

Introduce option to auto-convert unquoted field values to float or integer #54

Open GrahamHannington opened 7 years ago

GrahamHannington commented 7 years ago

Background to this suggestion

I help to develop a product that forwards logs to Logstash over TCP as either CSV or JSON Lines.

The product can forward hundreds of types of logs. The original logs are in proprietary binary formats. The product parses the original logs and converts them before forwarding.

With JSON Lines, a single Logstash config works for all of these log types.

However, with CSV, the Logstash configs are not only specific to each log type, but are also specific to the particular set of fields (columns) being forwarded from that log type.

With CSV, each Logstash config contains:

a columns option listing the field (column) names for that log type
a convert option mapping each numeric field to float or integer

Without that convert option, Logstash forwards all CSV field values as strings.

For my personal use cases, I'm reasonably happy forwarding JSON Lines, so, for me, using CSV is more-or-less unnecessary. But when I think about users who—for whatever reason—want to use CSV, I look at those columns and convert options in the CSV flavor of the configs, and I wonder: can we avoid those?

I understand that I can avoid columns by using autodetect_column_names, but what about convert?

Note:

My suggestion

Introduce a new option to the CSV filter that enables it to determine data types from the first row of input CSV data: if a field value in that row is enclosed in quotes, treat the field as a string; otherwise, treat it as numeric, with float vs. integer determined by the presence or absence of a decimal point.

I understand that this introduces specific expectations about the formatting of the CSV input data. I'm comfortable with that because I have the luxury of being able to control the format of that data. I understand that this suggestion won't apply to all CSV-format data.

A possible variation: instead of using field values to distinguish string from numeric, use column names. That is, if a column name (in the header row) is enclosed in quotes, it's a string field; otherwise, it's numeric, with float vs. integer determined by the presence or absence of a decimal point in the first data row (after the header). Only the header would then need quotes. I realize that the first option I described explicitly uses the first row of input data, so, strictly speaking, subsequent rows would not need quotes either; but I'm slightly more "squeamish" about applying such formatting to a single data row than I am about applying it to the header row.

(The product already—deliberately, by design—outputs all values of float fields with a decimal point; if the value of a float field happens to be an integer, then that value is output with a trailing decimal point followed by a single zero. I get that JSON—what Logstash forwards to Elasticsearch—distinguishes between strings and numbers, but not between integers and floating-point numbers.)

Thoughts? (Here's one: forget it, CSV was never supposed to specify data types; don't try to make it.)

GrahamHannington commented 7 years ago

Some notes on my suggestion:

GrahamHannington commented 6 years ago

A simple example, with a header followed by two data rows:

time,trancode,userid,resptime,dbcalls
"2017-09-21T11:35:40.123456Z","DPST","usra",1.264,3
"2017-09-21T11:35:41.654321Z","WTHD","usrb",1.003,2

A variation, with quotes in the header row only:

"time","trancode","userid",resptime,dbcalls
2017-09-21T11:35:40.123456Z,DPST,usra,1.264,3
2017-09-21T11:35:41.654321Z,WTHD,usrb,1.003,2
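As a rough sketch (not actual filter code), the header-quoting variation could be implemented along these lines, assuming no column names contain embedded commas or quotes:

```ruby
require 'csv'

# Hypothetical sketch of the header-quoting variation: a quoted header
# name marks a string column; an unquoted one marks a numeric column,
# refined to :float or :integer by checking the first data row for a
# decimal point. The naive split on "," is why embedded commas in
# column names are assumed away.
def infer_types(header_line, first_data_line)
  names  = CSV.parse_line(header_line)
  quoted = header_line.split(',').map { |raw| raw.strip.start_with?('"') }
  first  = CSV.parse_line(first_data_line)
  names.each_with_index.map do |name, i|
    type =
      if quoted[i]
        :string
      elsif first[i].include?('.')
        :float
      else
        :integer
      end
    [name, type]
  end.to_h
end

types = infer_types('"time","trancode","userid",resptime,dbcalls',
                    '2017-09-21T11:35:40.123456Z,DPST,usra,1.264,3')
# types maps "time"/"trancode"/"userid" to :string,
# "resptime" to :float, and "dbcalls" to :integer
```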

jsvd commented 6 years ago

It should be easy to leverage the converters feature in Ruby's CSV library. Changing this line:

values = CSV.parse_line(source, :col_sep => @separator, :quote_char => @quote_char)

to:

values = CSV.parse_line(source, :converters => :all, :col_sep => @separator, :quote_char => @quote_char)

Makes processing go from:

{
       "column1" => "2017-09-21T11:35:41.654321Z",
    "@timestamp" => 2017-11-10T17:10:48.358Z,
       "column5" => "2",
       "column4" => "1.003",
       "column3" => "usrb",
       "column2" => "WTHD",
      "@version" => "1",
          "host" => "Joaos-MacBook-Pro-5.local",
       "message" => "\"2017-09-21T11:35:41.654321Z\",\"WTHD\",\"usrb\",1.003,2"
}

to:

{
       "column1" => "2017-09-21T11:35:41.654321Z",
    "@timestamp" => 2017-11-10T17:14:27.742Z,
       "column5" => 2,
       "column4" => 1.003,
       "column3" => "usrb",
       "column2" => "WTHD",
      "@version" => "1",
          "host" => "Joaos-MacBook-Pro-5.local",
       "message" => "\"2017-09-21T11:35:41.654321Z\",\"WTHD\",\"usrb\",1.003,2"
}

If there are any takers, it's just a matter of exposing a new setting that enables auto-detection and sets this parameter in the parse_line call.
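The change above can be checked in a few lines of plain Ruby, using the sample line from this thread:

```ruby
require 'csv'

line = '"2017-09-21T11:35:41.654321Z","WTHD","usrb",1.003,2'

# Default behaviour: every field comes back as a String.
plain = CSV.parse_line(line, :col_sep => ',', :quote_char => '"')

# With :converters => :all, numeric-looking fields become Float/Integer,
# while non-numeric fields remain Strings.
converted = CSV.parse_line(line, :col_sep => ',', :quote_char => '"',
                           :converters => :all)
```

One caveat worth noting: as far as I can tell, Ruby's built-in converters look only at the field's text, not at whether it was quoted, so a quoted numeric value such as "2" would be converted too; honoring a strict quoted-means-string convention would need a custom converter.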