rsyslog / liblognorm

a fast samples-based log normalization library
http://www.liblognorm.com
GNU Lesser General Public License v2.1

add CSV motif parser #74

Open · rgerhards opened this issue 9 years ago

rgerhards commented 9 years ago

Note that CSV is not well defined; we need a couple of parameters to describe it. This may also be used to parse things like W3C logs (maybe...).
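
As a quick illustration of how under-specified CSV is: the same record can split into a different number of fields depending on the separator and the quoting convention, which is roughly the set of parameters such a motif would need. The sketch below uses Python's csv module purely for illustration; nothing here is liblognorm syntax or API.

```python
import csv
import io

line = 'alice,"smith, jr","said ""hi""",42'

# Comma separator, double-quote quoting, "" as the embedded-quote escape:
print(next(csv.reader(io.StringIO(line))))
# ['alice', 'smith, jr', 'said "hi"', '42']

# Same bytes with quoting disabled: the field count changes from 4 to 5.
print(next(csv.reader(io.StringIO(line), quoting=csv.QUOTE_NONE)))
# ['alice', '"smith', ' jr"', '"said ""hi"""', '42']

# Tab as the separator, roughly what W3C extended log format uses:
w3c = '2015-07-09\t12:00:00\tGET\t/index.html'
print(next(csv.reader(io.StringIO(w3c), delimiter='\t')))
# ['2015-07-09', '12:00:00', 'GET', '/index.html']
```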

rgerhards commented 9 years ago

@radu-gheorghe I know you occasionally work with CSV formats. How do you think the fields should be named? Would it be OK to somehow provide a list of field names to the parser? If so, what should happen if an actual message has fewer or more fields than there are configured names?

radu-gheorghe commented 9 years ago

A list of columns probably has to be specified in the config somehow - I don't see another way for production. For development, I like the way Logstash's CSV parser lets you skip defining columns and just names them column1, column2, etc. for you. This can also happen if the message itself has more columns than defined. I'm not sure anyone uses this in production, but it's definitely convenient for getting started (e.g. if you just want to benchmark the thing and see whether you get enough performance or should go down a different route).

If the message has more fields than configured, it would probably be understandable for parsing to fail unless something like rest is configured.

davidelang commented 9 years ago

On Thu, 9 Jul 2015, Radu Gheorghe wrote:

> A list of columns probably has to be specified in the config somehow - I don't see another way for production. For development, I like the way Logstash's CSV parser lets you skip defining columns and just names them column1, column2, etc. for you. This can also happen if the message itself has more columns than defined. I'm not sure anyone uses this in production, but it's definitely convenient for getting started (e.g. if you just want to benchmark the thing and see whether you get enough performance or should go down a different route).

Also take a look at what nxlog offers for CSV files. Judging from all the options they provide, it looks like something that has been beaten on a lot in practice (along with their name-value pair parser).

> If the message has more fields than configured, it would probably be understandable for parsing to fail unless something like rest is configured.

I'd say there should be a config option to be strict about the number of columns. I think it's reasonable for the default to be a 'best effort' type of thing, because when parsing something you will be matching a particular log source (probably imfile with a provided syslogtag), not trying to infer the type of log from the number of columns.

If there are fewer data items than names, either ignore the extra names or assign '' to them.

If there are more data items than names, default names like column1, column2 sound like a good idea.

The other option that may be useful is to turn the CSV items into an array rather than a set of name-value pairs.

David Lang
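
A minimal sketch of the naming policy described above: configured names first, columnN defaults for extra values, '' for missing values, an optional strict mode, and the array variant. Python is used only for illustration, and the function and parameter names (normalize_csv, names, strict, as_array) are hypothetical, not liblognorm API.

```python
import csv
import io

def normalize_csv(line, names=None, separator=",", strict=False, as_array=False):
    """Illustrative sketch of the column-naming rules discussed above."""
    values = next(csv.reader(io.StringIO(line), delimiter=separator))
    if as_array:
        return {"fields": values}           # array of items instead of name-value pairs
    names = list(names or [])
    if strict and len(values) != len(names):
        return None                         # strict mode: mismatched column count fails
    result = {}
    for i, value in enumerate(values):
        # Extra values beyond the configured names get defaults column1, column2, ...
        key = names[i] if i < len(names) else "column%d" % (i + 1)
        result[key] = value
    for name in names[len(values):]:
        result[name] = ""                   # fewer values than names: assign ''
    return result

print(normalize_csv('10.0.0.1,GET,200', names=["ip", "method"]))
# {'ip': '10.0.0.1', 'method': 'GET', 'column3': '200'}
print(normalize_csv('10.0.0.1,GET', names=["ip", "method", "status"]))
# {'ip': '10.0.0.1', 'method': 'GET', 'status': ''}
```

Whether the default is strict or best effort is exactly the config option suggested above.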

missnebun commented 7 years ago

Also, the ability to configure the separator character would be a nice feature.

davidelang commented 5 years ago

Note that we now have the quoted string parser, which helps with this (I don't remember whether we had it back when this issue was opened).

If we implement the optional parser motif ( https://github.com/rsyslog/liblognorm/issues/86 ), this CSV motif almost becomes a wrapper that handles column naming and mismatched column counts, but otherwise it is basically the recursive pattern:

optional(qstring) repeating(optional("," optional(qstring)))
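
To make that recursion concrete, here is a rough Python rendition of the pattern as a hand-rolled descent over one record. It is illustrative only: it ignores doubled-quote escapes, and the real motifs would of course be liblognorm parsers implemented in C.

```python
def parse_csv_record(text):
    """Sketch of: optional(qstring) repeating(optional(',' optional(qstring)))."""
    pos = 0

    def optional_qstring():
        # optional(qstring): consume a double-quoted string if one starts here,
        # otherwise take everything up to the next separator; may match nothing.
        nonlocal pos
        if pos < len(text) and text[pos] == '"':
            end = text.find('"', pos + 1)
            if end != -1:
                value, pos = text[pos + 1:end], end + 1
                return value
        end = text.find(',', pos)
        end = len(text) if end == -1 else end
        value, pos = text[pos:end], end
        return value

    fields = [optional_qstring()]
    while pos < len(text) and text[pos] == ',':   # repeating(optional(',' ...))
        pos += 1
        fields.append(optional_qstring())
    return fields

print(parse_csv_record('"a,b",,plain,"c"'))
# ['a,b', '', 'plain', 'c']
```

The CSV motif itself would then mostly be the naming and column-count handling layered on top of this structure.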