logstash-plugins / logstash-filter-csv


skip_header => "true" not working #68

Open espogian opened 6 years ago

espogian commented 6 years ago

Hi, this issue consistently appears when using the following Logstash filter configuration:

filter {
    csv {
        skip_header => "true"
        columns => [
            ...

Even with the skip_header => "true" option set, Logstash still indexes the very first row of the CSV. Reproduced with Logstash 6.3.0.

MorrieAtElastic commented 6 years ago

I have a customer seeing this problem in 6.2.4; I have subsequently reproduced it in both 6.2.4 and 6.4.0:

Test configuration file:

input {
  file {
    path => ["/home/TestData/example.csv"]
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    columns => ["FLOW_FILE_ID", "ID", "Name", "Year"]
    separator => ","
    add_field => { "SESSION_ID" => "%{FLOW_FILE_ID}%{ID}" }
    skip_header => "true"
  }
  mutate { convert => ["Year", "integer"] }
}

output {
  elasticsearch {
    action => "index"
    hosts => "localhost:9200"
    index => "example"
    user => "elastic"
    password => "password"
    workers => 1
  }
  stdout {}
}

Test Data - example.csv

FLOW_FILE_ID,ID,Name,Year 
123,1,Kiano Reevs,1976 
456,2,Lee Childs ,1978
guyboertje commented 6 years ago

@espogian @MorrieAtElastic

The skip_header code is:

        if (@skip_header && (!@columns.empty?) && (@columns == values))
          event.cancel
          return
        end

The above means that the columns setting must be specified and that the columns array must be exactly the same as the parsed line, element for element.

Please verify that there are no extra spaces or similar characters in the header line of the sample data.

If spaces are lurking in the data, you should add those spaces to the columns setting as well.
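To make the failure mode concrete, here is a minimal Ruby sketch (standard library only, not plugin code) applying the same equality check to the sample header above:

require "csv"

columns = ["FLOW_FILE_ID", "ID", "Name", "Year"]

# Header line with a trailing space, as in the sample data above:
CSV.parse_line("FLOW_FILE_ID,ID,Name,Year ") == columns
# => false, so the event is NOT cancelled and the header row is indexed

# The same header line with no trailing space:
CSV.parse_line("FLOW_FILE_ID,ID,Name,Year") == columns
# => true, so event.cancel fires and the row is skipped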

guyboertje commented 6 years ago

This test config works as expected:

input {
  generator {
    lines => ["red,amber,green", "stop,wait for it,go go go - man"]
    count => 1
  }
}

filter {
  csv {
    separator => ","
    columns => ["red","amber","green"]
    skip_header => "true"
  }
}

output {
  stdout {
    codec => rubydebug
  }
}

Results

bin/logstash -f /elastic/tmp/testing/confs/test-csv-skip_headers.conf
Sending Logstash's logs to /elastic/tmp/logstash-6.3.2/logs which is now configured via log4j2.properties
[2018-08-28T10:21:08,573][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2018-08-28T10:21:09,194][INFO ][logstash.runner          ] Starting Logstash {"logstash.version"=>"6.3.2"}
[2018-08-28T10:21:12,276][INFO ][logstash.pipeline        ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>8, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2018-08-28T10:21:12,367][INFO ][logstash.pipeline        ] Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x64dac7c2 sleep>"}
[2018-08-28T10:21:12,438][INFO ][logstash.agent           ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2018-08-28T10:21:12,968][INFO ][logstash.agent           ] Successfully started Logstash API endpoint {:port=>9600}
{
           "red" => "stop",
      "@version" => "1",
         "amber" => "wait for it",
         "green" => "go go go - man",
       "message" => "stop,wait for it,go go go - man",
    "@timestamp" => 2018-08-28T09:21:12.411Z,
          "host" => "Elastics-MacBook-Pro.local",
      "sequence" => 0
}
[2018-08-28T10:21:13,079][INFO ][logstash.pipeline        ] Pipeline has terminated {:pipeline_id=>"main", :thread=>"#<Thread:0x64dac7c2 run>"}
MorrieAtElastic commented 6 years ago

I was able to identify a trailing space in the header line from the debug output. Note the trailing space in the "message" field below. When I eliminated that trailing space, the skip_header function did work.

{ "@version" => "1", "message" => "FLOW_FILE_ID,ID,Name,Year ", "FLOW_FILE_ID" => "FLOW_FILE_ID", "Year" => 0, "Name" => "Name", "path" => "/home/morrie/TestData/example4.csv", "@timestamp" => 2018-08-28T10:18:44.401Z, "ID" => "ID", "host" => "morrie-Elasti" }

My problem with this is that it is a rather weak definition of a header. The header is, by definition, the first line of the file. Would an enhancement request to redefine the meaning of "header" for a CSV file, as it exists in the code, be reasonable?

espogian commented 6 years ago

@guyboertje when you say that the columns array "must be exactly the same as the parsed line", do you mean that the column names specified in the Logstash configuration must exactly match those in the first row of the file? In other words, is it not possible to change column names in this particular situation?

MorrieAtElastic commented 6 years ago

My customer states that he "completely agrees with" my assessment:

"This logic is incorrect. All they need to do in this case is to skip the first line of the CSV file. This case is error prone and causing us spend a lot of time fixing something that not suppose to be fixed in the first place. I appreciate it if they can supply a simpler solution."

The definition of a header is the first line of the file. If this is problematic, then provide a parameter that allows customers to define how many lines their header spans and at what offset, if any, it sits in the file. But the current code's definition of a header is quite problematic and does not reflect the reality of headers in a CSV file. This is something that should be changed as soon as reasonably possible.

@guyboertje - if you'd prefer me to raise an enhancement request, please let me know in which repo to file it. If you would rather file the enhancement request yourself, please do so at your earliest convenience. Either way, the current processing logic is definitely wrong and should be changed.

And thanks for moving on this so quickly.

guyboertje commented 6 years ago

@espogian @MorrieAtElastic

How does the CSV filter know it is looking at the first line? The CSV filter is not exclusively associated with files.

With all the above considered, the best we could do was to allow the user to specify the line to be skipped.

In other words, is it not possible to change column names in this particular situation?

This is perhaps the biggest oversight in the current implementation. If any enhancement is needed, it is to add a new setting, skip_lines_file_path, where users can define a file of strings that are (full) lines to skip, or (TBD) regex-like strings that skip any line they match.
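For illustration only, a pipeline using the proposed setting might look like the sketch below; skip_lines_file_path does not exist in the plugin today, and both the setting name and the file path are hypothetical:

filter {
  csv {
    columns => ["FLOW_FILE_ID", "ID", "Name", "Year"]
    # Hypothetical setting proposed above; not implemented in the plugin.
    skip_lines_file_path => "/etc/logstash/lines_to_skip.txt"
  }
}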

MorrieAtElastic commented 6 years ago
guyboertje commented 6 years ago

@MorrieAtElastic

every CSV file can be expected to have a header

While I agree with that statement, without user-supplied hints the plugin really can't know for certain whether the line it is currently processing is a header line or not.

I like the regex idea. I'm on the fence about whether it should be a file (with periodic reloading) or an array. If it were a file, users could add entries to it while Logstash is running.

An alternative is to add a conditional before the CSV filter:

  if [message] =~ /^(start-of-header1|start-of-header2)/ {
    drop {}
  }
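Applied to the sample data earlier in this thread, that might look like the following sketch (the header pattern comes from the example CSV above; substitute your own first column names):

filter {
  # Drop any line that begins with the known header text.
  if [message] =~ /^FLOW_FILE_ID,ID,Name,Year/ {
    drop {}
  }
  csv {
    separator => ","
    columns => ["FLOW_FILE_ID", "ID", "Name", "Year"]
  }
}

Because the regex only needs to match a prefix of the line, a trailing space in the header no longer defeats the check.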
marksparrish commented 5 years ago

I think the best way to handle "noise" in your file is to use the drop filter:

if [FIRST_COLUMN] in ["FIRST_COLUMN"] { drop { } }

This way you can add more "noise" patterns as needed. You can also handle this in the output by writing the "noise" to a file for later inspection, to see whether it needs to be handled:

if [FIRST_COLUMN] in ["FIRST_COLUMN"] {
  # write to noise file
} else {
  # write to es
}

Maybe the csv filter could provide a "noise" setting to do this automatically, but then you would lose the ability to branch the output.
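A fuller version of that output branching might look like the sketch below; the noise-file path and index name are placeholders, and FIRST_COLUMN stands in for whatever your first column is actually named:

output {
  if [FIRST_COLUMN] in ["FIRST_COLUMN"] {
    # Header rows (and any other recognised noise) go to a file for later inspection.
    file {
      path => "/var/log/logstash/noise.log"
    }
  } else {
    elasticsearch {
      hosts => "localhost:9200"
      index => "example"
    }
  }
}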

colinsurprenant commented 4 years ago

The new csv codec should help with this; in particular, when paired with the file input, a separate codec instance is used per file, so different files can have different headers.
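For example, an input along these lines should let each file's first row define its own columns. This is a sketch that assumes the csv codec's autodetect_column_names option; check the codec documentation for your version:

input {
  file {
    path => ["/home/TestData/*.csv"]
    start_position => "beginning"
    sincedb_path => "/dev/null"
    # One codec instance per file, so each file's first line
    # can act as that file's header.
    codec => csv {
      autodetect_column_names => true
    }
  }
}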