espogian opened this issue 6 years ago
I have a customer seeing this problem in 6.2.4; I have subsequently reproduced it in both 6.2.4 and 6.4.0:
Test configuration file:
input {
  file {
    path => ["/home/TestData/example.csv"]
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    columns => ["FLOW_FILE_ID", "ID", "Name", "Year"]
    separator => ","
    add_field => { "SESSION_ID" => "%{FLOW_FILE_ID}%{ID}" }
    skip_header => "true"
  }
  mutate { convert => ["Year", "integer"] }
}
output {
  elasticsearch {
    action => "index"
    hosts => "localhost:9200"
    index => "example"
    user => "elastic"
    password => "password"
    workers => 1
  }
  stdout {}
}
Test Data - example.csv
FLOW_FILE_ID,ID,Name,Year
123,1,Kiano Reevs,1976
456,2,Lee Childs ,1978
@espogian @MorrieAtElastic
The skip_header code is:
if (@skip_header && (!@columns.empty?) && (@columns == values))
  event.cancel
  return
end
The above means that the columns setting must be specified and the columns array value must be exactly the same as the parsed line as an array.
Please verify that there are no extra spaces or similar in the sample data header line. If there were spaces lurking in the data, then you should add those spaces to the columns setting too.
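For instance (an illustrative snippet, not from the original report; it assumes the raw header line ends with a trailing space):
csv {
  # Note the trailing space in the last column name, matching the raw header line
  columns => ["FLOW_FILE_ID", "ID", "Name", "Year "]
  separator => ","
  skip_header => "true"
}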
This test config is successful:
input {
  generator {
    lines => ["red,amber,green", "stop,wait for it,go go go - man"]
    count => 1
  }
}
filter {
  csv {
    separator => ","
    columns => ["red","amber","green"]
    skip_header => "true"
  }
}
output {
  stdout {
    codec => rubydebug
  }
}
Results
bin/logstash -f /elastic/tmp/testing/confs/test-csv-skip_headers.conf
Sending Logstash's logs to /elastic/tmp/logstash-6.3.2/logs which is now configured via log4j2.properties
[2018-08-28T10:21:08,573][WARN ][logstash.config.source.multilocal] Ignoring the 'pipelines.yml' file because modules or command line options are specified
[2018-08-28T10:21:09,194][INFO ][logstash.runner ] Starting Logstash {"logstash.version"=>"6.3.2"}
[2018-08-28T10:21:12,276][INFO ][logstash.pipeline ] Starting pipeline {:pipeline_id=>"main", "pipeline.workers"=>8, "pipeline.batch.size"=>125, "pipeline.batch.delay"=>50}
[2018-08-28T10:21:12,367][INFO ][logstash.pipeline ] Pipeline started successfully {:pipeline_id=>"main", :thread=>"#<Thread:0x64dac7c2 sleep>"}
[2018-08-28T10:21:12,438][INFO ][logstash.agent ] Pipelines running {:count=>1, :running_pipelines=>[:main], :non_running_pipelines=>[]}
[2018-08-28T10:21:12,968][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
{
"red" => "stop",
"@version" => "1",
"amber" => "wait for it",
"green" => "go go go - man",
"message" => "stop,wait for it,go go go - man",
"@timestamp" => 2018-08-28T09:21:12.411Z,
"host" => "Elastics-MacBook-Pro.local",
"sequence" => 0
}
[2018-08-28T10:21:13,079][INFO ][logstash.pipeline ] Pipeline has terminated {:pipeline_id=>"main", :thread=>"#<Thread:0x64dac7c2 run>"}
I was able to identify a trailing space in the header line based on the debug output. Check the trailing space in the "message" field below. When I eliminated that trailing space, the "skip_header" function did work.
{ "@version" => "1", "message" => "FLOW_FILE_ID,ID,Name,Year ", "FLOW_FILE_ID" => "FLOW_FILE_ID", "Year" => 0, "Name" => "Name", "path" => "/home/morrie/TestData/example4.csv", "@timestamp" => 2018-08-28T10:18:44.401Z, "ID" => "ID", "host" => "morrie-Elasti" }
My problem with this is that it is a rather weak definition of a header. It seems to me that the header is, by definition, the first line of the file. Is an enhancement request reasonable to redefine the meaning of "header" for a CSV file, as it exists in the code?
@guyboertje when you specify that "the columns array value must be exactly the same as the parsed line as an array", do you mean that the column names specified in the Logstash configuration should be exactly the same as those in the first row of the file? In other words, is it not possible to change column names in this particular situation?
My customer states that he "completely agrees with" my assessment:
"This logic is incorrect. All they need to do in this case is to skip the first line of the CSV file. This case is error prone and causing us spend a lot of time fixing something that not suppose to be fixed in the first place. I appreciate it if they can supply a simpler solution."
The definition of a header is the first line of the file. If this is problematic, then provide a parameter which allows customers to define how many lines their header spans and what offset, if any, the header occupies in the file. But the current code definition of a header is quite problematic and does not reflect the reality of headers in a CSV file. This is something that should be changed as soon as reasonably possible.
@guyboertje - if you'd prefer me to raise an enhancement request, please let me know in which repo to file it. If you'd prefer to file said enhancement request yourself, please do so at your earliest convenience. But the current processing logic is definitely wrong and should be changed.
And thanks for moving on this so quickly.
@espogian @MorrieAtElastic
How does the CSV filter know it is looking at the first line? The CSV filter is not exclusively associated with files.
With all the above considered, the best we could do was to allow the user to specify the line to be skipped.
In other words, is it not possible to change column names in this particular situation?
This is perhaps the biggest oversight in the current implementation. If any enhancement is needed, it is to add a new setting, skip_lines_file_path, where users can define a file of strings that are (full) lines to skip, or (TBD) regex-like strings that, if matched to a line, will skip it.
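A purely hypothetical sketch of what that might look like (skip_lines_file_path does not exist in the plugin; the file path is illustrative):
filter {
  csv {
    separator => ","
    columns => ["FLOW_FILE_ID", "ID", "Name", "Year"]
    # Proposed setting: each line of the referenced file is a full line
    # (or, per the TBD, a regex) that, when matched, skips the event
    skip_lines_file_path => "/etc/logstash/skip_lines.txt"
  }
}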
I think the regex idea has merit and I have asked my customer to review it.
I think we clearly need an improvement over the current system: every CSV file can be expected to have a header, and we need to be able to handle the header.
@MorrieAtElastic
every CSV file can be expected to have a header
While I agree with that statement, without user-supplied hints the plugin really can't know for certain whether the current line it is processing is a header line or not.
I like the regex idea. I'm on the fence about whether it should be a file (with periodic reloading) or an array. If a file, then users can add to the file while LS is running, etc.
An alternative is to add a conditional before the CSV filter...
if [message] =~ "^(start-of-header1|start-of-header2)" {
  drop {}
}
I think the best way to handle "noise" in your file is to use the drop filter:
if [FIRST_COLUMN] in ["FIRST_COLUMN"] { drop { } }
This way you can add more "noise" patterns as needed.
You can also do this in the output by writing the "noise" to a file for later inspection, to see if it needs to be handled.
if [FIRST_COLUMN] in ["FIRST_COLUMN"] {
} else {
  # write to es
}
Maybe the csv filter could provide a "noise" setting to automatically do this, but then you would lose the ability to branch the output.
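As a sketch of that output-side branching (illustrative, assuming the same FIRST_COLUMN field; the noise file path is arbitrary):
output {
  if [FIRST_COLUMN] in ["FIRST_COLUMN"] {
    # Divert "noise" lines to a file for later inspection
    file {
      path => "/var/log/logstash/csv-noise.log"
    }
  } else {
    elasticsearch {
      hosts => "localhost:9200"
      index => "example"
    }
  }
}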
The new csv codec should help with this; in particular, when paired with the file input, a separate codec instance will be used per file, so different files can have different headers.
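A sketch of that pairing (it assumes the logstash-codec-csv plugin is installed; treat the exact autodetect_column_names option name as an assumption):
input {
  file {
    path => ["/home/TestData/*.csv"]
    start_position => "beginning"
    sincedb_path => "/dev/null"
    # One codec instance per file, so each file's header is detected independently
    codec => csv {
      separator => ","
      autodetect_column_names => true
    }
  }
}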
Hi, this issue is apparently still occurring when using the following Logstash filter configuration:
Even using the skip_header => "true" option, Logstash is indexing the very first row of the CSV. Reproduced with Logstash 6.3.0.