`autodetect_column_names` does not work with multiple worker threads

logstash-plugins / logstash-filter-csv

Apache License 2.0


`autodetect_column_names` does not work with multiple worker threads #65

Open danhermann opened 6 years ago

danhermann commented 6 years ago

There's a race condition in the `autodetect_column_names` feature when there is more than one worker thread. The filter assumes that the first line of the CSV contains the column names, but with multiple worker threads the filter may receive lines in a different order than they appear in the input. `skip_header` may have a similar problem.

I noticed this while debugging the `java.lang.ArrayIndexOutOfBoundsException` mentioned in this blog post:

https://mikehillwig.com/2018/02/23/making-peace-with-logstash-part-2-parsing-a-csv/
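To make the failure mode concrete, here is a minimal Python sketch of the behaviour described above. It is a simplified model, not the plugin's actual Ruby code: `autodetect_filter` is a hypothetical function that treats whichever line it receives first as the header, and the reordered list stands in for what several worker threads might deliver.

```python
def autodetect_filter(lines):
    """Simplified model: the first line *received* becomes the header."""
    columns = None
    events = []
    for line in lines:
        fields = line.split(",")
        if columns is None:
            # Whichever line arrives first is taken as the column names.
            columns = fields
            continue
        events.append(dict(zip(columns, fields)))
    return columns, events


source = ["name,age", "alice,30", "bob,25"]

# One worker: lines arrive in file order, so the real header is used.
in_order_columns, in_order_events = autodetect_filter(source)

# Several workers (assumed ordering): a data row reaches the filter first.
reordered = [source[1], source[0], source[2]]
raced_columns, raced_events = autodetect_filter(reordered)

print(in_order_columns)
print(raced_columns)
```

In the reordered case the data row `alice,30` is adopted as the column names, and the real header is then emitted as if it were an ordinary event.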

jsvd commented 6 years ago

This is likely a result of event ordering not being guaranteed on Logstash 6+ when events are inserted into the queue between inputs and filters+outputs. A proper fix requires synchronization across all threads, which is essentially a rearchitecture of a big portion of the filter. The only current workaround is indeed setting workers => 1. Features like autodetect_column_names shouldn't rely on event ordering, as we don't guarantee it, especially for workers > 1.

siben168 commented 6 years ago

Thanks for pointing this out. It helped me fix my issue, where autodetect_column_names always messed up my mapping. I've set workers => 1 to fix it.

However, I currently set "pipeline.workers: 1" in "logstash.yml", which affects every pipeline. Is there a configuration item I could use in a specific pipeline.conf? That way I could limit only the CSV input that needs the autodetect_column_names feature to 1 worker.

Another issue: when I have 2 files, each with its own header, the header of the second file is still loaded. Is there any way to deal with that?

danhermann commented 6 years ago

@siben168, you can set the number of workers on each pipeline in the pipelines.yml file. See more details here: https://www.elastic.co/guide/en/logstash/current/multiple-pipelines.html

Unfortunately, as the filter is currently written, I don't know of a way to handle multiple files where each one has its own header line.

pmb311 commented 6 years ago

I'm experiencing this in Logstash 5.6.2 as well.

ensslen commented 6 years ago

I discussed this issue on the Elastic forums.

colinsurprenant commented 4 years ago

Note that the new csv codec should be more appropriate for this. In particular, when paired with the file input, it uses a separate codec instance per file and can therefore correctly adjust the columns for files whose headers may differ.
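The per-file idea can be sketched in Python as follows. This is only an illustrative model under stated assumptions, not the codec's implementation: `CsvDecoder` and `read_all` are hypothetical names, and the sketch assumes that lines from any single file are decoded in order even when lines from different files are interleaved.

```python
class CsvDecoder:
    """Hypothetical stand-in for one codec instance with its own header state."""

    def __init__(self):
        self.columns = None

    def decode(self, line):
        fields = line.split(",")
        if self.columns is None:
            # First line seen by this instance is this file's header.
            self.columns = fields
            return None
        return dict(zip(self.columns, fields))


def read_all(tagged_lines):
    """Route each (path, line) pair to a decoder dedicated to that path."""
    decoders = {}
    events = []
    for path, line in tagged_lines:
        if path not in decoders:
            decoders[path] = CsvDecoder()
        event = decoders[path].decode(line)
        if event is not None:
            events.append(event)
    return events


# Lines from two files interleaved; each file keeps its own column names.
tagged = [
    ("a.csv", "name,age"),
    ("b.csv", "city,zip"),
    ("a.csv", "alice,30"),
    ("b.csv", "paris,75001"),
]
events = read_all(tagged)
print(events)
```

Because the header state lives with the per-file decoder rather than in one shared filter, the second file's header is never mistaken for data and never overwrites the first file's columns.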