logstash-plugins / logstash-codec-multiline

Apache License 2.0
7 stars 31 forks source link

Correctly handle reads by the input module that are not aligned to a newline #38

Closed retoo closed 8 years ago

retoo commented 8 years ago

The input plugin might not align the reads by the newlines (i.e. stdin). For example stdin reads 16384 bytes and passes them to the codec, regardless if that read ends with a newline. This commit fixes this by keeping the last 'partial' line and adds it with the next decode call.

In our environment this change didn't have any noticable effect on the performance.

Fixes Issue #37.

suyograo commented 8 years ago

@retoo Can you please perform step 2 of https://github.com/elasticsearch/logstash/blob/master/CONTRIBUTING.md#contribution-steps

jordansissel commented 8 years ago

I feel this should use BufferedTokenizer since it solve this problem and is know to Logstash historically as a reusable solution for line splitting.

On Tuesday, June 14, 2016, Reto Schüttel notifications@github.com wrote:

The input plugin might not align the reads by the newlines (i.e. stdin). For example stdin reads 16384 bytes and passes them to the codec, regardless if that read ends with a newline. This commit fixes this by keeping the last 'partial' line and adds it with the next decode call.

In our environment this change didn't have any noticable effect on the performance.

Fixes Issue #37

https://github.com/logstash-plugins/logstash-codec-multiline/issues/37.

You can view, comment on, or merge this pull request online at:

https://github.com/logstash-plugins/logstash-codec-multiline/pull/38 Commit Summary

  • Correctly handle reads by the input module that are not aligned to a newline

File Changes

Patch Links:

- https://github.com/logstash-plugins/logstash-codec-multiline/pull/38.patch

https://github.com/logstash-plugins/logstash-codec-multiline/pull/38.diff

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/logstash-plugins/logstash-codec-multiline/pull/38, or mute the thread https://github.com/notifications/unsubscribe/AAIC6p73c-J9XR1zhcoLR90vrbv2XZwjks5qLtBOgaJpZM4I1fAs .

guyboertje commented 8 years ago

This has to take auto-flush into account.

I would prefer not to do this as I am sure that Event Milling will take this into account.

retoo commented 8 years ago

Yes, the autoflush part is completely broken today. On the other hand the codec is completely useless for us without the fix. Is stdin the only input plugin having this problem?

Yesterday I briefly tried to fix the tests with something simple, but failed. I'll give it another shot sometime today.

guyboertje commented 8 years ago

@retoo - Do you know how to use LS with your own fork of a plugin?

It may be easier to fork stdin and add Buffered Tokeniser like the line codec does today then use the fork.

Both stdin and multiline codec are unlikely to change very much until we deliver Event Milling and you can always pull bugfixes from upstream.

Event Milling is meant to provide a flexible mini-pipeline inside an input that allows a user to direct LS exactly how they want the data chunk to be processed.

retoo commented 8 years ago

@guyboertje I changed the stdin plugin, it's much cleaner this way.

I'll close this PR and suggest we adopt: https://github.com/logstash-plugins/logstash-input-stdin/pull/11

colinsurprenant commented 5 years ago

Until we move forward with proper boundary detection across inputs/codecs, I am proposing this solution to the multiline codec logstash-plugins/logstash-codec-multiline#63

retoo commented 5 years ago

IMHO the stdin input plugin should be thrown out of logstash if not fixed.

It is not just broken, its broken in away that it ruins your day when you finally realize that something is screwing with your log files.

colinsurprenant commented 5 years ago

@retoo You are witnessing this problem using the multiline codec right? (per #37). The stdin input is not broken in the way you think it is. It correctly reads blocks of data and passes them to the underlying codec. It is the codec's job to correctly deal with data across blocks. I have recently submitted logstash-plugins/logstash-codec-multiline#63 to fix this problem in the multiline codec.

If you are interested in helping some more with this you could try this fix locally or we could arrange to get a test build of the codec to see how it works for you?

retoo commented 5 years ago

Thanks for the explanation. I'm no longer using logstash in that particular setup, I just got a bit frustrated about seeing it still open after having a working fix in 2016 ;)

colinsurprenant commented 5 years ago

@retoo I totally understand, we'll try to do better. Thanks for your contribution.