logstash-plugins / logstash-codec-multiline

Apache License 2.0

Multiline Codec Dropping Last Line in XML Files #3

Closed suyograo closed 8 years ago

suyograo commented 9 years ago

From JIRA https://logstash.jira.com/browse/LOGSTASH-2124

Seems like a similar issue was fixed in multiline codec by @colinsurprenant

I am trying to ingest many XML files. I am using the file input with the multiline codec to group all lines of each XML file into one log entry, and then parse it with the xml filter. The filter fails to parse the XML because the very last line of the file (the closing XML tag) is missing from the message field. Some gists showing the issue:

config: https://gist.github.com/clay584/10754518
sample input file: https://gist.github.com/clay584/10754389
sample output that is missing the last line of the file: https://gist.github.com/clay584/10754659

As you can see, the last line, which contains the closing tag, is missing. It does not show up in any previous or future log entries; it's just gone.

driskell commented 9 years ago

Just a thought. Do your files end in a newline?

I saw something like this before where the XML files did not end in a newline, so Logstash Forwarder, Log Courier and Logstash all ignored the last line - they saw it as an "unfinished write" and waited for the newline to appear.
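One way to check for this is sketched below in Ruby (the file names here are hypothetical, just to illustrate):

```ruby
# Minimal sketch: does a file end with a trailing newline?
# Line-oriented inputs treat a final line without "\n" as an
# unfinished write and hold it back until the newline arrives.
def ends_with_newline?(path)
  return false if File.zero?(path)
  File.open(path, "rb") do |f|
    f.seek(-1, IO::SEEK_END)
    f.read(1) == "\n"
  end
end

File.write("complete.xml", "<doc>\n</doc>\n")
File.write("truncated.xml", "<doc>\n</doc>")

puts ends_with_newline?("complete.xml")   # => true
puts ends_with_newline?("truncated.xml")  # => false
```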

doubret commented 9 years ago

There is a problem with charset handling. I hit it when using NxLog: lines end with \r\n, but the codec splits lines using only \n.
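The symptom being described can be sketched in a few lines of Ruby (illustrative only, not the codec's actual code):

```ruby
# Splitting CRLF-terminated data on "\n" alone leaves a dangling
# "\r" at the end of every line; a delimiter-aware split (or a
# tokenizer configured with "\r\n") does not.
data = "line one\r\nline two\r\n"

naive = data.split("\n")
# each element keeps its carriage return: ["line one\r", "line two\r"]

crlf_aware = data.split("\r\n")
# clean lines: ["line one", "line two"]

p naive
p crlf_aware
```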

I modified the code to use BufferedTokenizer as in the line codec (https://github.com/elastic/logstash/blob/master/lib/logstash/util/buftok.rb).

I'll send the code tomorrow.

driskell commented 9 years ago

@doubret are you sure that is related here? It would just result in events with dangling \r and I don't see how that would affect OP's problem? Maybe needs a separate issue?

doubret commented 9 years ago

The \r is probably stripped somewhere later in the pipeline; I need to check. In the codec the event doesn't exist yet - it's just raw buffers.


guyboertje commented 8 years ago

The OP's problem is fixed with the inclusion of auto_flush (v2.0.9), or possibly with close_older if using the file input (v2.2.1).

I am including an explanation for readers from the future.

When trying to match multiline records that have distinct begin and end patterns, it's better to negate-match on the begin pattern, as the OP has done. For the XML in the OP's gist:

<?xml version="1.0"?>
<taxfile id="1692376550">
    ...
</taxfile>

codec => multiline {
  pattern => "^<\?xml .*\?>"
  negate => true
  what => "previous"
}

If a file contains one or more XML documents, then without auto_flush (or close_older on the file input) all the lines of the last XML document are buffered; because a new matching line never arrives, no event is emitted until LS is stopped.

This is why we brought in auto_flush.

But let me explain the file input behaviour first. There are two cases: file tailing and file reading. The introduction of close_older lets the user set a value in seconds, e.g. 10 seconds. A file is opened and read, and we track the time of each read - so 10 seconds after the last read, the file input closes the file and flushes its codec, which generates the event from the last XML doc stuck in the buffer. If new content later appears in the file, it is processed from the last read position. There is no need to use auto_flush in this case.
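As a sketch, a file input relying on close_older could look like this (the path and timeout value here are hypothetical):

```
input {
  file {
    path => "/var/data/*.xml"   # hypothetical path
    close_older => 10           # seconds of read inactivity before the file
                                # is closed and the codec flushed
    codec => multiline {
      pattern => "^<\?xml .*\?>"
      negate => true
      what => "previous"
    }
  }
}
```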

However, if the multiline codec is used with a different input, one would use the auto_flush_interval config setting. It does the same flush as before, but in this case the multiline codec itself tracks when the last line was buffered: if no more lines are seen for auto_flush_interval seconds, the buffered lines are flushed and an event is generated.
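For example, with a non-file input such as tcp, the flush timeout lives on the codec (port and interval here are hypothetical):

```
input {
  tcp {
    port => 5000                  # hypothetical port
    codec => multiline {
      pattern => "^<\?xml .*\?>"
      negate => true
      what => "previous"
      auto_flush_interval => 5    # flush buffered lines after 5s of silence
    }
  }
}
```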

razik29 commented 7 years ago

Hi All,

I am using Logstash 5.2 and facing the same problem. The last line of the XML file never gets read, which produces an XML parsing error when using the xml filter plugin. I could not tell from the previous comment whether this problem is fixed or not.

Input XML file:

<?xml version="1.0"?>
<taxfile id="1692376550">
    ...
</taxfile>

Logstash conf:

file {
  path => "C:/Input/*.xml"
  start_position => beginning
  codec => multiline {
    pattern => "^<\?xml .*\?>"
    negate => true
    what => "previous"
    auto_flush_interval => 1
  }
  type => "xml_type"
}

filter {
    xml {
        source => "message"
        target => "content"
    }
}

I have a couple of questions:

  1. Is it possible to read an XML file (a proper file with no trailing \n on the last line) and parse it with the xml filter, or not?

  2. I know that manually adding a newline at the end of the file makes it fully readable by Logstash, but in my use case I can't afford to update the files manually. So, is there a way Logstash can append a newline at the end of the file before reading it as multiline?

Thanks, Razik

guyboertje commented 7 years ago

This is not a bug in the multiline codec, though it looks like one: the file input buffers the last piece of text while waiting for the newline, so the multiline codec never receives it.

Please use Filebeat to solve this; see its close_eof option. https://www.elastic.co/guide/en/beats/filebeat/current/index.html
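As a sketch in the Filebeat 5.x config style (paths taken from the config above; the Logstash endpoint is hypothetical), the multiline grouping moves into Filebeat and close_eof closes the file as soon as EOF is reached:

```yaml
filebeat.prospectors:
  - input_type: log
    paths:
      - "C:/Input/*.xml"
    close_eof: true               # close the file handle as soon as EOF is hit
    multiline.pattern: '^<\?xml .*\?>'
    multiline.negate: true
    multiline.match: after        # Filebeat's equivalent of what => "previous"

output.logstash:
  hosts: ["localhost:5044"]       # hypothetical Logstash beats endpoint
```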

You can still use the xml filter to decode the XML.

ragingCow commented 7 years ago

Hi @guyboertje, I am facing the same problem as razik29. Here is my config:

codec => multiline {
  pattern => "^%{TIME}"
  negate => true
  what => "previous"
  auto_flush_interval => 1
}

Whether reading data from Kafka or stdin, the last log entry is missing, even if I append a blank line after the last log. It seems auto_flush_interval does not work.

In the production environment Logstash will get data from Kafka. How can I solve this problem?

My Logstash version is 5.2.2

guyboertje commented 7 years ago

@ragingCow Please note that the multiline codec should only be used with inputs that supply line-oriented data. The kafka input builds events directly from the JSON it receives, so the multiline codec does not work with the kafka input.

ragingCow commented 7 years ago

@guyboertje Got it, thanks.