UTF BOM before message - Githubissues

dalai4git commented 3 years ago

I am getting a token recognition error and I think it may have to do with the presence of the BOM before the message. I couldn't find a test for this. The line that gives the error has structured data, not sure if that influences anything.

ottobackwards commented 3 years ago

Hi, thanks for the the report. I think I see the problem with the BOM handling, and I do not have a test for it. Can you attach a 'sanitized' sample syslog line that is failing?

Also, do you know about https://github.com/palindromicity/simple-syslog ?

dalai4git commented 3 years ago

Hi, I didn't know of the other library, but this one is also just a transient dependency for me. I am not a developer, I just tried to debug a problem with Apache Nifi. I am attaching a file with a BOM similarly formatted to those found in our logs.

log_with_bom.txt

ottobackwards commented 3 years ago

Did you log a jira with Nifi? If so please give my the number so I can assign it to myself After I fix and push, I'll do a PR to nifi picking up the fix

ottobackwards commented 3 years ago

ok, what is happening is that: the BOM you have FEFF is being interpreted as zero width non-breaking space character. they are both FE FF

https://www.unicode.org/faq/utf_bom.html#bom1 https://www.fileformat.info/info/unicode/char/feff/index.htm

https://tools.ietf.org/html/rfc5424#page-8 only specifically mentions the utf-8 bom, not the utf-16 that would be in this message. It is actually not super clear to me from the spec and the appendix that utf-16 is supported.

Where is this syslog from?

ottobackwards commented 3 years ago

https://tools.ietf.org/html/rfc5424#section-6.4

dalai4git commented 3 years ago

Sorry for not replying before, I was away for a few days and completely forgot about this. I didn't report this to nifi, we ended up just filtering the extra bytes out.

The log comes from a custom application, it may well be that there is a problem with how it is generated. When I do a hexdump of the file I sent you I get the following (cropped to the interesting part):

000000e0  49 44 3d 22 32 30 32 32  22 5d 20 ef bb bf 52 65  |ID="2022"] ...Re|

So there is a byte sequence of EF BB BF present. I just run a quick test with an rsyslog configuration. When I add $bom in the template it also adds these bytes there.

ottobackwards commented 3 years ago

Sorry, was not using a hex editor. So, by the time I get the character data is is FE FF not EF BB BF. Not sure if this is an antlr thing or not. I'll keep looking at it

dalai4git commented 3 years ago

Hmm, it seems that they are equal representations of the same thing - the FEFF is the unicode representation. In python:

>>> bom = b'\xEF\xBB\xBF'
>>> bom
b'\xef\xbb\xbf'
>>> bom.decode('utf-8')
'\ufeff'

recype-io commented 3 years ago

I do have a similar issue. Actually, I would not care about the BOM character. But I want to parse RFC#5424 lines that contain UTF-8 chars, e.g. umlauts and others. According to the cited RFC#5424 specs, the body part MUST then start with the Unicode character u+FEFF - which is 0xefbbbf, when encoded as UTF-8.

I'm also using this as part of Nifi. Nifi fails, whenever there is this BOM character or any non-ASCII UTF-8 character (with or without BOM).

I was able to reproduce the issue with modified unit tests in this module. I attached two log lines with umlaut chars - one with and one without BOM. The files are zipped, so that they cannot be modified just by uploading as "text/plain"...

log_utf8.txt.zip log_utf8_noBOM.txt.zip

ottobackwards commented 3 years ago

Thank you, I will be addressing this as soon as I can. It may be a couple of days as my development computer is in for repairs. May I use these lines you have provided in tests?

recype-io commented 3 years ago

Sure, if they are helpful.

BTW: Related unit tests in Nifi (BaseStrictSyslog5424ParserTest) are somewhat strange. The message body is e.g. "BOM'su root' ... with the literal string BOM where there should be a BOM character.

rei-ber commented 3 years ago

same experiences in nifi as @recype-io already described. a workaround is to replace any existing bom character with an empty string. and if there are umlauts in a log line, which raises a failure in the processor, there is a grok expression which handles the log in a right way..but it seems a bit ineffective..

dalai4git commented 3 years ago

BTW: Related unit tests in Nifi (BaseStrictSyslog5424ParserTest) are somewhat strange. The message body is e.g. "BOM'su root' ... with the literal string BOM where there should be a BOM character.

Looks like the example from the RFC

ottobackwards commented 3 years ago

OK. I got my machine back and have a fix such that:

@Test
  public void testBomParsingUmlauts() throws Exception {
    Map<String, Object> map = handleFile("src/test/resources/log_utf8_umlauts.txt", NilPolicy.OMIT,
        StructuredDataPolicy.MAP_OF_MAPS, EnumSet.of(AllowableDeviations.PRIORITY));
    Assert.assertNotNull(map);
  Assert.assertTrue(map.get("syslog.message").toString().contains("äöü"));
  }

works.

bintray is closing down, so I need to move over to mvn central and do a release. I'll update. After the fix is in, I'll do a PR to NIFI to fix.

Thanks for the patience.

ottobackwards commented 3 years ago

fix:

<dependency>
  <groupId>com.github.palindromicity</groupId>
  <artifactId>simple-syslog-5424</artifactId>
  <version>0.0.16</version>
</dependency>

note, this release is in mvn central not bintray

Thanks for all the information

ottobackwards commented 3 years ago

For the folks using Nifi: https://github.com/apache/nifi/pull/4978 has landed today, so if you build main you can get it, else it will be in the next release, whenever that is.

palindromicity / simple-syslog-5424

UTF BOM before message #33