Closed dalai4git closed 3 years ago
Hi, thanks for the the report. I think I see the problem with the BOM handling, and I do not have a test for it. Can you attach a 'sanitized' sample syslog line that is failing?
Also, do you know about https://github.com/palindromicity/simple-syslog ?
Hi, I didn't know of the other library, but this one is also just a transient dependency for me. I am not a developer, I just tried to debug a problem with Apache Nifi. I am attaching a file with a BOM similarly formatted to those found in our logs.
Did you log a jira with Nifi? If so please give my the number so I can assign it to myself After I fix and push, I'll do a PR to nifi picking up the fix
ok, what is happening is that: the BOM you have FEFF is being interpreted as zero width non-breaking space character. they are both FE FF
https://www.unicode.org/faq/utf_bom.html#bom1 https://www.fileformat.info/info/unicode/char/feff/index.htm
https://tools.ietf.org/html/rfc5424#page-8 only specifically mentions the utf-8 bom, not the utf-16 that would be in this message. It is actually not super clear to me from the spec and the appendix that utf-16 is supported.
Where is this syslog from?
Sorry for not replying before, I was away for a few days and completely forgot about this. I didn't report this to nifi, we ended up just filtering the extra bytes out.
The log comes from a custom application, it may well be that there is a problem with how it is generated. When I do a hexdump of the file I sent you I get the following (cropped to the interesting part):
000000e0 49 44 3d 22 32 30 32 32 22 5d 20 ef bb bf 52 65 |ID="2022"] ...Re|
So there is a byte sequence of EF BB BF present. I just run a quick test with an rsyslog configuration. When I add $bom
in the template it also adds these bytes there.
Sorry, was not using a hex editor. So, by the time I get the character data is is FE FF not EF BB BF. Not sure if this is an antlr thing or not. I'll keep looking at it
Hmm, it seems that they are equal representations of the same thing - the FEFF is the unicode representation. In python:
>>> bom = b'\xEF\xBB\xBF'
>>> bom
b'\xef\xbb\xbf'
>>> bom.decode('utf-8')
'\ufeff'
I do have a similar issue. Actually, I would not care about the BOM character. But I want to parse RFC#5424 lines that contain UTF-8 chars, e.g. umlauts and others. According to the cited RFC#5424 specs, the body part MUST then start with the Unicode character u+FEFF - which is 0xefbbbf, when encoded as UTF-8.
I'm also using this as part of Nifi. Nifi fails, whenever there is this BOM character or any non-ASCII UTF-8 character (with or without BOM).
I was able to reproduce the issue with modified unit tests in this module. I attached two log lines with umlaut chars - one with and one without BOM. The files are zipped, so that they cannot be modified just by uploading as "text/plain"...
Thank you, I will be addressing this as soon as I can. It may be a couple of days as my development computer is in for repairs. May I use these lines you have provided in tests?
Sure, if they are helpful.
BTW: Related unit tests in Nifi (BaseStrictSyslog5424ParserTest) are somewhat strange. The message body is e.g. "BOM'su root' ...
with the literal string BOM
where there should be a BOM character.
same experiences in nifi as @recype-io already described. a workaround is to replace any existing bom character with an empty string. and if there are umlauts in a log line, which raises a failure in the processor, there is a grok expression which handles the log in a right way..but it seems a bit ineffective..
BTW: Related unit tests in Nifi (BaseStrictSyslog5424ParserTest) are somewhat strange. The message body is e.g.
"BOM'su root' ...
with the literal stringBOM
where there should be a BOM character.
Looks like the example from the RFC
OK. I got my machine back and have a fix such that:
@Test
public void testBomParsingUmlauts() throws Exception {
Map<String, Object> map = handleFile("src/test/resources/log_utf8_umlauts.txt", NilPolicy.OMIT,
StructuredDataPolicy.MAP_OF_MAPS, EnumSet.of(AllowableDeviations.PRIORITY));
Assert.assertNotNull(map);
Assert.assertTrue(map.get("syslog.message").toString().contains("äöü"));
}
works.
bintray is closing down, so I need to move over to mvn central and do a release. I'll update. After the fix is in, I'll do a PR to NIFI to fix.
Thanks for the patience.
fix:
<dependency>
<groupId>com.github.palindromicity</groupId>
<artifactId>simple-syslog-5424</artifactId>
<version>0.0.16</version>
</dependency>
note, this release is in mvn central not bintray
Thanks for all the information
For the folks using Nifi: https://github.com/apache/nifi/pull/4978 has landed today, so if you build main you can get it, else it will be in the next release, whenever that is.
I am getting a token recognition error and I think it may have to do with the presence of the BOM before the message. I couldn't find a test for this. The line that gives the error has structured data, not sure if that influences anything.