Closed redrampage closed 4 years ago
Hello @redrampage, at the moment the parser simply implements what the grammar mandates:
MSG = MSG-ANY / MSG-UTF8
MSG-ANY = *OCTET ; not starting with BOM
MSG-UTF8 = BOM UTF-8-STRING
BOM = %xEF.BB.BF
UTF-8-STRING = *OCTET ; UTF-8 string as specified
; in RFC 3629
OCTET = %d00-255
Anyway, implementing an option that disables the rejection of invalid UTF-8 sequences is a good idea, imho.
@goller WDYT?
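For readers following along, here is a minimal sketch of one reading of the grammar quoted above: a MSG that starts with the BOM must continue as valid UTF-8, while a MSG without the BOM may hold arbitrary octets. It is standalone Go using only the standard library. The name `msgKind` and its return values are made up for illustration and are not part of the parser discussed here.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// bom is the UTF-8 byte order mark (BOM = %xEF.BB.BF in the grammar).
var bom = []byte{0xEF, 0xBB, 0xBF}

// msgKind classifies a MSG payload following the ABNF quoted above.
// It is a hypothetical helper, not the library's implementation.
func msgKind(msg []byte) (kind string, ok bool) {
	if bytes.HasPrefix(msg, bom) {
		// MSG-UTF8: everything after the BOM must be valid UTF-8 (RFC 3629).
		return "MSG-UTF8", utf8.Valid(msg[len(bom):])
	}
	// MSG-ANY: any octets are allowed as long as there is no leading BOM.
	return "MSG-ANY", true
}

func main() {
	samples := [][]byte{
		[]byte("plain ascii"),
		{0xFF, 0xFE, 'x'}, // not UTF-8, no BOM
		append(append([]byte{}, bom...), []byte("héllo")...), // BOM + valid UTF-8
		append(append([]byte{}, bom...), 0xFF),               // BOM + invalid UTF-8
	}
	for _, s := range samples {
		kind, ok := msgKind(s)
		fmt.Printf("%q -> %s accepted=%v\n", s, kind, ok)
	}
}
```

Under this reading, only the last sample is rejected; bytes that are not UTF-8 are acceptable whenever the BOM is absent.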
Hey @leodido, I go back and forth on whether we should accept invalid UTF-8 sequences.
On one hand there are many, many loggers that do not get the format correct so it would be nice to help library users; on the other hand I worry that allowing invalid UTF-8 sequences would decrease performance.
Would allowing invalid UTF-8 decrease performance substantially?
Hey @goller, first of all let's clarify that this feature will eventually be a parsing option (off by default).
Then, my reasoning about performance.
My intuition is that with this option on, performance would not decrease at all.
This is because in that case the generated FSA has fewer edges and arcs than when we check for valid/accepted UTF-8 sequences. Thus, to be conservative, I expect parsing in this case to perform at least as well.
Anyway, it's worth a try, if only to verify whether my intuition is right :)
/assign
(when I have some spare time) :D
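As a rough illustration of the opt-in parsing option described above (off by default), the sketch below uses Go's functional-options pattern. Everything in it is an assumption made for the example: the `parser` type, `withInvalidUTF8`, `newParser`, and `checkMsg` are hypothetical names, and the check is a plain `utf8.Valid` call rather than the generated state machine, so it says nothing about the real parser's performance.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// parser is a stand-in type for this sketch, not the library's parser.
type parser struct {
	allowInvalidUTF8 bool // zero value is false, so strict checking is the default
}

// option configures a parser at construction time.
type option func(*parser)

// withInvalidUTF8 is a hypothetical option that disables the UTF-8 check.
func withInvalidUTF8() option {
	return func(p *parser) { p.allowInvalidUTF8 = true }
}

// newParser builds a parser and applies the given options in order.
func newParser(opts ...option) *parser {
	p := &parser{}
	for _, o := range opts {
		o(p)
	}
	return p
}

var errInvalidUTF8 = errors.New("invalid UTF-8 in MSG")

// checkMsg validates only the MSG payload; the rest of a syslog
// message is out of scope for this sketch.
func (p *parser) checkMsg(msg []byte) error {
	if !p.allowInvalidUTF8 && !utf8.Valid(msg) {
		return errInvalidUTF8
	}
	return nil
}

func main() {
	raw := []byte{'h', 'i', 0xFF} // 0xFF can never appear in valid UTF-8

	fmt.Println("strict: ", newParser().checkMsg(raw))
	fmt.Println("relaxed:", newParser(withInvalidUTF8()).checkMsg(raw))
}
```

Keeping the flag's zero value as the strict behaviour means existing callers see no change unless they pass the option explicitly.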
Hi, there seems to be a problem with parsing RFC 5424 messages that contain non-UTF-8 bytes/sequences in the free-form message field (MSG). The parser returns the following error:
But according to RFC 5424, this field may contain data in any encoding. Could you please make the parser more relaxed about this?
Thanks!