RFC5424: error on non UTF-8 free-form message

influxdata / go-syslog

Blazing fast syslog parser

MIT License


RFC5424: error on non UTF-8 free-form message #21

Closed redrampage closed 4 years ago

redrampage commented 5 years ago

Hi, there seems to be a problem with parsing RFC5424 messages that contain non-UTF-8 bytes or sequences in the free-form message field (MSG). The parser returns the following error:

expecting a free-form optional message in UTF-8 (starting with or without BOM)

But according to RFC5424, this field may contain data in any encoding. Could you please make the parser more relaxed about this?

Thanks!

leodido commented 5 years ago

Hello @redrampage, at the moment the parser simply implements what the grammar mandates:

MSG             = MSG-ANY / MSG-UTF8
MSG-ANY         = *OCTET ; not starting with BOM
MSG-UTF8        = BOM UTF-8-STRING
BOM             = %xEF.BB.BF
UTF-8-STRING    = *OCTET ; UTF-8 string as specified
                         ; in RFC 3629
OCTET           = %d00-255
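Read literally, the grammar above only requires valid UTF-8 when MSG starts with the BOM; otherwise MSG-ANY admits arbitrary octets. A minimal sketch of that reading follows. The helper name `msgKind` is hypothetical and is not part of go-syslog, whose parser is a generated state machine rather than a hand-written check like this one.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// bom is the UTF-8 byte order mark, %xEF.BB.BF in the grammar.
var bom = []byte{0xEF, 0xBB, 0xBF}

// msgKind reports which MSG alternative a payload falls under and whether
// it satisfies that alternative. Hypothetical helper for illustration only.
func msgKind(msg []byte) (kind string, ok bool) {
	if bytes.HasPrefix(msg, bom) {
		// MSG-UTF8: the octets after the BOM must be valid UTF-8 (RFC 3629).
		return "MSG-UTF8", utf8.Valid(msg[len(bom):])
	}
	// MSG-ANY: any octets, as long as they do not start with the BOM.
	return "MSG-ANY", true
}

func main() {
	samples := [][]byte{
		{0xEF, 0xBB, 0xBF, 'h', 'i'}, // BOM followed by valid UTF-8
		{0xEF, 0xBB, 0xBF, 0xFF},     // BOM followed by an invalid byte
		{0xFF, 0xFE, 'x'},            // no BOM, not valid UTF-8
	}
	for _, s := range samples {
		kind, ok := msgKind(s)
		fmt.Printf("%q: %s ok=%v\n", s, kind, ok)
	}
}
```

Under this reading, the third sample is an acceptable MSG-ANY even though it is not valid UTF-8, which is the case the issue is about.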

Anyway, implementing an option that disables the rejection of invalid UTF-8 sequences is a good idea, imho.

@goller WDYT?

goller commented 5 years ago

Hey @leodido, I go back and forth on whether we should accept invalid UTF-8 sequences.

On one hand there are many, many loggers that do not get the format correct so it would be nice to help library users; on the other hand I worry that allowing invalid UTF-8 sequences would decrease performance.

Would allowing invalid UTF-8 decrease performance substantially?

leodido commented 5 years ago

Hey @goller, first of all let's clarify that this feature, if implemented, would be a parsing option (off by default).

Now, my reasoning about performance.

My intuition is that with this option on, performance would not decrease at all.

This is because, in that case, the generated FSA has fewer states and transitions than the one that checks for valid/accepted UTF-8 sequences. Thus, I expect parsing with the option on to be at least as fast (to be conservative).

Anyway, it's worth a try, if only to verify whether my intuition is right :)
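To make the "parsing option, off by default" idea concrete, here is a rough sketch using Go's functional options pattern. Everything in it is an assumption for illustration: the `parser` type, the `withLenientMsg` option, and the shortened error text are hypothetical, and a runtime `utf8.Valid` call only stands in for what the real parser does inside its generated FSA.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// parser is a hypothetical stand-in for the real state machine.
type parser struct {
	lenientMsg bool // false by default, so strict UTF-8 checking stays on
}

// option configures a parser (functional options pattern).
type option func(*parser)

// withLenientMsg is a hypothetical option that disables UTF-8 validation of MSG.
func withLenientMsg() option {
	return func(p *parser) { p.lenientMsg = true }
}

// newParser builds a parser and applies the given options in order.
func newParser(opts ...option) *parser {
	p := &parser{}
	for _, o := range opts {
		o(p)
	}
	return p
}

// errNotUTF8 uses an illustrative message, not the library's exact wording.
var errNotUTF8 = errors.New("expecting a free-form optional message in UTF-8")

// parseMsg handles only the MSG field. In lenient mode the validation
// branch is skipped entirely, so the bytes are returned untouched.
func (p *parser) parseMsg(msg []byte) ([]byte, error) {
	if !p.lenientMsg && !utf8.Valid(msg) {
		return nil, errNotUTF8
	}
	return msg, nil
}

func main() {
	raw := []byte{'o', 'k', 0xFF} // trailing byte is not valid UTF-8

	_, err := newParser().parseMsg(raw)
	fmt.Println("strict: ", err)

	out, err := newParser(withLenientMsg()).parseMsg(raw)
	fmt.Println("lenient:", len(out), err)
}
```

In this sketch the lenient path does strictly less work than the strict one, which matches the intuition above; whether the same holds for the generated FSA would still need a benchmark.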

leodido commented 4 years ago

/assign

(when I have some spare time) :D