Message parsing has platform default encoding dependent behavior

usnistgov / jsip

JSIP: Java SIP specification Reference Implementation (moved from java.net)

Message parsing has platform default encoding dependent behavior #5

Open durban opened 8 years ago

durban commented 8 years ago

This line calls String#getBytes without specifying an encoding. This can cause problems depending on the default encoding of the platform.

For example, if the default encoding is US-ASCII, the getBytes call replaces every Unicode character that US-ASCII cannot represent with a question mark.
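
A minimal sketch of that replacement, using only the JDK. The class and method names (`GetBytesDemo`, `encode`) are made up for illustration and are not part of JSIP; the charset is passed explicitly to mimic what the no-argument getBytes would do on a platform with that default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesDemo {
    // "\u00e9\u0171" is the two-character string "éű". Unicode escapes are used
    // so that the encoding of this source file does not affect the result.
    static final String BODY = "\u00e9\u0171";

    // Hypothetical helper: encode with an explicit charset instead of the platform default.
    static byte[] encode(String s, Charset charset) {
        return s.getBytes(charset);
    }

    public static void main(String[] args) {
        // US-ASCII cannot represent either character, so each becomes '?' (byte 63).
        System.out.println(Arrays.toString(encode(BODY, StandardCharsets.US_ASCII))); // [63, 63]
        // UTF-8 needs two bytes for each of these characters.
        System.out.println(Arrays.toString(encode(BODY, StandardCharsets.UTF_8)));    // [-61, -87, -59, -79]
    }
}
```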

See also RestComm/jain-sip/issues/111.

vladimirralev commented 8 years ago

That's a known behavior. Is there some problem with it?

durban commented 8 years ago

Yes. For example, if the platform encoding is US-ASCII, MessageFactoryImpl.createRequest refuses to parse a perfectly legal String such as this:

PUBLISH sip:bob@biloxi.example.com SIP/2.0
Call-ID: 35516046df6aa32736ef49c28e98de93@127.0.0.1
CSeq: 1 PUBLISH
From: "Alice" <sip:alice@atlanta.example.com>;tag=9fxced76sl
To: "Bob" <sip:bob@biloxi.example.com>
Max-Forwards: 70
Content-Type: text/html
Content-Length: 4

éű

The exception looks like this:

java.text.ParseException: Invalid content length 2 / 4
  at gov.nist.javax.sip.message.SIPMessage.setMessageContent(SIPMessage.java:1373)
  at gov.nist.javax.sip.parser.StringMsgParser.parseSIPMessage(StringMsgParser.java:204)
  at gov.nist.javax.sip.message.MessageFactoryImpl.createRequest(MessageFactoryImpl.java:736)
  ...

(I guess the content length mismatch is due to the 2-byte characters being replaced by 1-byte question marks.)

vladimirralev commented 8 years ago

Sure, but this is expected, since we have no way of knowing the correct encoding. If we hardcode one, parsing will fail again as soon as a different encoding comes up. So it's up to the sysadmin to set the correct system encoding, no?

durban commented 8 years ago

Oh, I think I'm starting to understand the problem. Correct me if I'm wrong, but since the content length is counted in bytes while a String holds characters, the createRequest method has no way of unambiguously parsing the String. So it assumes the platform encoding. (Whether that is a good default is debatable, I think.)
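
To make the ambiguity concrete, here is a small JDK-only sketch that checks which charsets make the declared Content-Length of 4 agree with the two-character body from the message above. `ContentLengthCheck` and `lengthMatches` are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ContentLengthCheck {
    // True if the body, encoded with the given charset, has exactly the declared byte count.
    static boolean lengthMatches(String body, int declaredLength, Charset charset) {
        return body.getBytes(charset).length == declaredLength;
    }

    public static void main(String[] args) {
        String body = "\u00e9\u0171"; // "éű", written with escapes to be source-encoding independent
        Charset[] candidates = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };
        for (Charset cs : candidates) {
            // Only UTF-8 yields 4 bytes here; the other two produce 2 bytes.
            System.out.println(cs + ": " + lengthMatches(body, 4, cs));
        }
    }
}
```

The same String therefore satisfies or violates the header depending on a setting that is not part of the message.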

This would mean that the createRequest method cannot be implemented correctly (by "correctly" I mean returning a correctly parsed message on any JVM). If so, the current behavior is unfortunate but understandable. (Another question, then, is why that method is in the API at all.)

I guess we have to find a way to parse directly from bytes to get a deterministic behavior ... all right, thanks for your help!
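
A rough sketch of what working from bytes could look like, using only the JDK. This is not JSIP code: `RawSipBody`, `indexOfHeaderEnd` and `extractBody` are hypothetical names, the header section is assumed to be UTF-8 with CRLF line endings, and only the long form of the Content-Length header name is recognized.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RawSipBody {
    // Index of the first CRLF CRLF sequence (end of the header section), or -1 if absent.
    static int indexOfHeaderEnd(byte[] m) {
        for (int i = 0; i + 3 < m.length; i++) {
            if (m[i] == '\r' && m[i + 1] == '\n' && m[i + 2] == '\r' && m[i + 3] == '\n') {
                return i;
            }
        }
        return -1;
    }

    // Returns exactly Content-Length bytes of body, counted on the raw bytes,
    // so the result does not depend on the platform default encoding.
    static byte[] extractBody(byte[] message) {
        int headerEnd = indexOfHeaderEnd(message);
        if (headerEnd < 0) {
            throw new IllegalArgumentException("No blank line after headers");
        }
        // Assumption of this sketch: the header section is decoded as UTF-8.
        String head = new String(message, 0, headerEnd, StandardCharsets.UTF_8);
        int declared = -1;
        for (String line : head.split("\r\n")) {
            int colon = line.indexOf(':');
            if (colon > 0 && line.substring(0, colon).trim().equalsIgnoreCase("Content-Length")) {
                declared = Integer.parseInt(line.substring(colon + 1).trim());
            }
        }
        int bodyStart = headerEnd + 4; // skip the CRLF CRLF separator
        int available = message.length - bodyStart;
        if (declared < 0 || declared > available) {
            throw new IllegalArgumentException(
                "Content-Length " + declared + " does not fit " + available + " available bytes");
        }
        return Arrays.copyOfRange(message, bodyStart, bodyStart + declared);
    }

    public static void main(String[] args) {
        String text = "PUBLISH sip:bob@biloxi.example.com SIP/2.0\r\n"
                + "Content-Type: text/html\r\n"
                + "Content-Length: 4\r\n"
                + "\r\n"
                + "\u00e9\u0171"; // "éű"
        // The sender picks the encoding explicitly; the receiver only ever counts bytes.
        byte[] wire = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(extractBody(wire).length); // 4
    }
}
```

Interpreting the body bytes as text is then a separate step, which can use whatever charset the Content-Type header announces.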