hannesm / jackline

minimalistic secure XMPP client in OCaml
BSD 2-Clause "Simplified" License

XML escaped entities #150

Open hannesm opened 7 years ago

hannesm commented 7 years ago

it seems to be completely unclear how OTR-encrypted messages sent via XMPP should be escaped/encoded, especially how <, >, ", and & should be encoded:

This is crucial for transferring <3 sequences.

AFAICS, this issue is off-topic for the OTR spec, and there's no spec for OTR over XMPP.

Jackline follows the libpurple style when sending data; when receiving data it does one step of unescaping (and thus usually does the right thing, unless an irssi user intentionally sends a literal &lt;, which will be rendered as < in jackline).
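To make "one step of unescaping" concrete, here is a minimal sketch in OCaml. It is not Jackline's actual code: the names `escape` and `unescape_once` are hypothetical, and it assumes the sender escapes exactly the five characters XML predefines entities for (whether libpurple escapes precisely this set is an assumption).

```ocaml
(* Hypothetical sender side: escape the five characters that XML
   predefines entities for. *)
let escape (s : string) : string =
  let b = Buffer.create (String.length s) in
  String.iter
    (function
      | '<' -> Buffer.add_string b "&lt;"
      | '>' -> Buffer.add_string b "&gt;"
      | '&' -> Buffer.add_string b "&amp;"
      | '"' -> Buffer.add_string b "&quot;"
      | '\'' -> Buffer.add_string b "&apos;"
      | c -> Buffer.add_char b c)
    s;
  Buffer.contents b

(* Predefined entity names and the characters they stand for. *)
let predefined =
  [ "lt", '<'; "gt", '>'; "amp", '&'; "quot", '"'; "apos", '\'' ]

(* Hypothetical receiver side: undo exactly one level of escaping in a
   single left-to-right pass. Replacement output is never rescanned, so
   "&amp;lt;" becomes "&lt;" and not "<". Anything that is not a
   predefined entity is copied through unchanged. *)
let unescape_once (s : string) : string =
  let n = String.length s in
  let b = Buffer.create n in
  let rec go i =
    if i >= n then ()
    else if s.[i] <> '&' then (Buffer.add_char b s.[i]; go (i + 1))
    else
      match String.index_from_opt s i ';' with
      | Some j -> (
          let name = String.sub s (i + 1) (j - i - 1) in
          match List.assoc_opt name predefined with
          | Some c -> Buffer.add_char b c; go (j + 1)
          | None -> Buffer.add_char b '&'; go (i + 1))
      | None -> Buffer.add_char b '&'; go (i + 1)
  in
  go 0;
  Buffer.contents b
```

With this sketch, a peer that escapes round-trips cleanly (`unescape_once (escape "<3")` gives back `<3`), while a peer that sends the literal five characters `&lt;` without escaping has them shown as `<`, which is the irssi case described above.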

Jackline could try to be smart and figure out per contact what escaping to use (or have a per contact config option, since there doesn't seem to be a reliable way to figure this out programmatically). What should be the default, in the end?

hannesm commented 7 years ago

on a further note, Slack's XMPP gateway has the following behaviour:

[12:50:42] OUT TLS: <message type='groupchat'  id='WqTdGuPDSre+OcLx'><body>(ignore me, just testing some slack escaping here... &lt; &amp;gt;)</body></message>
[12:50:42] IN TLS: <message xmlns="jabber:client" type="groupchat" ts="1486039843.000139"><body>(ignore me, just testing some slack escaping here... &lt; &gt;)</body></message>
[12:51:02] IN TLS:  
[12:51:04] OUT TLS: <message type='groupchat' id='oNNUz60GgtYfXGhS'><body>hmm, but &apos; &amp; ?</body></message>
[12:51:05] IN TLS: <message xmlns="jabber:client" type="groupchat" ts="1486039866.000140"><body>hmm, but ' &amp; ?</body></message>
[12:51:31] OUT TLS: <message type='groupchat' id='0pPSK0ckOWZUe4ld'><body>and &quot; ?</body></message>
[12:51:32] IN TLS: <message xmlns="jabber:client" type="groupchat" ts="1486039892.000141"><body>and " ?</body></message>

which means that it automatically unescapes &apos; and &quot; to ' and " respectively, but leaves other XML entities alone. Unescaped " and ' (sent using /raw 2227) come back unchanged, while < and > (3c3e) are returned as &lt; and &gt;.

(we could decide to not escape anything based on the domain name in this case)
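A rough sketch of how such a decision could be expressed, combining a per-contact override with a domain-based fallback. Everything here is hypothetical: the type `escaping`, the functions `domain_of_jid` and `policy_for`, and the placeholder domain `gateway.example` are made up for illustration, and the default chosen in the last branch is arbitrary since the thread leaves that question open.

```ocaml
type escaping = Escape_entities | No_escaping

(* Hypothetical list of domains whose gateway expects unescaped bodies.
   "gateway.example" is a placeholder, not a real service. *)
let raw_domains = [ "gateway.example" ]

(* Extract the domain part of a JID of the form local@domain/resource.
   The resource is cut at the first '/', then the local part at the
   first '@'; both parts are optional. *)
let domain_of_jid (jid : string) : string =
  let bare =
    match String.index_opt jid '/' with
    | Some i -> String.sub jid 0 i
    | None -> jid
  in
  match String.index_opt bare '@' with
  | Some i -> String.sub bare (i + 1) (String.length bare - i - 1)
  | None -> bare

(* A per-contact override (keyed by the JID string as given) wins;
   otherwise the domain decides. Escaping as the fallback is an
   arbitrary choice made for this sketch. *)
let policy_for ?(overrides = []) (jid : string) : escaping =
  match List.assoc_opt jid overrides with
  | Some p -> p
  | None ->
    if List.mem (String.lowercase_ascii (domain_of_jid jid)) raw_domains
    then No_escaping
    else Escape_entities
```

For example, `policy_for "alice@gateway.example"` yields `No_escaping`, while `policy_for ~overrides:[ "bob@other.example", No_escaping ] "bob@other.example"` lets a single contact opt out regardless of domain.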

dbuenzli commented 7 years ago

which means that it automatically unescapes &apos; and &quot; to ' respectively ", but leaves other XML entities alone.

I guess it simply unescapes predefined entities and character references, that is, it does what any decent XML parser must do.

As for other entities, if xmpp doesn't define any in its vocabulary, I'd suggest to simply replace unknown entity references by the Unicode replacement character U+FFFD and log the offending entity name somewhere.
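A minimal sketch of that suggestion in OCaml, using only the standard library (4.06 or later for `Buffer.add_utf_8_uchar` and `Uchar.rep`). The names `char_ref` and `decode` and the log message format are hypothetical, and the sketch assumes the input is already well-formed character data; a real XML parser would do considerably more checking.

```ocaml
let predefined =
  [ "lt", "<"; "gt", ">"; "amp", "&"; "quot", "\""; "apos", "'" ]

let is_dec c = c >= '0' && c <= '9'
let is_hex c = is_dec c || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')

(* True if [s] is non-empty and every character satisfies [p]. *)
let all_chars p s =
  s <> "" && List.for_all p (List.init (String.length s) (String.get s))

(* Decode the text between '&' and ';' of a numeric character reference
   ("#65" or "#x41") into a Unicode scalar value, if it denotes one. *)
let char_ref (name : string) : Uchar.t option =
  let n = String.length name in
  if n < 2 || name.[0] <> '#' then None
  else
    let hex = name.[1] = 'x' in
    let off = if hex then 2 else 1 in
    let digits = String.sub name off (n - off) in
    if not (all_chars (if hex then is_hex else is_dec) digits) then None
    else
      match int_of_string_opt ((if hex then "0x" else "") ^ digits) with
      | Some i when Uchar.is_valid i -> Some (Uchar.of_int i)
      | _ -> None

(* Resolve predefined entities and character references; replace every
   other reference (including invalid numeric ones) by U+FFFD and pass
   its name to [log], which defaults to printing on stderr. *)
let decode ?(log = prerr_endline) (s : string) : string =
  let n = String.length s in
  let b = Buffer.create n in
  let rec go i =
    if i >= n then ()
    else if s.[i] <> '&' then (Buffer.add_char b s.[i]; go (i + 1))
    else
      match String.index_from_opt s i ';' with
      | None ->
        (* a bare '&' is not well-formed; this sketch just copies it *)
        Buffer.add_char b '&'; go (i + 1)
      | Some j ->
        let name = String.sub s (i + 1) (j - i - 1) in
        (match List.assoc_opt name predefined, char_ref name with
         | Some r, _ -> Buffer.add_string b r
         | None, Some u -> Buffer.add_utf_8_uchar b u
         | None, None ->
           log ("unknown entity reference: " ^ name);
           Buffer.add_utf_8_uchar b Uchar.rep);
        go (j + 1)
  in
  go 0;
  Buffer.contents b
```

Passing a custom `~log` function makes it possible to collect the offending names instead of printing them, for instance to show them in a client's debug buffer.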

hannesm commented 7 years ago

@dbuenzli thanks for the references.

Jackline does indeed replace unknown entities (and control characters) with U+FFFD. XMPP (and OTR over XMPP) do not specify escaping precisely enough; the behaviour of client implementations differs.

(mcabber seems not to display any (unencrypted) messages sent by jackline that contain < (which is escaped to &lt; before being sent to the server) - nothing to fix in jackline, but nevertheless inconvenient for users)

dbuenzli commented 7 years ago

XMPP (and OTR over XMPP) do not specify escaping sufficiently precise - the behaviour of client implementations differs.

I don't know the standards and the exact context so take this with a grain of salt -- I could have a look if you can point to what you think is imprecise -- but I suspect that this is a misunderstanding of XML by implementers rather than imprecision.