secondlife / jira-archive

2 stars 0 forks source link

[BUG-234162] "Notation" LLSD character set problem when encapsulated inside XML LLSD. #11136

Open sl-service-account opened 1 year ago

sl-service-account commented 1 year ago

What just happened?

I'm implementing glTF materials in the Sharpview viewer, which is all in Rust. Rust makes a strong distinction between UTF-8 strings and byte strings. So I have to resolve some ambiguities in the internal data representations.

The new glTF materials use XML LLSD to encapsulate Notation LLSD which, in turn, encapsulates glTF, which is stored as JSON. See attached file "materialexample.txt" Unfortunately, not all those formats use the same string encoding.

JSON is required by standard to use UTF-8 encoding. XML by default today uses UTF-8 encoding. (UTF-16 with byte order marks is still supported by the XML spec, though. Unclear if SL uses this.) Notation, though, has ambiguous encoding.

Most of the Notation syntax is unambiguous, but there is a problem with the "counted" forms of binary and strings. Those have a count, followed by N raw bytes. This can result in an invalid UTF-8 string, one which contains byte patterns not valid for UTF-8. That will probably break something if that feature is used for Materials or Puppetry and there's any way user-chosen text can get into that content. Somebody is going to stick in an emoji and cause an obscure bug.

The documented format for Notation, https://wiki.secondlife.com/wiki/LLSD has an example where an LSL script is encoded, not as a string, but as binary data with a byte count. See the example which begins

b(158)"default { state_entry() { llSay(0, "Hello, Avatar!"); } ...

The 158 is a count of bytes, not characters. from what the LL C++ code does. This looks like a legacy representation from before UTF-8, so I suppose we are stuck with that for now. But it is something to be avoided in a text-like format.

What were you doing when it happened?

Implementing LLSD in Rust for the Sharpview viewer.

What were you expecting to happen instead?

An unambiguous representation for LLSD.

Suggestions:

  1. Don't try to use the byte-counted forms of Notation LLSD for strings and binary data encapsulated inside XML LLSD. (So far, I haven't seen this done anywhere, but as more glTF features are added it might slip in.) The backslash escaped strings and base 64 binary are fine. So is binary LLSD.

  2. Preferably, don't use Notation LLSD encapsulated inside XML LLSD at all.

  3. Preferably, don't use the byte-counted forms in Notation at all.

  4. Preferably, use UTF-8 for anything that goes over the wire.

Thanks.

(I know, I'm being picky, but having to write bug-compatible new code requires analysis like this. I still have two in-world prims and three sculpts that Sharpview does not render correctly because users found usefully exploitable bugs in the existing code. LL will face all this with the mobile viewer, I expect.)

Other information

This is an internals problem for materials viewers and puppetry clients.

Attachments

Original Jira Fields | Field | Value | | ------------- | ------------- | | Issue | BUG-234162 | | Summary | "Notation" LLSD character set problem when encapsulated inside XML LLSD. | | Type | Bug | | Priority | Unset | | Status | Accepted | | Resolution | Triaged | | Labels | pbr | | Reporter | animats (animats) | | Assignee | Runitai Linden (runitai.linden) | | Created at | 2023-07-22T21:21:51Z | | Updated at | 2023-07-25T19:14:14Z | ``` { 'Build Id': 'unset', 'Business Unit': ['Platform'], 'Date of First Response': '2023-07-22T18:14:35.478-0500', "Is there anything you'd like to add?": 'This is an internals problem for materials viewers and puppetry clients. ', 'ReOpened Count': 0.0, 'Severity': 'Unset', 'System': 'SL Simulator', 'Target Viewer Version': 'viewer-development', 'What just happened?': 'I\'m implementing glTF materials in the Sharpview viewer, which is all in Rust. Rust makes a strong distinction between UTF-8 strings and byte strings. So I have to resolve some ambiguities in the internal data representations. \r\n\r\nThe new glTF materials use XML LLSD to encapsulate Notation LLSD which, in turn, encapsulates glTF, which is stored as JSON. See attached file "materialexample.txt" Unfortunately, not all those formats use the same string encoding.\r\n\r\nJSON is required by standard to use UTF-8 encoding. XML by default today uses UTF-8 encoding. (UTF-16 with byte order marks is still supported by the XML spec, though. Unclear if SL uses this.) Notation, though, has ambiguous encoding.\r\n\r\nMost of the Notation syntax is unambiguous, but there is a problem with the "counted" forms of binary and strings. Those have a count, followed by N raw bytes. This can result in an invalid UTF-8 string, one which contains byte patterns not valid for UTF-8. That will probably break something if that feature is used for Materials or Puppetry and there\'s any way user-chosen text can get into that content. Somebody is going to stick in an emoji and cause an obscure bug.\r\n\r\nThe documented format for Notation, https://wiki.secondlife.com/wiki/LLSD\r\nhas an example where an LSL script is encoded, not as a string, but as binary data with a byte count. See the example which begins\r\n\r\nb(158)"default\r\n{\r\n state_entry()\r\n {\r\n llSay(0, "Hello, Avatar!");\r\n }\r\n...\r\n\r\nThe 158 is a count of bytes, not characters. from what the LL C++ code does. This looks like a legacy representation from before UTF-8, so I suppose we are stuck with that for now. But it is something to be avoided in a text-like format.', 'What were you doing when it happened?': 'Implementing LLSD in Rust for the Sharpview viewer.', 'What were you expecting to happen instead?': "An unambiguous representation for LLSD.\r\n\r\nSuggestions:\r\n\r\n1. Don't try to use the byte-counted forms of Notation LLSD for strings and binary data encapsulated inside XML LLSD. (So far, I haven't seen this done anywhere, but as more glTF features are added it might slip in.) The backslash escaped strings and base 64 binary are fine. So is binary LLSD.\r\n\r\n2. Preferably, don't use Notation LLSD encapsulated inside XML LLSD at all.\r\n\r\n3. Preferably, don't use the byte-counted forms in Notation at all.\r\n\r\n4. Preferably, use UTF-8 for anything that goes over the wire.\r\n\r\nThanks.\r\n\r\n(I know, I'm being picky, but having to write bug-compatible new code requires analysis like this. I still have two in-world prims and three sculpts that Sharpview does not render correctly because users found usefully exploitable bugs in the existing code. LL will face all this with the mobile viewer, I expect.)", } ```
sl-service-account commented 1 year ago

animats commented at 2023-07-22T21:51:53Z

Related forum discussion: https://community.secondlife.com/forums/topic/501635-character-set-for-notation-llsd

sl-service-account commented 1 year ago

Chaser Zaks commented at 2023-07-22T23:14:35Z

I'm confused? Are you saying that LLSD Notation is using character count or byte count? The correct method is byte count. The only thing I can think of happening here is you are reading LLSD notation as UTF-8, in which case, you should assume LLSD Notation as unencoded raw bytes, and parse it from there, decoding as needed. If the actual C++ implementation is counting characters, then my implementation(and assuming many others) are incorrect.

sl-service-account commented 1 year ago

animats commented at 2023-07-23T02:02:19Z

LLSD notation is using byte count for counted fields, per the C++ code. But with Material Overrides (which come in as Long Generic Messages via the event polling system) it's embedded in UTF-8 XML, which cannot reliably contain unencoded bytes. If you put binary inside XML, you're supposed to use base 64 encoding. GLTF material overrides embed Notation inside XML. This works because the content is currently so restricted that no non-ASCII characters appear, and the byte counted encoding forms don't seem to be in use. If that changes, things will start to break.

It's the nesting that causes trouble. Raw LLSD notation can work as a byte stream, but stick it inside XML as a string and you're saying that it's UTF-8 encoded.

sl-service-account commented 1 year ago

Chaser Zaks commented at 2023-07-23T02:19:00Z

oh, huh.. I wonder why it is using two different serializations then? Should just be XML or LLSD Notation, not both.

sl-service-account commented 1 year ago

animats commented at 2023-07-23T03:23:43Z

two different serializations.

Actually three. Inside the Notation is glTF in JSON format. I'd argue for going all UTF-8. We know that's sound. This mixed stuff is going to have corner cases. Possibly a griefing exploit. (Not going to go into details here; don't want this JIRA to go to SEC hidden mode.)

There was some discussion of this at Creator User Group, and at least one of the Lindens wasn't that happy about it. But at that time, it was just about the additional parsing complexity, not a real UTF-8 vs bytes conflict.

Puppetry may have the same problem, because it uses both notation and counts. But I haven't looked at that.

sl-service-account commented 1 year ago

animats commented at 2023-07-23T19:57:00Z, updated at 2023-07-23T19:57:22Z

One backwards-compatible way out of this is two different versions of Notation: Notation (String form), and notation (bytes form).

Notation (bytes form) - used for older uses of Notation as shown in wiki documentation.

Notation (String form) – used for new glTF Material Override info and anything else encapsulated in XML.

sl-service-account commented 1 year ago

Runitai Linden commented at 2023-07-24T18:46:40Z

Good catch.

I like the idea of doing binary LLSD at this point – size optimal and the deserializer on the viewer side doesn't care.  Also surprised that the string isn't already UTF-8, that is definitely an oversight.

 

sl-service-account commented 1 year ago

Runitai Linden commented at 2023-07-24T18:58:51Z

Some context:

What you're seeing is protocol encapsulation.  The "LargeGenericMessage" capability is just a dispatcher of encapsulated data to various functions, and the data that those functions take is opaque to the LargeGenericMessage layer.  The "payload" is going to be a string no matter what and won't be a continuation of the outermost LLSD map.  I do think that binary here makes more sense than notation, but a pure JSON capability would be ideal, and we might go that route when we add support for GLTF Scenes during the next phase of GLTF compatibility development.

  

sl-service-account commented 1 year ago

Runitai Linden commented at 2023-07-24T22:46:58Z

Well, this sucks.

Found where the encapsulation falls off and it's from a copy/paste of another spot that uses this message.  For reasons that remain mysterious to me, the deserializer is explicity using the XML path, which is somehow compatible with becoming notation all-of-a-sudden mid stream, but not with binary.  Looking into encapsulating a binary serialize into the XML as a UTF-8 string.

sl-service-account commented 1 year ago

Runitai Linden commented at 2023-07-25T17:10:12Z

Well, this truly sucks.

Burned some hours on this and got effectively nowhere.  There's probably a secret handshake that gets the simulator to produce a properly escaped binary or notation encoding, but the methods available to the simulator appear to be unreliable for encapsulated JSON.

Tapped a server dev for help, but this is unlikely to get fixed while we're still using LargeGenericMessage and LLSD for sending/receiving encapsulated JSON, and it's too late in the dev cycle to overhaul the message protocol and re test everything.

The list of problems caused by LLSD and the two-data-stream shenanigans that are ObjectUpdate vs LargeGenericMessage and other capabilities is not short.  In GLTF Phase 2, we'll likely set up a pure JSON data stream for sending GLTF data to/from the simulator (maybe even with that newfangled HTTP3!), but for this release, it's going to be this weird mix of LLSD XML/Notation/JSON.

 

sl-service-account commented 1 year ago

animats commented at 2023-07-25T18:54:28Z

Attached "eventdump.txt", which contains JSON, encapsulated inside Notation, encapsulated inside XML LLSD. This was obtained by connecting to "Materials1" on Aditi with Sharpview and logging the region's HTTP "event" stream. There wasn't any visible problem in getting the simulator to send that data.

"Event reply body" is the raw XML encoding of LLSD. All quotes are in " form at this point.

"Agent data item" is the Notation source obtained from the XML LLSD. At this point, quotes in the JSON are preceded by a backslash.

"Parsed notation" is a dump of the tree structure created by the Notation parser in a new version of my public Rust crate "serde-llsd", which is LGPL-licensed and can be found at

https://github.com/John-Nagle/serde-llsd

in the new branch "notation".

"JSON param" is the JSON string extracted from that tree. Quotes are now just quotes.

At this point, I think I have sane parsing logic for all this. I had to implement two "Notation" parsing modes, one from UTF-8 strings (as used by XML and thus XML LLSD) and from byte streams (as used in the wiki example of script uploading). Byte counted strings are only accepted in byte stream mode. I haven't seen any of those in XML LLSD containing notation. As discussed above, that wouldn't work right.

Re: "The list of problems caused by LLSD and the two-data-stream shenanigans that are ObjectUpdate vs LargeGenericMessage and other capabilities is not short." Agreed. I think this can all be made to work, at some cost in complexity. There are many sharp edges involved. Sending byte-counted data over the XML UTF-8 channel is the main non-obvious no-no.

sl-service-account commented 1 year ago

animats commented at 2023-07-25T19:14:15Z

Re: "secret handshake": Getting the server to respond to an event poll requires a POST request. GET won't work. The POST content is supposed to be LLSD XML with a sequence number and a "done" flag, but when I started testing, I just posted the word "TEST" and the server started sending me LLSD XML encoded events, with sequential sequence numbers starting at 1.