pawitp / protobuf-decoder

JavaScript-based web UI to decode ad-hoc Protobuf data
https://protobuf-decoder.netlify.app/
MIT License
392 stars 91 forks source link

String mis-decoded as Protobuf #5

Open pawitp opened 5 years ago

pawitp commented 5 years ago

The string "123-456-789" is mistakenly decoded as an inner protobuf message. We should provide a way for users to override the decode format.

konsumer commented 5 months ago

Yeh, I found this in my own libs, too (see notes on #81 ) One idea I had on protoquery was a sort of "choices" interface. Like if there was a select-box on every branch to say "this should be parsed as a string" or "this FIXED32 is a float" you could get better output, because you can say "this LEN field is really a string, even though it's possible to parse as a sub-message" it also works for packed repeat fields (see #2 ) Here is the format I imagine for choices:

{
  "1.2.4": "string"
}

Essentially a flat JSON of associative-array that has full path and expected type. This way we could load a choices object before parsing, and use it for display, and I think it could also help with generating JSON and even proto SDL (using my rawproto lib.) I was thinking about doing my own seperate UI project that uses these ideas, but I would rather merge with yours, so we can share some of this work. If we could get this in the UI here, then we could use the "choices" to generate JSON and proto SDL that is pretty accurate.

I propose these as valid return-types:

VARINT - var
FIXED64 - i64, u64, double
LEN - string, bytes, packedvar, packed32, packed64
FIXED32 - i32, u32, bool, f32

Along with raw which is the full raw object (with parts and leftOver)

Which are what I support in protoquery. I put some work in to support your raw format, and I have other things implemented there, like parsing packed-repeat fields & groups.

I was also thinking that the types might be able to be simplified a little:

VARINT - int
FIXED64 - int, uint, float
LEN - string, bytes, packedvar, packed32, packed64
FIXED32 - int, uint, bool, float

So there are less selection-types (easier for user) but it will figure out how to parse it based on type + wireType (so FIXED64 + "float" is "double" from above.)

It might even be possible to offer higher-level choices like common google proto messages (DateTime for example) at the LEN-level, if it could match (like all the sub-fields match a signature we know about.)

Another thing to consider is groups. I currently treat GROUPSTART messages like repeat messages, and pull all the bytes (up to and including GROUPEND) in protoquuery, so I can do this:

const medias = query(appTree, '10:bytes').map(i => {
  const t = getTree(i)
  return {
    type: query(t, '1:var').pop(),
    url: query(t, '5:string').pop()
  }
})
expect(medias.length).toEqual(10)
const icon = medias.find(m => m.type === 4).url
const screenshots = medias.filter(m => m.type === 1).map(m => m.url)
const videos = medias.filter(m => m.type === 3).map(m => m.url)
const videoThumbs = medias.filter(m => m.type === 13).map(m => m.url)

and get this:

{
  "icon": "https://play-lh.googleusercontent.com/qTt7JkhZ-U0kevENyTChyUijNUEctA3T5fh7cm8yzKUG0UAnMUgOMpG_9Ln7D24NbQ",
  "screenshots": [
    "https://play-lh.googleusercontent.com/m-S0SqOv428DZcm46NJlyv0pffYpfsNjWz6iyf9LVM1TCWbzWs3clWaugjfzXXnCTbY",
    "https://play-lh.googleusercontent.com/WYyQPx2lhnCQYueR8q99NhIsxzJZX4NIHdlf0UxoPiwrHg1WlYGiFTy51F1aBZxaslE",
    "https://play-lh.googleusercontent.com/ioUSrp79Yx_rZikYRI_XtK_mdP5kavcm9oWDWYrgVqXfS4bpW46y41lbGTQJT7tPn7ej",
    "https://play-lh.googleusercontent.com/-8X4UtNoQm8k7cX7q5jT2fQmFFQbSJnreoxensBZQ7kuIs9d8FWwtyxM3QvqKd4SdBOA",
    "https://play-lh.googleusercontent.com/_08ljE8yjFzGuroCP-6cemh2LIrQ_zptq5K4KLahsqAsCq5cjLSU-ghzkHhf2_0UZ9c",
    "https://play-lh.googleusercontent.com/ofJcMSH2oJKzax3lscP-IJCZecGnkt7XWpt7L1J7KVhridE7wejHWqhrl4oV000sty1E"
  ],
  "videos": [
    "https://youtu.be/XT7YEb9_Muw"
  ],
  "videoThumbs": [
    "https://i.ytimg.com/vi/XT7YEb9_Muw/hqdefault.jpg"
  ]
}

It works, but it's a bit awkward to have to pull the sub-messages first (10:bytes) then parse each as a tree. I would rather it worked more like this:

const mediaTypes = query(appTree, '10.1:var')
const mediaUrls = query(appTree, '10.5:string')

but I kept having issues with getting it to query correctly. The other method works, but I think it's a still bit clunky, and will keep trying to improve it. I am tracking it as an issue at protoquery#1

pawitp commented 5 months ago

Yes, having a select dropdown box to select the decoding type is what I imagined when I wrote this issue. It would be smart though and:

  1. Only show dropdown if there are multiple ways to decode the field
  2. In the dropdown, only valid choices are displayed

The dropdown would also be needed for numbers which can be decoded multiple ways, but we are simply displaying all choices at the moment because of lack of dropdown support.

I'm not sure what you mean by GROUPSTART/GROUPEND.

konsumer commented 5 months ago

I'm not sure what you mean by GROUPSTART/GROUPEND.

Wiretype 3 & 4. Here they call it SGROUP and EGROUP.

These are deprecated in proto3, but still show up a lot in old proto (like when parsing google play or other proto2 traffic.) You can see it in my protoquery unit-tests (hearthstone is google play detail record.) The example above is a group that does not parse in yours, but works in mine.

Working with groups is pretty simple, essentially you just grab bytes from between SGROUP and EGROUP and parse as a repeat message. It's always messages, so you cannot directly have other LEN types (packed repeated or strings/bytes) at top, but they can be part of the repeated sub-message. It's the only wire-type that has a start & end byte-marker.

I have decided to work on my own separate rawproto UI tool, since I think we have different ideas about how it should fit together, tooling, and how the actual UI should work, and I have already made libs for doing things like parsing the binary tree initially, outputting JSON or proto SDL, and doing queries, that I am anxious to merge into rawproto, along with my own ideas for a UI (lazy message-branch eval, choices, more field-types, etc.) I wrote a book that uses some of my libs, and even though it's been a little while (5 years or so) I constantly have ideas for making the tooling better.

Good luck!

pawitp commented 5 months ago

Good luck with your project!