wader / fq

jq for binary formats - tool, language and decoders for working with binary and text formats

leveldb: Add LevelDB support #824

Closed: mikez closed this 7 months ago

mikez commented 7 months ago

@wader Thank you for the comments. I've now also read dev.md and made updates accordingly.

mikez commented 7 months ago

@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed?

Background: I added a d.FieldRawLen("compressed",…) to be consistent with other formats. However, when examining files, it can be distracting to have that compressed property show up as well. What's your default way to hide the compressed parts if we're already showing the decompressed parts?

wader commented 7 months ago

@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed?

Background: I added a d.FieldRawLen("compressed",…) to be consistent with other formats. However, when examining files, it can be distracting to have that compressed property show up as well. What's your default way to hide the compressed parts if we're already showing the decompressed parts?

I usually lean towards not hiding things, as that is kind of what fq is all about. In this case, if you did not add "compressed" fields, would those bit ranges show up as "gap" fields instead?

wader commented 7 months ago

The failing test is TestFormats/all/all.fqtest; it's probably just a matter of WRITE_ACTUAL=1 go test -run TestFormats/all/all.fqtest ./format. I've thought about redoing those tests... it would be nice if adding a format affected as little as possible outside the format's own directory.

mikez commented 7 months ago

@wader Also a question: what's your common way to hide "compressed" chunks when they're also decompressed? Background: I added a d.FieldRawLen("compressed",…) to be consistent with other formats. However, when examining files, it can be distracting to have that compressed property show up as well. What's your default way to hide the compressed parts if we're already showing the decompressed parts?

I usually lean towards not hiding things, as that is kind of what fq is all about. In this case, if you did not add "compressed" fields, would those bit ranges show up as "gap" fields instead?

Let me put it differently.

Say there is a chunk of data that is compressed with Snappy. I decompress this data and show the decompressed data structure under the key "uncompressed". However, in line with other formats, I also include a key called "compressed" which holds the original raw Snappy-compressed bits. When there are a lot of compressed sections like these (all of which have been decompressed and have a corresponding "uncompressed" key), the output can get quite unwieldy to skim. Hence the question: does it make sense to hide these "compressed" sections (or truncate them more aggressively in the preview) when there are already corresponding "uncompressed" sections?

wader commented 7 months ago

Let me put it differently.

Say there is a chunk of data that is compressed with Snappy. I decompress this data and show the decompressed data structure under the key "uncompressed". However, in line with other formats, I also include a key called "compressed" which holds the original raw Snappy-compressed bits. When there are a lot of compressed sections like these (all of which have been decompressed and have a corresponding "uncompressed" key), the output can get quite unwieldy to skim. Hence the question: does it make sense to hide these "compressed" sections (or truncate them more aggressively in the preview) when there are already corresponding "uncompressed" sections?

I see. As it currently works, if you didn't add "compressed" fields they would instead show up as "gap" fields that fq automatically inserts; this is so that all bits will always be "reachable" somehow. Maybe what you're looking for is the ability to add a "compressed" field but somehow tell fq that it should be displayed in a more discreet way unless in verbose mode etc.? I'm thinking totally hiding a field might be confusing, as it would look like there is a gap in the data/hex column. Or maybe I misunderstand what you're aiming for?

mikez commented 7 months ago

@wader Question about the LevelDB log format and how to decode it.

In essence the data structure is as follows:

  1. We have a sequence of records, each with some data. The user can iterate over these using the LevelDB API.
  2. In the .log-file itself, these records are split into many small pieces and put into blocks of 32KB. Each piece has a marker indicating whether it's an entire record ("full") or only a fragment ("first", "middle", "last").
  3. In the LevelDB app itself, these small pieces are put back together and the data structure is parsed.

I hope that's clear so far.

Thus my question: for the pieces which are preserved in full, it's easy to show the underlying data structure, since it hasn't been split. However, for the "first", "middle", and "last" pieces, I don't know how to visualize them. Is there some precedent here in the other formats?

mikez commented 7 months ago

@wader Something about those test failures I don't quite understand; it seems some aspect of LevelDB is leaking into the other tests involving FieldFormatBitBuf.

wader commented 7 months ago

@wader Question about the LevelDB log format and how to decode it.

In essence the data structure is as follows:

  1. We have a sequence of records, each with some data. The user can iterate over these using the LevelDB API.
  2. In the .log-file itself, these records are split into many small pieces and put into blocks of 32KB. Each piece has a marker indicating whether it's an entire record ("full") or only a fragment ("first", "middle", "last").
  3. In the LevelDB app itself, these small pieces are put back together and the data structure is parsed.

I hope that's clear so far.

Thus my question: for the pieces which are preserved in full, it's easy to show the underlying data structure, since it hasn't been split. However, for the "first", "middle", and "last" pieces, I don't know how to visualize them. Is there some precedent here in the other formats?

I think I would try to stay "true" to how the format works, so show the parts as an array of structs with a raw data field for the partial data etc. This is how most formats in fq that have to "demux" work, e.g. ogg, gzip and the tcp reassembly. This also makes it quite nice when dealing with broken files that still contain some good partial data: you can then use fq to concat parts, maybe prefix/append some missing data, and decode or output the "repaired" data.
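The demux idea discussed here can be sketched in plain Go. This is only an illustration of reassembling log record fragments, not fq's decoder API; the type constants mirror the LevelDB log format's record types and the helper names are made up:

```go
package main

import (
	"fmt"
	"strings"
)

// Record fragment types as used by the LevelDB log format.
const (
	full   = 1
	first  = 2
	middle = 3
	last   = 4
)

type fragment struct {
	typ  int
	data string
}

// reassemble concatenates "first"/"middle"/"last" fragments back into
// whole records; "full" fragments are complete records on their own.
func reassemble(frags []fragment) []string {
	var records []string
	var buf strings.Builder
	for _, f := range frags {
		switch f.typ {
		case full:
			records = append(records, f.data)
		case first:
			buf.Reset()
			buf.WriteString(f.data)
		case middle:
			buf.WriteString(f.data)
		case last:
			buf.WriteString(f.data)
			records = append(records, buf.String())
		}
	}
	return records
}

func main() {
	frags := []fragment{
		{full, "record-1"},
		{first, "rec"}, {middle, "ord"}, {last, "-2"},
	}
	// Yields two records: "record-1" and "record-2".
	fmt.Println(reassemble(frags))
}
```

Presenting the fragments themselves as an array of structs (as the other demuxing formats do) keeps the raw pieces queryable, so a user could do this kind of concatenation with a jq query.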

wader commented 7 months ago

@wader Something about those test failures I don't quite understand; it seems some aspect of LevelDB is leaking into the other tests involving FieldFormatBitBuf.

I think it might be the thing I mentioned about probe: the leveldb_description format succeeds when it shouldn't. So I guess this will get fixed when you remove it from probe.

mikez commented 7 months ago

Let me put it differently. Say there is a chunk of data that is compressed with Snappy. I decompress this data and show the decompressed data structure under the key "uncompressed". However, in line with other formats, I also include a key called "compressed" which holds the original raw Snappy-compressed bits. When there are a lot of compressed sections like these (all of which have been decompressed and have a corresponding "uncompressed" key), the output can get quite unwieldy to skim. Hence the question: does it make sense to hide these "compressed" sections (or truncate them more aggressively in the preview) when there are already corresponding "uncompressed" sections?

I see. As it currently works, if you didn't add "compressed" fields they would instead show up as "gap" fields that fq automatically inserts; this is so that all bits will always be "reachable" somehow. Maybe what you're looking for is the ability to add a "compressed" field but somehow tell fq that it should be displayed in a more discreet way unless in verbose mode etc.? I'm thinking totally hiding a field might be confusing, as it would look like there is a gap in the data/hex column. Or maybe I misunderstand what you're aiming for?

Yes, I hear you. I think I found a solution... to use d instead of dd. That nicely truncates the compressed parts into one line. Some of the regular strings might be truncated as well like that, but I could look them up manually as needed.

wader commented 7 months ago

Yes, I hear you. I think I found a solution... to use d instead of dd. That nicely truncates the compressed parts into one line. Some of the regular strings might be truncated as well like that, but I could look them up manually as needed.

Aha, that explains things :) I wonder if we could have separate options for the raw and string truncation limits? By "look up manually", do you mean using a query to access a string field?

mikez commented 7 months ago

Yes, I hear you. I think I found a solution... to use d instead of dd. That nicely truncates the compressed parts into one line. Some of the regular strings might be truncated as well like that, but I could look them up manually as needed.

Aha, that explains things :) I wonder if we could have separate options for the raw and string truncation limits? By "look up manually", do you mean using a query to access a string field?

Use a query, exactly! In my imaginary world, the entire thing would be an app like Hex Fiend or some debugger, and I could just hover over a value to see the full thing as a tooltip.

wader commented 7 months ago

Use a query, exactly! In my imaginary world, the entire thing would be an app like Hex Fiend or some debugger, and I could just hover over a value to see the full thing as a tooltip.

Yep, that would be very interesting. I can imagine some kind of "IDE" with multiple visual trees, data views and REPLs.

wader commented 7 months ago

Played around a bit with the leveldb table decoder locally and noticed this, looks a bit fishy?

➜  fq git:(leveldb) ✗ go run . -o line_bytes=8 -d leveldb_table '.metaindex | d' ~/Library/Application\ Support/Spotify/PersistentCache/public.ldb/000293.ldb
      │00 01 02 03 04 05 06 07│01234567│.metaindex{}:
      │                       │        │  uncompressed{}:
      │                       │        │    entries[0:1]:
      │                       │        │      [0]{}: entry
0x7a58│         00            │   .    │        shared_bytes: 0
0x7a58│            22         │    "   │        unshared_bytes: 34
0x7a58│               05      │     .  │        value_length: 5
      │                       │        │        internal_key{}:
0x7a58│                  66 69│      fi│          user_key: "filter.leveldb.BuiltinBloo"
0x7a60│6c 74 65 72 2e 6c 65 76│lter.lev│
0x7a68│65 6c 64 62 2e 42 75 69│eldb.Bui│
0x7a70│6c 74 69 6e 42 6c 6f 6f│ltinBloo│
0x7a78│6d                     │m       │          type: 0x6d
0x7a78│   46 69 6c 74 65 72 32│ Filter2│          sequence_number: 14199528906058054
      │                       │        │        value{}:
0x7a80│97 f2 01               │...     │          offset: 30999
0x7a80│         bf 02         │   ..   │          size: 319
      │                       │        │    trailer{}:
      │                       │        │      restarts[0:1]:
0x7a80│               00 00 00│     ...│        [0]: 0
0x7a88│00                     │.       │
0x7a88│   01 00 00 00         │ ....   │      num_restarts: 1
0x7a88│               00      │     .  │  compression: "none" (0x0)
0x7a88│                  71 a3│      q.│  checksum: 0x51caa371 (valid)
0x7a90│ca 51                  │.Q      │

Also noticed some strings seem to be length-prefixed, e.g. user_key: "\x11PlaylistSyncEvent#", but maybe that is just how the application saves its data?
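For what it's worth, the odd-looking type/sequence_number split in that output may come from treating the metaindex key as an internal key: in LevelDB, data- and index-block keys carry an 8-byte little-endian trailer packing (sequence << 8) | type after the user key, while metaindex keys (like "filter.leveldb.BuiltinBloomFilter2") are plain strings with no trailer. A sketch of the trailer split in plain Go (the helper name is made up):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// splitInternalKey splits a LevelDB internal key into its parts: the
// user key followed by an 8-byte little-endian trailer that packs
// (sequence << 8) | type.
func splitInternalKey(ik []byte) (userKey []byte, seq uint64, typ byte) {
	n := len(ik) - 8
	trailer := binary.LittleEndian.Uint64(ik[n:])
	return ik[:n], trailer >> 8, byte(trailer)
}

func main() {
	// Build an internal key for user key "k", sequence 7, type 1 (value).
	trailer := make([]byte, 8)
	binary.LittleEndian.PutUint64(trailer, 7<<8|1)
	ik := append([]byte("k"), trailer...)

	uk, seq, typ := splitInternalKey(ik)
	fmt.Println(string(uk), seq, typ) // k 7 1
}
```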

mikez commented 7 months ago

Played around a bit with the leveldb table decoder locally and noticed this, looks a bit fishy?

➜  fq git:(leveldb) ✗ go run . -o line_bytes=8 -d leveldb_table '.metaindex | d' ~/Library/Application\ Support/Spotify/PersistentCache/public.ldb/000293.ldb
      │00 01 02 03 04 05 06 07│01234567│.metaindex{}:
      │                       │        │  uncompressed{}:
      │                       │        │    entries[0:1]:
      │                       │        │      [0]{}: entry
0x7a58│         00            │   .    │        shared_bytes: 0
0x7a58│            22         │    "   │        unshared_bytes: 34
0x7a58│               05      │     .  │        value_length: 5
      │                       │        │        internal_key{}:
0x7a58│                  66 69│      fi│          user_key: "filter.leveldb.BuiltinBloo"
0x7a60│6c 74 65 72 2e 6c 65 76│lter.lev│
0x7a68│65 6c 64 62 2e 42 75 69│eldb.Bui│
0x7a70│6c 74 69 6e 42 6c 6f 6f│ltinBloo│
0x7a78│6d                     │m       │          type: 0x6d
0x7a78│   46 69 6c 74 65 72 32│ Filter2│          sequence_number: 14199528906058054
      │                       │        │        value{}:
0x7a80│97 f2 01               │...     │          offset: 30999
0x7a80│         bf 02         │   ..   │          size: 319
      │                       │        │    trailer{}:
      │                       │        │      restarts[0:1]:
0x7a80│               00 00 00│     ...│        [0]: 0
0x7a88│00                     │.       │
0x7a88│   01 00 00 00         │ ....   │      num_restarts: 1
0x7a88│               00      │     .  │  compression: "none" (0x0)
0x7a88│                  71 a3│      q.│  checksum: 0x51caa371 (valid)
0x7a90│ca 51                  │.Q      │

Also noticed some strings seem to be length-prefixed, e.g. user_key: "\x11PlaylistSyncEvent#", but maybe that is just how the application saves its data?

Great find! Two things:

  1. I need to check out the metaindex again. I falsely supposed it followed the same format as index, but it clearly doesn't.
  2. Yes, to "save space" common prefixes are truncated. The "shared" and "unshared" properties keep track of the bytes that need to be prefixed to an entry's key to form the entire key. In the view, I show what is actually stored in the database, not what the app would eventually reconstruct.
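
The shared/unshared reconstruction described in point 2 fits in a few lines of Go (the helper name is made up): each entry keeps the first shared bytes of the previous key and appends its own unshared bytes.

```go
package main

import "fmt"

// rebuildKey reconstructs a full key from the prefix-compressed form:
// keep the first `shared` bytes of the previous key, then append the
// entry's unshared bytes.
func rebuildKey(prevKey string, shared int, unshared string) string {
	return prevKey[:shared] + unshared
}

func main() {
	k1 := rebuildKey("", 0, "apple")  // first entry after a restart shares nothing
	k2 := rebuildKey(k1, 3, "ricot")  // "app" + "ricot"
	fmt.Println(k1, k2)               // apple appricot
}
```

Restart points in a block store entries with shared set to 0, so a reader can seek to a restart offset without replaying every preceding entry.
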
wader commented 7 months ago

Great find! Two things:

  1. I need to check out the metaindex again. I falsely supposed it followed the same format as index, but it clearly doesn't.
  2. Yes, to "save space" common prefixes are truncated. The "shared" and "unshared" properties keep track of the bytes that need to be prefixed to an entry's key to form the entire key. In the view, I show what is actually stored in the database, not what the app would eventually reconstruct.

I see. If you want, and if it's handy for a user, you can add "synthetic" fields that hold the "unshared" strings somehow, via d.FieldValue<type>("...", ...). Synthetic fields are fields that are not backed by any bit range, so they are just jq values. They can be quite handy for extra annotations or derived values that are common for the format.

mikez commented 7 months ago

d.FieldValue

I tried to search for "synthetic field" in the code, but couldn't find any examples, except for the flag in scalar.go. How would I implement this?

wader commented 7 months ago

d.FieldValue

I tried to search for "synthetic field" in the code, but couldn't find any examples, except for the flag in scalar.go. How would I implement this?

Hmm, strange. But it was quite recently that I made synthetic fields something more special; before that they were a hack using zero-length bit ranges, which might explain why "synthetic" is not used much in the code yet.

For example, the mp4 decoder adds track id and format here: https://github.com/wader/fq/blob/master/format/mp4/mp4.go#L280-L291

wader commented 7 months ago

Stress tested a bit more using:

find ~/ -name "*.ldb" -print0 | xargs -0 go run . -d leveldb_table '._error | select(.) | input_filename, .error'

Seems Chrome has a bunch of LevelDB files; some fail with variations of "UTF8(user_key): failed at position 127 (read size 0 seek pos 0): tryText nBytes must be >= 0 (-1)".
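One plausible source of an error like "nBytes must be >= 0" is a length field (LevelDB stores lengths as varints) whose value runs past the available buffer, leaving a negative remainder for a later read. This is not fq's API, but a defensive length-prefixed read in plain Go (helper name made up) might look like:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// readLenPrefixed reads a varint length and then that many bytes,
// rejecting lengths that run past the buffer instead of letting a
// later read go negative.
func readLenPrefixed(b []byte) (val []byte, rest []byte, err error) {
	n, w := binary.Uvarint(b)
	if w <= 0 {
		return nil, nil, errors.New("bad varint")
	}
	b = b[w:]
	if n > uint64(len(b)) {
		return nil, nil, errors.New("length past end of buffer")
	}
	return b[:n], b[n:], nil
}

func main() {
	val, rest, err := readLenPrefixed([]byte{3, 'a', 'b', 'c', 'x'})
	fmt.Println(string(val), string(rest), err) // abc x <nil>
}
```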

mikez commented 7 months ago

Stress tested a bit more using:

find ~/ -name "*.ldb" -print0 | xargs -0 go run . -d leveldb_table '._error | select(.) | input_filename, .error'

Seems Chrome has a bunch of LevelDB files; some fail with variations of "UTF8(user_key): failed at position 127 (read size 0 seek pos 0): tryText nBytes must be >= 0 (-1)".

I appreciate the stress-testing and the command. I'll take a look tomorrow!

wader commented 7 months ago

I appreciate the stress-testing and the command. I'll take a look tomorrow!

👍 No stress! I think the PR is in very good shape now; some more testing and decode-issue fixing and we should be ready to merge. Anything else you want to add?

mikez commented 7 months ago

@wader That one took me a while to track down and fix. :)

mikez commented 7 months ago

d.FieldValue

I tried to search for "synthetic field" in the code, but couldn't find any examples, except for the flag in scalar.go. How would I implement this?

Hmm, strange. But it was quite recently that I made synthetic fields something more special; before that they were a hack using zero-length bit ranges, which might explain why "synthetic" is not used much in the code yet.

For example, the mp4 decoder adds track id and format here: https://github.com/wader/fq/blob/master/format/mp4/mp4.go#L280-L291

Thank you. I searched for "synthetic"; had I searched for "FieldValue", then I would have found it faster. I wrongly assumed FieldValue was just a regular reader. :)

wader commented 7 months ago

Thank you. I searched for "synthetic"; had I searched for "FieldValue", then I would have found it faster. I wrongly assumed FieldValue was just a regular reader. :)

Aha, sorry, I should have been clearer :)

BTW, all *.ldb files in my home directory now decode fine and look beautiful!

wader commented 7 months ago

Great work! Just some tiny things left, I think. After that, just let me know when you think you're ready to merge.

Hope it was a good review experience too 😄

mikez commented 7 months ago

Great work! Just some tiny things left, I think. After that, just let me know when you think you're ready to merge.

Hope it was a good review experience too 😄

Thank you, @wader! I very much appreciate your reviewing style. I learned a lot about fq, Go, and LevelDB in the process. I hope it was a good review experience for you as well.

wader commented 7 months ago

Great work! Just some tiny things left, I think. After that, just let me know when you think you're ready to merge. Hope it was a good review experience too 😄

Thank you, @wader! I very much appreciate your reviewing style. I learned a lot about fq, Go, and LevelDB in the process. I hope it was a good review experience for you as well.

Happy to hear that! I can sometimes end up thinking that I might come across as "naggy" when there is a lot of back and forth. I'm not! Super happy someone wants to help out :) It's also a bit tricky to review things about the format itself, as I have very little experience with it; I try to mostly focus on helping it fit well into the rest of fq :)

wader commented 7 months ago

Played around a bit more with Chrome leveldbs; seems to work fine! It is a bit confusing to look at web storage databases, lots of weird stuff stored in them :)

Think I'm kind of ready to merge if you are.

mikez commented 7 months ago

@wader I'm ready! :)

wader commented 7 months ago

🥳