svbatalov commented 7 months ago

Hey @UmanShahzad.

To make mmdbctl even more awesome, it would be great to be able to display some low-level data about an MMDB file, such as:

* Tree size in bytes
* Data section start/end offsets
* Data section size in bytes
* Metadata section start offset

This is helpful, for example, if you want to inspect the actual data section with hexdump, or if you want to estimate how much the tree and data sections each contribute to the file size.

A simple example: let's say we want to find out whether the actual MMDB writer deduplicates written objects (replaces them with pointers) or not. I'll use my MMDB parser to display the offsets mentioned above.

* Case 1 -- write two different objects:
```sh
$ echo -e '{"range":"1.0.0.0/24","value":{"col":"nested1"}}\n{"range":"2.0.0.0/24", "value":{"col":"nested2"}}' | mmdbctl import --no-network -j -o test.mmdb
writing to test.mmdb (2 entries)

$ python3 ./parser.py  test.mmdb
Namespace(file='test.mmdb', meta=False, data=None, ip=None)
Data section offset 1096 (data starts at 1112)  # <===
Metadata section offset: 1146 (metadata starts at 1160)
Data section size 34 bytes (3.4e-05 MB)   # <===
Record size: 32
Node count: 137
Tree size: 1096 (bytes)
ip_version: 6
First data record at 153 pointer
```

Knowing the offset and size, we can inspect a specific portion of the file:

```sh
$ hd -s 1112 -n 34 test.mmdb
00000458  e1 45 76 61 6c 75 65 e1  43 63 6f 6c 47 6e 65 73  |.Evalue.CcolGnes|
00000468  74 65 64 31 e1 20 01 e1  20 08 47 6e 65 73 74 65  |ted1. .. .Gneste|
00000478  64 32                                             |d2|
0000047a
```

* Case 2 -- write duplicate objects:
```sh
$ echo -e '{"range":"1.0.0.0/24","value":{"col":"nested1"}}\n{"range":"2.0.0.0/24", "value":{"col":"nested1"}}' | mmdbctl import --no-network -j -o test.mmdb
writing to test.mmdb (2 entries)

$ python3 ./parser.py  test.mmdb
Namespace(file='test.mmdb', meta=False, data=None, ip=None)
Data section offset 1096 (data starts at 1112)  # <===
Metadata section offset: 1132 (metadata starts at 1146)
Data section size 20 bytes (2e-05 MB)   # <===
Record size: 32
Node count: 137
Tree size: 1096 (bytes)
ip_version: 6
First data record at 153 pointer

$ hd -s 1112 -n 20 test.mmdb
00000458  e1 45 76 61 6c 75 65 e1  43 63 6f 6c 47 6e 65 73  |.Evalue.CcolGnes|
00000468  74 65 64 31                                       |ted1|   # Note: the whole second object is gone, and the tree points directly to the first one
0000046c
```

So it does deduplicate objects. Looks like it even deduplicates nested objects, which is great.
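
The pointers are also visible in the Case 1 dump if you decode it by hand. Below is a minimal sketch of that, not the parser used above: `decode` is a hypothetical helper that only handles the three types present in the dump (maps, UTF-8 strings, and the shortest 11-bit pointer form), following the control-byte layout described in the MMDB spec.

```python
# Data section bytes copied from the Case 1 hexdump
# (34 bytes starting at file offset 1112).
CASE1 = bytes.fromhex(
    "e1 45 76 61 6c 75 65 e1 43 63 6f 6c 47 6e 65 73"
    "74 65 64 31 e1 20 01 e1 20 08 47 6e 65 73 74 65"
    "64 32"
)

TYPE_POINTER, TYPE_STRING, TYPE_MAP = 1, 2, 7


def decode(buf, pos=0):
    """Decode one value starting at buf[pos]; return (value, next_pos).

    Hypothetical helper for illustration only: it supports maps, UTF-8
    strings, and 11-bit pointers, with payload sizes below 29.
    """
    ctrl = buf[pos]
    kind, low = ctrl >> 5, ctrl & 0x1F  # top 3 bits: type, low 5 bits: size
    pos += 1

    if kind == TYPE_POINTER:
        if low >> 3 != 0:
            raise NotImplementedError("only the 11-bit pointer form is handled")
        # 11-bit pointer: last three control bits followed by the next byte.
        # The value is an offset from the start of the data section.
        target = ((low & 0x07) << 8) | buf[pos]
        value, _ = decode(buf, target)
        return value, pos + 1  # continue right after the pointer itself

    if low >= 29:
        raise NotImplementedError("extended sizes are not handled")

    if kind == TYPE_STRING:
        return buf[pos:pos + low].decode("utf-8"), pos + low

    if kind == TYPE_MAP:
        result = {}
        for _ in range(low):  # for maps, the size is the number of entries
            key, pos = decode(buf, pos)
            value, pos = decode(buf, pos)
            result[key] = value
        return result, pos

    raise NotImplementedError(f"type {kind} is not handled")


first, next_pos = decode(CASE1, 0)
second, end_pos = decode(CASE1, next_pos)
print(first)   # {'value': {'col': 'nested1'}}
print(second)  # {'value': {'col': 'nested2'}}
```

In the second record both keys are pointers (`20 01` to `value` at offset 1, `20 08` to `col` at offset 8), so only the string `nested2` is written again.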

The point is it is really convenient to know those offsets when doing stuff like this.

Not sure if the Go MMDB reader exposes this data, but it should be easy to find the section separators (see the spec) even without parsing the file, e.g. by mmap-ing the file and using string search functions: https://github.com/svbatalov/construct_mmdb_parser/blob/11b13ef946b7d85cec4e21a538af49b5b44f22a1/parser.py#L13-L19
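
A rough sketch of that search, written from the spec rather than copied from the linked parser. The names `section_offsets`, `write_fake_mmdb`, and `demo` are hypothetical, and the demo file is synthetic filler laid out like the Case 1 file, not a real database. One caveat: a plain search for 16 zero bytes can match too early if the tree itself contains (or ends with) zero bytes, so a more robust tool would compute the tree size from the metadata as `node_count * record_size * 2 / 8` (here 137 * 32 * 2 / 8 = 1096).

```python
import mmap
import os
import tempfile

# Both byte sequences are defined by the MMDB spec.
METADATA_MARKER = b"\xab\xcd\xefMaxMind.com"  # metadata begins right after this
DATA_SEPARATOR = b"\x00" * 16                 # sits between tree and data section


def section_offsets(path):
    """Locate the MMDB sections by searching for the two separators."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # The spec says the LAST occurrence of the marker is the real one.
        marker = mm.rfind(METADATA_MARKER)
        if marker < 0:
            raise ValueError("metadata marker not found")
        # Naive: assumes the first run of 16 zero bytes is the separator.
        sep = mm.find(DATA_SEPARATOR, 0, marker)
        if sep < 0:
            raise ValueError("data section separator not found")
        data_start = sep + len(DATA_SEPARATOR)
        return {
            "tree_size": sep,
            "data_start": data_start,
            "data_size": marker - data_start,
            "metadata_marker": marker,
            "metadata_start": marker + len(METADATA_MARKER),
        }


def write_fake_mmdb(path, tree_size=1096, data_size=34):
    """Write a synthetic file with the MMDB section layout (filler bytes only)."""
    with open(path, "wb") as f:
        f.write(b"\xff" * tree_size)   # placeholder search tree, no zero bytes
        f.write(DATA_SEPARATOR)
        f.write(b"\xe1" * data_size)   # placeholder data section
        f.write(METADATA_MARKER)
        f.write(b"\xe0")               # placeholder metadata (an empty map)


def demo():
    """Build the synthetic file in a temp dir and return its offsets."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "fake.mmdb")
        write_fake_mmdb(path)
        return section_offsets(path)


print(demo())
```

With the default sizes the synthetic file reproduces the Case 1 numbers: tree size 1096, data starting at 1112, 34 bytes of data, marker at 1146, metadata starting at 1160.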

Thanks, Sergey

UmanShahzad commented 7 months ago

Great feedback and thanks for those feature requests @svbatalov !

The data's definitely gonna be available within the MMDB library, will check if it's exposed or not, and we could try to get a PR merged to expose it if not and/or temporarily use a fork.

We can add this data to the mmdbctl metadata output - is that the ideal place to expose it for you @svbatalov ?

cc @coderholic

svbatalov commented 7 months ago

@UmanShahzad Yeah, sounds great!

ipinfo / mmdbctl

Low-level data about MMDB #25
