ipinfo / mmdbctl

mmdbctl is an MMDB file management CLI supporting various operations on MMDB database files.
Apache License 2.0

Low-level data about MMDB #25

Closed svbatalov closed 3 months ago

svbatalov commented 11 months ago

Hey @UmanShahzad.

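To make mmdbctl even more awesome, it would be great to be able to display some low-level data about an MMDB file, such as section offsets and sizes (for example, the data section offset and size, and the metadata section offset).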

This is helpful, for example, if you want to inspect the actual data section (with hexdump), or if you want to estimate how much the tree and data sections each contribute to the file size.

A simple example: let's say we want to find out whether the actual MMDB writer deduplicates written objects (replaces repeats with pointers) or not. I'll use my MMDB parser to display the above-mentioned offsets.

```sh
$ python3 ./parser.py test.mmdb
Namespace(file='test.mmdb', meta=False, data=None, ip=None)
Data section offset 1096 (data starts at 1112)  # <===
Metadata section offset: 1146 (metadata starts at 1160)
Data section size 34 bytes (3.4e-05 MB)   # <===
Record size: 32
Node count: 137
Tree size: 1096 (bytes)
ip_version: 6
First data record at 153 pointer
```

Knowing the offset and size, we can inspect a specific portion of the file:

```sh
$ hd -s 1112 -n 34 test.mmdb
00000458  e1 45 76 61 6c 75 65 e1  43 63 6f 6c 47 6e 65 73  |.Evalue.CcolGnes|
00000468  74 65 64 31 e1 20 01 e1  20 08 47 6e 65 73 74 65  |ted1. .. .Gneste|
00000478  64 32                                             |d2|
0000047a
```

* Case 2 -- write duplicate objects:
```sh
$ echo -e '{"range":"1.0.0.0/24","value":{"col":"nested1"}}\n{"range":"2.0.0.0/24", "value":{"col":"nested1"}}' | mmdbctl import --no-network -j -o test.mmdb
writing to test.mmdb (2 entries)

$ python3 ./parser.py  test.mmdb
Namespace(file='test.mmdb', meta=False, data=None, ip=None)
Data section offset 1096 (data starts at 1112)  # <===
Metadata section offset: 1132 (metadata starts at 1146)
Data section size 20 bytes (2e-05 MB)   # <===
Record size: 32
Node count: 137
Tree size: 1096 (bytes)
ip_version: 6
First data record at 153 pointer

$ hd -s 1112 -n 20 test.mmdb
00000458  e1 45 76 61 6c 75 65 e1  43 63 6f 6c 47 6e 65 73  |.Evalue.CcolGnes|
00000468  74 65 64 31                                       |ted1|   # Note: the whole second object is gone and the tree points directly to the first one
0000046c
```

So it does deduplicate objects. Looks like it even deduplicates nested objects, which is great.
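To see what the deduplication looks like at the byte level, the 34-byte data section from the first dump can be decoded by hand. The following is only a minimal sketch, not the parser used above: `decode` is a hypothetical helper that handles just the three MMDB field types appearing in that dump (map, UTF-8 string, and 11-bit pointer) and assumes payload sizes below 29.

```python
# The 34 bytes shown in the first hexdump (data section starting at offset 1112).
DATA = bytes.fromhex(
    "e1 45 76 61 6c 75 65 e1 43 63 6f 6c 47 6e 65 73 "
    "74 65 64 31 e1 20 01 e1 20 08 47 6e 65 73 74 65 "
    "64 32"
)


def decode(buf, pos):
    """Decode one data field at buf[pos]; return (value, next_pos).

    Hypothetical helper: supports only maps (type 7), UTF-8 strings (type 2)
    and the smallest pointer encoding (type 1, 11-bit value).
    """
    ctrl = buf[pos]
    field_type = ctrl >> 5          # top 3 bits of the control byte
    pos += 1
    if field_type == 1:             # pointer: 001SSVVV followed by more bytes
        if (ctrl >> 3) & 0x3 != 0:
            raise NotImplementedError("only 11-bit pointers are handled here")
        target = ((ctrl & 0x7) << 8) | buf[pos]   # offset within the data section
        value, _ = decode(buf, target)
        return value, pos + 1
    size = ctrl & 0x1F              # low 5 bits: payload size (or entry count)
    if size >= 29:
        raise NotImplementedError("extended sizes are not handled here")
    if field_type == 2:             # UTF-8 string
        return buf[pos:pos + size].decode("utf-8"), pos + size
    if field_type == 7:             # map with `size` key/value pairs
        result = {}
        for _ in range(size):
            key, pos = decode(buf, pos)
            val, pos = decode(buf, pos)
            result[key] = val
        return result, pos
    raise NotImplementedError(f"type {field_type} is not handled here")


first, nxt = decode(DATA, 0)
second, end = decode(DATA, nxt)
print(first, nxt)    # {'value': {'col': 'nested1'}} 20
print(second, end)   # {'value': {'col': 'nested2'}} 34
```

In the second record, the keys `value` and `col` are not stored again: the bytes `20 01` and `20 08` are pointers to offsets 1 and 8 of the data section, where those strings were first written.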

The point is it is really convenient to know those offsets when doing stuff like this.

Not sure if the Go MMDB reader exposes this data, but it should be easy to find the section separators (see the spec) even without parsing the file, e.g. by mmap-ing the file and using string search functions: https://github.com/svbatalov/construct_mmdb_parser/blob/11b13ef946b7d85cec4e21a538af49b5b44f22a1/parser.py#L13-L19
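A rough sketch of that search-based approach, not taken from the linked parser: the function names are made up for illustration, and the two byte patterns are the ones described in the MMDB spec (16 zero bytes after the search tree, and the 14-byte `\xab\xcd\xefMaxMind.com` marker before the metadata). Taking the first run of 16 zero bytes is a heuristic; if the tree itself ends in zero bytes the match would start too early.

```python
import mmap

METADATA_MARKER = b"\xab\xcd\xefMaxMind.com"   # 14 bytes, precedes the metadata
DATA_SEPARATOR = b"\x00" * 16                  # sits between tree and data section


def section_offsets(buf):
    """Locate MMDB sections in a bytes-like object using plain byte search."""
    # The last occurrence of the marker is the one that starts the metadata.
    marker_at = buf.rfind(METADATA_MARKER)
    if marker_at == -1:
        raise ValueError("metadata marker not found")
    # Heuristic: the first run of 16 zero bytes is assumed to be the separator.
    sep_at = buf.find(DATA_SEPARATOR, 0, marker_at)
    if sep_at == -1:
        raise ValueError("data section separator not found")
    data_start = sep_at + len(DATA_SEPARATOR)
    return {
        "tree_size": sep_at,
        "data_start": data_start,
        "data_size": marker_at - data_start,
        "metadata_start": marker_at + len(METADATA_MARKER),
    }


def section_offsets_from_file(path):
    """Same as section_offsets, but memory-maps the file instead of reading it."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return section_offsets(mm)
```

A more robust variant would parse the metadata first and derive the tree size from the node count and record size rather than searching for the separator.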

Thanks, Sergey

UmanShahzad commented 11 months ago

Great feedback and thanks for those feature requests @svbatalov !

The data's definitely going to be available within the MMDB library. I'll check whether it's exposed; if not, we could try to get a PR merged to expose it and/or temporarily use a fork.

We can add this data to the mmdbctl metadata output - is that the ideal place to expose it for you @svbatalov ?

cc @coderholic

svbatalov commented 11 months ago

@UmanShahzad Yeah, sounds great!

max-ipinfo commented 3 months ago

The metadata has been included. Closing issue:

```sh
$ mmdbctl metadata ip_geolocation_sample.mmdb
- Binary Format 2.0
- Database Type ipinfo ip_geolocation_sample.mmdb
- IP Version    6
- Record Size   32
- Node Count    2927 (2.86 KB)
- Tree Size     23416 (22.87 KB)
- Data Section Size 10790 (10.54 KB)
- Data Section Start Offset 23432 (22.88 KB)
- Data Section End Offset 34222 (33.42 KB)
- Metadata Section Start Offset 34236 (33.43 KB)
- Description
    en ipinfo ip_geolocation_sample.mmdb
- Languages     en
- Build Epoch   1722965173
```
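For reference, the offsets in this output follow from the node count, record size and data section size. A small sketch of the arithmetic, assuming the layout from the MMDB spec (a 16-byte separator after the tree and a 14-byte marker before the metadata); `layout` is a hypothetical helper, not part of mmdbctl:

```python
SEPARATOR_LEN = 16   # zero bytes between the search tree and the data section
MARKER_LEN = 14      # length of b"\xab\xcd\xefMaxMind.com"


def layout(node_count, record_size_bits, data_size):
    """Derive section offsets (in bytes) from the basic metadata values."""
    # Each node holds two records of record_size_bits bits each.
    tree_size = node_count * record_size_bits * 2 // 8
    data_start = tree_size + SEPARATOR_LEN
    data_end = data_start + data_size
    return {
        "tree_size": tree_size,
        "data_start": data_start,
        "data_end": data_end,
        "metadata_start": data_end + MARKER_LEN,
    }


# Values taken from the mmdbctl output above.
sample = layout(node_count=2927, record_size_bits=32, data_size=10790)
print(sample)
# {'tree_size': 23416, 'data_start': 23432, 'data_end': 34222, 'metadata_start': 34236}
```

The same formula reproduces the numbers from the first test file in this thread: 137 nodes at 32 bits give a 1096-byte tree, with data starting at 1112.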