ipinfo / mmdbctl

mmdbctl is an MMDB file management CLI supporting various operations on MMDB database files.
Apache License 2.0

file size difference #23

Closed zuozhehao closed 10 months ago

zuozhehao commented 1 year ago

Exporting GeoLite2-City.mmdb (64.2 MB) produces city.json (4.53 GB). Importing city.json (4.53 GB) produces city.mmdb (287 MB).

Why is the re-imported file (city.mmdb, 287 MB) so much larger than the original (GeoLite2-City.mmdb, 64.2 MB)?

Am I missing any parameters?

UmanShahzad commented 1 year ago

@zuozhehao what commands were run and can you provide the 64.2MB file if possible for debugging?

But it's likely that some extra fields get added in - is the output of a read from both MMDBs exactly the same for most IPs? I'm thinking you've likely got a new network field added into the 287MB one, which can be fixed with --no-network.

zuozhehao commented 1 year ago

@UmanShahzad I did not modify the data. I only exported it to JSON and then imported that JSON back into an MMDB.

1. Downloaded GeoLite2-City.mmdb (64.2 MB) from https://www.maxmind.com/
2. Ran these commands:
./mmdbctl export --format json GeoLite2-City.mmdb city.json
./mmdbctl import --json --in city.json --out city.mmdb

zuozhehao commented 1 year ago

With --no-network, the output file city.mmdb is 69.5 MB. What is the extra 5 MB?
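
The drop from 287 MB to 69.5 MB fits the idea that a per-record network field defeats deduplication: a writer can store identical records once and point many networks at the same copy, but a field holding each network's own CIDR makes every record distinct. The following is a minimal Python sketch of that effect only, not mmdbctl's actual code. The record layout, the network values, and the `unique_records` helper are all hypothetical.

```python
import json

def unique_records(records):
    """Count distinct records, treating records with identical content as one."""
    # Serialize with sorted keys so field order does not affect equality.
    return len({json.dumps(r, sort_keys=True) for r in records})

# Hypothetical data: four networks that all map to the same location record.
networks = ["8.8.4.0/24", "8.8.8.0/24", "8.34.208.0/21", "8.35.192.0/21"]
location = {"country": {"iso_code": "US"}, "location": {"time_zone": "America/Chicago"}}

without_network = [dict(location) for _ in networks]
with_network = [dict(location, network=n) for n in networks]

print(unique_records(without_network))  # 1
print(unique_records(with_network))     # 4
```

Without the extra field, all four networks can share one stored record. With it, each network needs its own copy.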

UmanShahzad commented 1 year ago

@zuozhehao can you show an example output with mmdbctl read <ip> --json-pretty <mmdb> for each? And also mmdbctl metadata?

zuozhehao commented 1 year ago

@UmanShahzad This is the output from the newly imported database:

./mmdbctl read 8.8.8.8 --format json-pretty city.mmdb

{
  "continent": {
    "code": "NA",
    "geoname_id": 6255149,
    "names": {
      "de": "Nordamerika",
      "en": "North America",
      "es": "Norteamérica",
      "fr": "Amérique du Nord",
      "ja": "北アメリカ",
      "pt-BR": "América do Norte",
      "ru": "Северная Америка",
      "zh-CN": "北美洲"
    }
  },
  "country": {
    "geoname_id": 6252001,
    "iso_code": "US",
    "names": {
      "de": "Vereinigte Staaten",
      "en": "United States",
      "es": "Estados Unidos",
      "fr": "États Unis",
      "ja": "アメリカ",
      "pt-BR": "EUA",
      "ru": "США",
      "zh-CN": "美国"
    }
  },
  "location": {
    "accuracy_radius": 1000,
    "latitude": 37.751,
    "longitude": -97.822,
    "time_zone": "America/Chicago"
  },
  "registered_country": {
    "geoname_id": 6252001,
    "iso_code": "US",
    "names": {
      "de": "Vereinigte Staaten",
      "en": "United States",
      "es": "Estados Unidos",
      "fr": "États Unis",
      "ja": "アメリカ",
      "pt-BR": "EUA",
      "ru": "США",
      "zh-CN": "美国"
    }
  }
}

./mmdbctl read 8.8.8.8 --format json-pretty GeoLite2-City.mmdb

{
  "continent": {
    "code": "NA",
    "geoname_id": 6255149,
    "names": {
      "de": "Nordamerika",
      "en": "North America",
      "es": "Norteamérica",
      "fr": "Amérique du Nord",
      "ja": "北アメリカ",
      "pt-BR": "América do Norte",
      "ru": "Северная Америка",
      "zh-CN": "北美洲"
    }
  },
  "country": {
    "geoname_id": 6252001,
    "iso_code": "US",
    "names": {
      "de": "Vereinigte Staaten",
      "en": "United States",
      "es": "Estados Unidos",
      "fr": "États Unis",
      "ja": "アメリカ",
      "pt-BR": "EUA",
      "ru": "США",
      "zh-CN": "美国"
    }
  },
  "location": {
    "accuracy_radius": 1000,
    "latitude": 37.751,
    "longitude": -97.822,
    "time_zone": "America/Chicago"
  },
  "registered_country": {
    "geoname_id": 6252001,
    "iso_code": "US",
    "names": {
      "de": "Vereinigte Staaten",
      "en": "United States",
      "es": "Estados Unidos",
      "fr": "États Unis",
      "ja": "アメリカ",
      "pt-BR": "EUA",
      "ru": "США",
      "zh-CN": "美国"
    }
  }
}
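
Dumps like the two above are long enough that comparing them by eye is error-prone. A minimal sketch in Python, assuming each read output has already been parsed with `json.loads`. Here `diff_paths` is a hypothetical helper, not part of mmdbctl, and the sample records and `network` value are made up for illustration.

```python
def diff_paths(a, b, path=""):
    """Return the dotted paths at which two parsed JSON values differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            child = f"{path}.{key}" if path else key
            if key not in a or key not in b:
                # Field present on one side only.
                diffs.append(child)
            else:
                diffs.extend(diff_paths(a[key], b[key], child))
        return diffs
    # Leaves (and lists) are compared by plain equality.
    return [] if a == b else [path or "<root>"]

# Hypothetical records standing in for two `read` outputs.
original = {"country": {"iso_code": "US"}, "location": {"latitude": 37.751}}
reimported = {"country": {"iso_code": "US"}, "location": {"latitude": 37.751},
              "network": "8.8.8.0/24"}

print(diff_paths(original, original))    # []
print(diff_paths(original, reimported))  # ['network']
```

An empty list means the two records match field for field, as the two outputs above appear to.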
UmanShahzad commented 1 year ago

@zuozhehao could you please provide either the source MMDB file or the metadata output for both MMDBs? It seems I would have to sign up and go through a procedure to obtain the file myself.

The extra 5 MB could have many causes, but it is most likely due to differences in how the two MMDB writers (the Go one we use, and the one MaxMind used to produce the MMDB you downloaded) optimize the data section.

zuozhehao commented 1 year ago

@UmanShahzad The file exceeds the upload size limit, so I uploaded it to two online drives: send.cm (GEOLITE2-CITY.MMDB) and drive.google.com (GEOLITE2-CITY.MMDB).

UmanShahzad commented 1 year ago

# mmdb -> json -> mmdb
$ ipinfo mmdb export --format json GeoLite2-City.mmdb city.json
$ ipinfo mmdb import --json --no-network --size 28 --in city.json --out city.mmdb

$ ipinfo mmdb metadata city.mmdb
- Binary Format 2.0
- Database Type ipinfo city.mmdb
- IP Version    6
- Record Size   28
- Node Count    4755530
- Description
    en ipinfo city.mmdb
- Languages     en
- Build Epoch   1701062232

$ ipinfo mmdb metadata GeoLite2-City.mmdb
- Binary Format 2.0
- Database Type GeoLite2-City
- IP Version    6
- Record Size   28
- Node Count    4755629
- Description
    en GeoLite2City database
- Languages     de, en, es, fr, ja, pt-BR, ru, zh-CN
- Build Epoch   1700588285

Another attempt with --ignore-empty-values --disallow-reserved:

$ ipinfo mmdb import --json --no-network --size 28 --ignore-empty-values --disallow-reserved --in city.json --out city2.mmdb

$ ipinfo mmdb metadata city2.mmdb
- Binary Format 2.0
- Database Type ipinfo city2.mmdb
- IP Version    6
- Record Size   28
- Node Count    4755615
- Description
    en ipinfo city2.mmdb
- Languages     en
- Build Epoch   1701062800

$ ipinfo mmdb metadata GeoLite2-City.mmdb
- Binary Format 2.0
- Database Type GeoLite2-City
- IP Version    6
- Record Size   28
- Node Count    4755629
- Description
    en GeoLite2City database
- Languages     de, en, es, fr, ja, pt-BR, ru, zh-CN
- Build Epoch   1700588285

Now adding in --alias-6to4:

$ ipinfo mmdb import --json --no-network --size 28 --ignore-empty-values --disallow-reserved --alias-6to4 --in city.json --out city3.mmdb

$ ipinfo mmdb metadata city3.mmdb
- Binary Format 2.0
- Database Type ipinfo city3.mmdb
- IP Version    6
- Record Size   28
- Node Count    4755630
- Description
    en ipinfo city3.mmdb
- Languages     en
- Build Epoch   1701063705

$ ipinfo mmdb metadata GeoLite2-City.mmdb
- Binary Format 2.0
- Database Type GeoLite2-City
- IP Version    6
- Record Size   28
- Node Count    4755629
- Description
    en GeoLite2City database
- Languages     de, en, es, fr, ja, pt-BR, ru, zh-CN
- Build Epoch   1700588285

It seems that no matter which flag combination I use, the size doesn't really budge. The Go MMDB writer used by mmdbctl may simply be less efficient at deduplicating data than the writer used to produce the input MMDB.
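
The metadata above can be used to check where a size difference could come from. Per the MMDB format specification, each search tree node holds two records, so the tree occupies node_count × record_size × 2 / 8 bytes. A small Python sketch using the node counts and record size reported above (`tree_bytes` is a hypothetical helper name):

```python
def tree_bytes(node_count, record_size_bits):
    """Search tree size in bytes; each node holds two records."""
    return node_count * record_size_bits * 2 // 8

original = tree_bytes(4755629, 28)    # GeoLite2-City.mmdb
reimported = tree_bytes(4755630, 28)  # city.mmdb built with --size 28

print(original)               # 33289403
print(reimported - original)  # 7
```

At record size 28 the two search trees are about 33 million bytes each and differ by only 7 bytes, so any remaining size gap between these two files has to sit outside the search tree. That is consistent with the data-section explanation.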