svaningelgem / location_ipfire_db_reader

MIT License

Not an issue but can you take a look? #6

Closed MaurUppi closed 4 months ago

MaurUppi commented 8 months ago

You've written a very handy library for reading IPFire's location.db, so I used it, plus another HTTP API, as two references for cross-checking some source data.

The source data to be checked is dbip-country-lite-2024-01.csv. The references are IPFire's location.db and the ip-api.com batch query API (100 IPs per request, 15s intervals).

Long story short, here is the result.

```
Total number of IPv4 records in the data source: 308422
Sampling size: 384
Processed 384/384 IPs
Accuracy (Cross-checked w/IPFire location.db): 60.677083333333336%
Processed 100/384 IPs
Processed 200/384 IPs
Processed 300/384 IPs
Processed 384/384 IPs
Accuracy (Cross-checked w/IP-API.com HTTP-API): 88.54166666666666%
```
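As a sanity check on the sampling size: 384 is exactly what Cochran's formula with a finite-population correction yields for N = 308422 at 95% confidence and a 5% margin of error (the same computation the script's calculate_sample_size performs below). A minimal recomputation:

```python
import math

def cochran_sample_size(total, z=1.96, margin_error=0.05, p=0.5):
    # Cochran's formula for an unbounded population...
    n0 = (z ** 2) * p * (1 - p) / margin_error ** 2
    # ...then the finite-population correction for `total` records.
    n = n0 / (1 + (n0 - 1) / total)
    return math.ceil(n)

print(cochran_sample_size(308422))  # -> 384
```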

Assuming the two references map CIDR ranges to country codes with high confidence/accuracy, the initial conclusions could be:

  1. the IP-API.com data looks highly aligned with the dbip-country db.
  2. in contrast, about 40% of location.db's records disagree with the dbip-country db.

So, is one of the references a liar?

The Python code is below. If you have a free moment, please take a look and check whether there are any logical faults.


```python
import math
import time
import pandas as pd
import random
import ipaddress
import os
from location_ipfire_db_reader import LocationDatabase
from location_ipfire_db_reader.exceptions import IPAddressError
import requests

source_file_path = os.getenv('SOURCE_FILE_PATH', 'D:\\Dev\\GeoLocationData\\GeoDB_merger\\source\\dbip-country-lite-2024-01.csv')
log_file_path_ipfire = os.getenv('LOG_FILE_PATH_IPFIRE', 'D:\\Dev\\GeoLocationData\\GeoDB_merger\\log\\random_pickIPaddr_checkwith_IPFireDB.log')
log_file_path_IP_API = os.getenv('LOG_file_path_IP_API', 'D:\\Dev\\GeoLocationData\\GeoDB_merger\\log\\random_pickIPaddr_checkwith_IP-API.log')
requests_per_minute = int(os.getenv('REQUESTS_PER_MINUTE', '15'))


def is_ipv4(address):
    try:
        ipaddress.IPv4Address(address)
        return True
    except ipaddress.AddressValueError:
        return False


def calculate_sample_size(total, confidence_level, margin_error, p=0.5):
    if not isinstance(total, int) or total <= 0:
        raise ValueError("Total must be a positive integer")
    if confidence_level not in {90, 95, 99}:
        raise ValueError("Confidence level must be 90, 95, or 99.")
    if not isinstance(margin_error, (int, float)) or margin_error <= 0:
        raise ValueError("Margin error must be a positive number.")
    if not isinstance(p, (int, float)) or not (0 <= p <= 1):
        raise ValueError("Expected population proportion 'p' must be a number between 0 and 1.")
    z_dict = {90: 1.645, 95: 1.96, 99: 2.576}
    z = z_dict[confidence_level]
    sample_size = ((z ** 2) * p * (1 - p)) / (margin_error ** 2)
    sample_size = sample_size / (1 + ((sample_size - 1) / total))
    # Return the sample size, rounded up to the nearest integer
    return math.ceil(sample_size)


def calculate_accuracy(log_file_path):
    match_count = 0
    total_count = 0
    with open(log_file_path, 'r') as file:
        for line in file:
            total_count += 1
            if ', O' in line:
                match_count += 1
    return (match_count / total_count) * 100 if total_count > 0 else 0


def ReadSourceCSVfile(csv_file_path):
    data = []
    with open(csv_file_path, 'r') as file:
        for line in file:
            start_ip, end_ip, country = line.strip().split(',')
            if is_ipv4(start_ip) and is_ipv4(end_ip):
                data.append({
                    'start_ip': start_ip,
                    'end_ip': end_ip,
                    'country': country,
                    'start_ip_int': int(ipaddress.IPv4Address(start_ip)),
                    'end_ip_int': int(ipaddress.IPv4Address(end_ip)),
                })
    return pd.DataFrame(data)


def generate_random_ips(df, sample_size, batch_size=100):
    all_ips = []
    for _ in range(sample_size):
        random_row = df.sample().iloc[0]
        random_ip_int = random.randint(random_row['start_ip_int'], random_row['end_ip_int'])
        all_ips.append(str(ipaddress.IPv4Address(random_ip_int)))
    return all_ips


def check_country_with_ipfire_db(df, ips, log_file_path, db):
    match_count = 0
    processed = 0
    with open(log_file_path, 'w') as log_file:
        for ip in ips:
            processed += 1
            ip_int = int(ipaddress.IPv4Address(ip))
            matching_rows = df[(df['start_ip_int'] <= ip_int) & (df['end_ip_int'] >= ip_int)]
            expected_country_code = matching_rows.iloc[0]['country'] if not matching_rows.empty else 'N/A'
            try:
                actual_country_code = db.find_country(ip)
            except IPAddressError:
                actual_country_code = 'N/A'
            match = 'O' if expected_country_code == actual_country_code else 'X'
            if match == 'O':
                match_count += 1
            log_file.write(f"{ip}, {expected_country_code}, {actual_country_code}, {match}\n")
            log_file.flush()
    print(f"Processed {processed}/{len(ips)} IPs")
    return match_count, processed


def check_country_with_ip_api(df, ips, log_file_path, batch_size=100):
    match_count = 0
    processed = 0
    with open(log_file_path, 'w') as log_file:
        for batch_start in range(0, len(ips), batch_size):
            batch_ips = ips[batch_start:batch_start + batch_size]
            response = requests.post('http://ip-api.com/batch?fields=status,countryCode,query', json=batch_ips)
            if response.status_code != 200:
                print(f"Request failed with status code: {response.status_code}")
                continue
            rate_limit_remaining = response.headers.get('X-Rl')
            rate_limit_reset = response.headers.get('X-Ttl')
            if rate_limit_remaining is not None:
                rate_limit_remaining = int(rate_limit_remaining)
            else:
                print("Warning: 'X-Rl' header is missing. Defaulting to 0.")
                rate_limit_remaining = 0
            if rate_limit_reset is not None:
                rate_limit_reset = int(rate_limit_reset)
            else:
                print("Warning: 'X-Ttl' header is missing. Defaulting to 60.")
                rate_limit_reset = 60
            if rate_limit_remaining == 0:
                time.sleep(rate_limit_reset)
            else:
                # Calculate the delay needed to not exceed 15 requests per minute
                time.sleep(60 / requests_per_minute)
            try:
                batch_results = response.json()
            except ValueError as e:
                print(f"Error parsing JSON response: {e}")
                continue
            for ip, res in zip(batch_ips, batch_results):
                processed += 1
                if res['status'] == 'success':
                    ip_int = int(ipaddress.IPv4Address(ip))
                    matching_rows = df[(df['start_ip_int'] <= ip_int) & (df['end_ip_int'] >= ip_int)]
                    expected_country_code = matching_rows.iloc[0]['country'] if not matching_rows.empty else 'N/A'
                    actual_country_code = res.get('countryCode', 'N/A')
                    match = 'O' if expected_country_code == actual_country_code else 'X'
                    if match == 'O':
                        match_count += 1
                else:
                    expected_country_code = 'N/A'
                    actual_country_code = 'N/A'
                    match = 'X'
                log_file.write(f"{ip}, {expected_country_code}, {actual_country_code}, {match}\n")
                log_file.flush()
            print(f"Processed {processed}/{len(ips)} IPs")
    return match_count, processed


def main():
    # Read the CSV file and calculate the sample size
    df_ipv4 = ReadSourceCSVfile(source_file_path)
    total_ipv4_rows = len(df_ipv4)
    print(f"Total number of IPv4 records in the data source: {total_ipv4_rows}")
    confidence_level = 95
    margin_error = 0.05
    sample_size = calculate_sample_size(total_ipv4_rows, confidence_level, margin_error)
    print(f"Sampling size: {sample_size}")
    random_ips = generate_random_ips(df_ipv4, sample_size)
    # Initialize LocationDatabase instance
    db = LocationDatabase('location.db')
    # Check countries with IPFire DB
    check_country_with_ipfire_db(df_ipv4, random_ips, log_file_path_ipfire, db)
    accuracy_ipfire = calculate_accuracy(log_file_path_ipfire)
    print(f"Accuracy (Cross-checked w/IPFire location.db): {accuracy_ipfire}%")
    # Check countries with IP-API
    check_country_with_ip_api(df_ipv4, random_ips, log_file_path_IP_API)
    accuracy_ip_api = calculate_accuracy(log_file_path_IP_API)
    print(f"Accuracy (Cross-checked w/IP-API.com HTTP-API): {accuracy_ip_api}%")


if __name__ == "__main__":
    main()
```
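One subtle logical fault worth flagging in the script above: calculate_accuracy counts any line containing the substring ', O', so a log row whose expected country code starts with 'O' (e.g. 'OM' for Oman) is counted as a match even when the marker is 'X', slightly inflating both accuracy figures. Parsing the final marker field explicitly avoids this; a minimal corrected sketch:

```python
def calculate_accuracy(log_file_path):
    match_count = 0
    total_count = 0
    with open(log_file_path) as file:
        for line in file:
            total_count += 1
            # Each log line is "ip, expected, actual, marker"; compare the
            # last comma-separated field instead of substring matching,
            # so country codes like "OM" can't be miscounted as matches.
            if line.strip().rsplit(',', 1)[-1].strip() == 'O':
                match_count += 1
    return (match_count / total_count) * 100 if total_count > 0 else 0
```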

The log output files.

random_pickIPaddr_checkwith_IPFireDB.log random_pickIPaddr_checkwith_IP-API.log

svaningelgem commented 8 months ago

Ok, found the ip-country database [Just need to check the agree button & download the lite database].

Based on your first log file, I think there is something wrong with the db-ip data: [screenshots]

As you can see for that IP, everyone says it's US, except for db-ip. For 80.49.126.115, most of them say it's Poland, except for db-ip and ipinfo.io, which say it is Spain.

Based on the 2nd one: 108.179.68.44: lots of disagreement with this one... it could be Canada, Germany, Italy, or the US. 109.233.32.146: most of them say it's US, but ipapi says it is UK.

I guess all these databases are maintained by volunteers, and hence are not really in sync with reality? I have no opinion whatsoever which one is "best"... Could even be that ipfire is the better one, and all others are liars (I doubt it though).

I know that IPfire has a daily update procedure: https://git.ipfire.org/?p=location/location-database.git;a=summary

And this information comes straight from the NICs themselves as far as I can deduce: https://git.ipfire.org/?p=location/libloc.git;a=blob;f=src/python/location/importer.py;h=f391e03018f352822c0176ae1dfb8b367443ee84;hb=HEAD

I would think that the other organizations also have something similar in place, no?

So to circle back to the original question: what is the accuracy to determine the correct country code? Depends on which one you set as being the "most truthful" I guess 😎. I like open source because I can see how it (in this case the datasource) is constructed. And because of that, I personally would give ipfire a slightly higher trust-rating than the others, but that's just me.

MaurUppi commented 8 months ago

Wow... it looks like you're very familiar with (and using?) those GeoIP service providers.

BTW, did you develop a portal/web UI or something like that? If not, how can you take those screenshots so easily?

> I guess all these databases are maintained by volunteers, and hence are not really in sync with reality?

I don't think so; the service providers definitely do something themselves and sell commercial licenses.

> I know that IPfire has a daily update procedure: https://git.ipfire.org/?p=location/location-database.git;a=summary

BTW, where is the location.db stored? It's not in the Python script folder that I'm executing from. (Edited: found it.)

> So to circle back to the original question: what is the accuracy to determine the correct country code? Depends on which one you set as being the "most truthful" I guess 😎. I like open source because I can see how it (in this case the datasource) is constructed. And because of that, I personally would give ipfire a slightly higher trust-rating than the others, but that's just me.

So... I'm thinking about how to convert IPFire's location.db into a CSV-format file, e.g.:

network, countryCode, ASN#, ASNorgName, is_anonymous_proxy,is_satellite_provider,is_anycast,is_drop
189.195.134.0/24,MX,262916,,,,,

Maybe your library can help to do that, or part of it?
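(For what it's worth, writing that CSV shape is the easy part with the stdlib csv module. The rows below are a hypothetical hand-made sample matching the line above; actually extracting such rows from location.db is the open question in this thread.)

```python
import csv
import io

# Hypothetical dump rows; in practice these would have to be pulled
# out of location.db somehow.
rows = [("189.195.134.0/24", "MX", 262916)]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["network", "countryCode", "ASN#", "ASNorgName",
                 "is_anonymous_proxy", "is_satellite_provider",
                 "is_anycast", "is_drop"])
for network, cc, asn in rows:
    # Flag columns left empty, matching the sample row above.
    writer.writerow([network, cc, asn, "", "", "", "", ""])

print(buf.getvalue())
```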

svaningelgem commented 8 months ago
  1. screenshot: Greenshot. If you like it (like I do), donate (like I did) 😁

  2. location.db location depends on what the input is that you gave to your constructor:

    db = LocationDatabase('location.db')

So this means that you will find it in the current working directory (the directory you start your program from).

  3. I created some all_* methods within LocationDatabase. Maybe they could help you? However, the ipfire database is built for fast lookup, and is not really suitable for mass retrieval... Let me try to explain:
    • you have an IPv6 address (IPv4 is a sub-range of IPv6). This is changed into a bitmask. (Within the library, this is a bitstring, which is less efficient but more readable for debugging.)
    • for every bit in this bitmask/string you check the node (starting from header.network_tree_offset). Every node has a "0" and "1" key, where each refers to another location in the file. (see database_reader.py line 136)
    • then when you arrive at a leaf (network value == 0xFFFFFFFF), go back until you find a non-leaf, and that's the root of your IP (and the rest is the subnet mask)

So all this is saying: it's possible to retrieve it all back, but it uses a lot of memory...
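A toy sketch of the traversal described above, using nested dicts instead of the real on-disk node format (the bit-by-bit walk and longest-prefix idea are from the bullets; the data structure here is a simplified illustration, not the library's actual layout):

```python
import ipaddress

# Each node has a "0" and "1" child; a node storing 'country' marks the
# end of a known prefix. The deepest such node along the bit path wins
# (longest-prefix match), mirroring the walk described above.
def build_trie(entries):
    root = {}
    for network, country in entries:
        net = ipaddress.ip_network(network)
        bits = format(int(net.network_address), '032b')[:net.prefixlen]
        node = root
        for b in bits:
            node = node.setdefault(b, {})
        node['country'] = country
    return root

def lookup(trie, ip):
    bits = format(int(ipaddress.ip_address(ip)), '032b')
    node, best = trie, None
    for b in bits:
        if 'country' in node:
            best = node['country']  # remember the most specific hit so far
        if b not in node:
            break
        node = node[b]
    else:
        if 'country' in node:
            best = node['country']
    return best

trie = build_trie([("189.195.134.0/24", "MX"), ("189.195.0.0/16", "XX")])
print(lookup(trie, "189.195.134.7"))  # -> MX (most specific prefix wins)
```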

MaurUppi commented 8 months ago

Thanks for the quick explanation.

> 1. screenshot: Greenshot. If you like it (like I do), donate (like I did) 😁

Maybe I misled you: I'm not surprised by the screenshot tool, but by the content you're showing. It looks like it was made by you.

> 2. location.db location depends on what the input is that you gave to your constructor:
>
>     db = LocationDatabase('location.db')
>
> So this means that you will find it in the current working directory (the directory you start your program from).
>
> 3. I created some all_* methods within LocationDatabase. Maybe they could help you? However, the ipfire database is built for fast lookup, and is not really suitable for mass retrieval... Let me try to explain:
>   • you have an IPv6 address (IPv4 is a sub-range of IPv6). This is changed into a bitmask. (Within the library, this is a bitstring, which is less efficient but more readable for debugging.)
>   • for every bit in this bitmask/string you check the node (starting from header.network_tree_offset). Every node has a "0" and "1" key, where each refers to another location in the file. (see database_reader.py line 136)
>   • then when you arrive at a leaf (network value == 0xFFFFFFFF), go back until you find a non-leaf, and that's the root of your IP (and the rest is the subnet mask)
>
> So all this is saying: it's possible to retrieve it all back, but it uses a lot of memory...

The reason I'm thinking of exporting to CSV is that there are a lot of tools that can import a CSV into the MaxMind mmdb format. So I just need to produce a CSV file first, with no need to reinvent the wheel. However, IPFire's location database.txt is not in CSV format at all.

svaningelgem commented 8 months ago

Oh, here is the website

svaningelgem commented 8 months ago

My take of the csv-dump

It did take a long time to run, and I've no clue if I interpreted it correctly... But hey. Here it is 😁: output.zip

MaurUppi commented 8 months ago

> My take of the csv-dump
>
> It did take a long time to run, and I've no clue if I interpreted it correctly... But hey. Here it is 😁: output.zip

I found that libloc includes an export feature: https://man-pages.ipfire.org/libloc/location.html. Is that feature related to location/libloc.git / src / python / location / export.py?

But none of those formats is one I'm familiar with... maybe xt_geoip?

> `export [--directory=DIR] [--format=FORMAT] [--family=ipv6|ipv4] [ASN|CC …]`
>
> This command exports the whole database into the given directory. The output can be filtered by only exporting a certain address family, or by passing a list of country codes and/or ASNs. The default is to export all known countries.
>
> The output format can be chosen with the --format parameter. For possible formats, please see below. If the --directory option is omitted, the output will be written to stdout, which is useful when you want to load any custom exports straight into nftables or ipset.
>
> FORMATS
>
> Some commands allow specifying the output format. This is helpful if the exported data should be imported into a packet filter, for example. The following formats are understood:
>
> - list (default): Just lists all networks, one per line
> - ipset: For ipset
> - nftables: For nftables
> - xt_geoip: Returns a list of networks to be loaded into the xt_geoip kernel module

GeoLite2xtables/20_convert_geolite2

https://github.com/mschmitt/GeoLite2xtables/blob/e7eb9c8ee589b3ea6da71fe1c52e18ceb0c990f6/20_convert_geolite2#L92-L98

svaningelgem commented 6 months ago

Hey @MaurUppi , it seems I have missed your reaction. Was there a question in your reply? Did you mean that you would like to have an export capability?

Or is the script I added in that new branch enough as a guidance on how to loop?

If so, I'll merge the branch & close this issue.

svaningelgem commented 4 months ago

As there was no further reaction, I've merged my changes into the master branch (no release yet, because they're mostly cosmetic).