Dataset generation PoC and next steps (?)

Hi there! Since we got stuck on the IPv4 discussion last time, I wanted to contribute a small PoC so we can get a small MVP started.

To generate the Italian IPv4 lists automatically, I relied on the RIPE APIs, filtering by country (yes, we all know these are not all, or only, Italian networks, just ASNs "self-declared" as IT).
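A rough sketch of what the country filter could look like, not necessarily what the PoC does. It assumes the RIPEstat `country-resource-list` data call and a reply with a `data.resources.ipv4` list that mixes CIDR prefixes and `first-last` ranges; the function names are made up for illustration.

```python
import ipaddress
import json
import urllib.request

# Assumption: the RIPEstat "country-resource-list" data call is the source.
# The post only says "RIPE APIs", so treat this URL as illustrative.
RIPESTAT_URL = "https://stat.ripe.net/data/country-resource-list/data.json?resource=IT"


def fetch_payload(url: str = RIPESTAT_URL) -> dict:
    """Download the raw JSON reply (hypothetical helper, not called here)."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def ipv4_networks(payload: dict) -> list:
    """Normalize the ipv4 entries into a collapsed, sorted list of CIDR networks."""
    networks = []
    for entry in payload["data"]["resources"]["ipv4"]:
        if "-" in entry:
            # Range form "a.b.c.d-e.f.g.h": split it into covering CIDR blocks.
            first, last = (ipaddress.IPv4Address(part.strip()) for part in entry.split("-"))
            networks.extend(ipaddress.summarize_address_range(first, last))
        else:
            networks.append(ipaddress.IPv4Network(entry))
    # Merge adjacent or overlapping blocks so the list stays as short as possible.
    return list(ipaddress.collapse_addresses(networks))
```

Something like `ipv4_networks(fetch_payload())` would then give the list to write out, one prefix per line.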

Once we have the dataset parsed from multiple sources, we should work out how to integrate and enrich it with the results of network scans/scrapes.

I messed around with the pipelines and this is the resulting dataset - repo

The data is scraped every night: the scheduled CI/CD pipeline fetches the raw JSON data from the RIPE APIs, then the same pipeline pushes the parsed data into the same folder (we could also create separate repos for the datasets, and so on).
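To make the nightly step concrete, here is a minimal sketch of the "write raw + parsed data into the same folder" part. The file names `raw.json` and `ipv4.txt` and the `write_snapshot` helper are assumptions for illustration, not the names used in the repo. It returns whether anything changed, so a pipeline could skip the commit on nights with no updates.

```python
import json
from pathlib import Path


def write_snapshot(raw: dict, prefixes: list, out_dir: Path) -> bool:
    """Write the raw JSON and the parsed prefix list side by side.

    Returns True if at least one file was created or modified.
    File names are hypothetical.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    # Stable key order keeps nightly diffs small when the data is unchanged.
    raw_text = json.dumps(raw, indent=2, sort_keys=True) + "\n"
    list_text = "\n".join(prefixes) + "\n"

    changed = False
    for name, text in (("raw.json", raw_text), ("ipv4.txt", list_text)):
        target = out_dir / name
        if not target.exists() or target.read_text() != text:
            target.write_text(text)
            changed = True
    return changed
```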

Once the data is generated, it is publicly available for anyone to use with their favorite scanners (nmap, ZMap, masscan, etc.).

If we want to generate a public raw dataset of scanned IT assets (Shodan-like, but enriched), the only thing we need to do is set up a probe and feed it the input datasets that are updated daily (e.g. using masscan with the raw IP list plus some scraper, then storing the data in a public repo).
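As a sketch of how the probe could consume the daily list, the snippet below only builds a masscan command line from the generated file. The ports, rate, and file names are placeholder assumptions, and `build_masscan_command` / `run_scan` are hypothetical helpers. The flags used (`-iL`, `-p`, `--rate`, `-oJ`) are standard masscan options.

```python
import subprocess
from pathlib import Path


def build_masscan_command(ip_list: Path, output: Path,
                          ports: str = "80,443", rate: int = 1000) -> list:
    """Build the argv for a masscan run over the generated IP list.

    Ports and rate are placeholder values, to be agreed on by the project.
    """
    return [
        "masscan",
        "-iL", str(ip_list),   # read targets from the daily-updated list
        "-p", ports,           # ports to probe
        "--rate", str(rate),   # packets per second, kept deliberately low
        "-oJ", str(output),    # save results as JSON
    ]


def run_scan(ip_list: Path, output: Path) -> None:
    """Run the scan (needs masscan installed and usually root privileges)."""
    subprocess.run(build_masscan_command(ip_list, output), check=True)
```

Keeping the command construction separate from the execution makes it easy to review exactly what the probe will run before it touches the network.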

NOTE:

the scanning server will only scan for information related to services and doesn't store information
the scanning server will send a specific header telling the destination that it is being scanned by a legitimate service (+ courtesy page + content removal policy, etc.)
We should take notes from the Shadowserver Foundation :)
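For the identifying header in the notes above, a minimal sketch of an HTTP probe request could look like this. The User-Agent string, the `X-Scanner-Info` header name, and the `example.org` courtesy-page URL are all placeholders I made up; the real values would be whatever the project decides.

```python
import urllib.request

# Placeholder values: the real courtesy page and scanner name are still to be defined.
COURTESY_URL = "https://example.org/scan-info"
USER_AGENT = f"pci-scanner/0.1 (+{COURTESY_URL})"


def build_probe_request(target_url: str) -> urllib.request.Request:
    """Build a HEAD request that announces who is scanning and where to opt out."""
    return urllib.request.Request(
        target_url,
        method="HEAD",  # headers only, no page content is downloaded
        headers={
            "User-Agent": USER_AGENT,
            "X-Scanner-Info": COURTESY_URL,  # hypothetical custom header
        },
    )
```

Sending it would be a matter of `urllib.request.urlopen(build_probe_request(url), timeout=10)`, and the courtesy page would host the removal policy.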

Sorry if I made any mistakes, I wrote this in a hurry.
