@TylerJFisher commented on Tue Dec 01 2015
Hello,
I have been working on a fork of ooni-api on and off, and have noticed a few usability issues with ooni-api, specifically around which data is exposed to end users via the API.
As it stands, there doesn't seem to be a clear or concise way to access data collected by oonib without making (aggressive) assumptions about how ooni-probe reports are structured, specifically when it comes to handling different report types.
A few of the issues I have noticed are:
- Reports are in YAML format, while developers interested in data analytics will generally want to work with JSON as an intermediate format before moving on to SQL/Pandas/Spark DataFrames/etc.
- Developers interested in data analytics have to write their own extract-transform-load (ETL) tools to work with ooni-probe results (a sketch of what this involves follows this list)
- It's difficult to distinguish between common and optional fields within YAML reports
- There is no straightforward way to access ooni-probe results other than querying the ooni-public S3 bucket directly, bypassing ooni-api (without a JSON/YAML representation of ooni-probe reports, the high-level features captured within PostgreSQL lose much of their value)
- Report filenames within the PostgreSQL database exposed by ooni-api do not directly correspond to the keys used by AWS S3
- Several ooni-probe test names have subtle variations (e.g. bridge_reachability, bridgereachability, bridgeReachability_v2)
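To make the ETL point above concrete, here is a minimal sketch of what third parties currently have to write themselves: it splits a multi-document YAML report into its header/entry/footer records, folds the test-name variants into one canonical name, and emits JSON. The alias table, helper name, and file path are illustrative assumptions, not part of any existing OONI tool.

```python
# Minimal sketch of the ETL step third parties currently write themselves.
# TEST_NAME_ALIASES and the report path below are hypothetical examples.
import json
import yaml

# Fold the naming variations mentioned above into one canonical test name.
TEST_NAME_ALIASES = {
    "bridgereachability": "bridge_reachability",
    "bridgeReachability_v2": "bridge_reachability",
}


def load_report(path):
    """Yield each YAML document in a report (header, entries, footer) as a dict."""
    with open(path) as report:
        for doc in yaml.safe_load_all(report):
            if doc is None:
                continue
            name = doc.get("test_name", "")
            doc["test_name"] = TEST_NAME_ALIASES.get(name, name)
            yield doc


if __name__ == "__main__":
    path = "20150101T060023Z-AS29182-bridge_reachability-v1-probe.yaml"
    for record in load_report(path):
        # record_type separates the common header/footer fields from the
        # per-measurement entries, which carry the optional test-specific keys.
        print(record.get("record_type"), json.dumps(record, default=str)[:80])
```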
In order to handle this, I would like to propose the following amendments to the design of the PostgreSQL DB schema supporting ooni-api, to make it easier for third-party developers (e.g. journalists, artists, and developers interested in measuring and analyzing censorship) to work with the metrics we've collected:
- Store ooni-probe reports in both YAML and JSON within PostgreSQL, in lieu of S3, to allow for easy consumption by both humans and machines (i.e. consumption by humans via the web vs. consumption by machines via REST)
- Use report identifiers that can be consumed directly by the Amazon SDK for S3, instead of hashes/UUIDs
- Try to enforce common field names across test-specs, so that (fairly) generic queries can be performed
By performing the above steps, and by normalizing reporting anomalies (e.g. ooni-probe bridgeT reports not being named bridge_reachability), it would be feasible to construct a star/snowflake schema suitable for performing ad-hoc analytics on ooni-probe test results.
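As a rough illustration of that idea (a sketch under assumed names, not an agreed-upon design), the following shows one possible star schema created through psycopg2: small dimension tables for probe and test metadata, and a fact table whose report_filename would double as the S3 object key and whose entry column keeps the full per-measurement record as JSON.

```python
# Hypothetical star-schema DDL; table and column names are illustrative only.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS dim_probe (
    probe_id  serial PRIMARY KEY,
    probe_asn text,
    probe_cc  text
);

CREATE TABLE IF NOT EXISTS dim_test (
    test_id      serial PRIMARY KEY,
    test_name    text,
    test_version text
);

CREATE TABLE IF NOT EXISTS fact_measurement (
    measurement_id  serial PRIMARY KEY,
    report_filename text,                         -- same string as the S3 object key
    probe_id        integer REFERENCES dim_probe,
    test_id         integer REFERENCES dim_test,
    start_time      timestamptz,
    test_runtime    double precision,
    entry           json                          -- full per-measurement record
);
"""

if __name__ == "__main__":
    # The connection string is a placeholder.
    with psycopg2.connect("dbname=ooni") as conn, conn.cursor() as cur:
        cur.execute(DDL)
```

Promoting a few frequently queried fields (probe_cc, test_name, start_time) to ordinary columns would keep simple queries fast, while the JSON column preserves the optional, test-specific fields.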
Data currently exposed by ooni-api
ooni-api currently exposes [the following properties for all reports](https://raw.githubusercontent.com/TheTorProject/ooni-spec/master/data-formats/df-000-base.md), plus a handful of optional fields which may or may not be documented in ooni-spec (e.g. all of the optional fields in bridge_reachability reports).
Where would these enhancements fit in?
We're already performing an (expensive) JSON -> YAML type conversion in ooni-pipeline in order to transform ooni-probe test results into a form that humans can grasp more firmly than JSON - why not export both the YAML and JSON reports within the pipeline, and store the JSON report in a PostgreSQL JSON field?
Since PostgreSQL provides out-of-the-box support for aggregate queries over JSON fields, I feel that storing both YAML and JSON would be beneficial, and the additional storage space would be negligible. For example:
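Assuming the hypothetical fact_measurement table sketched earlier, an ad-hoc aggregate over the JSON entries might look like this:

```python
# Example ad-hoc aggregate over the hypothetical fact_measurement.entry column:
# count timed-out bridge_reachability measurements per country.
import psycopg2

QUERY = """
SELECT entry ->> 'probe_cc' AS probe_cc,
       count(*)             AS timeouts
FROM   fact_measurement
WHERE  entry ->> 'test_name' = 'bridge_reachability'
  AND  entry ->> 'error' = 'timeout-reached'
GROUP  BY entry ->> 'probe_cc'
ORDER  BY timeouts DESC;
"""

if __name__ == "__main__":
    with psycopg2.connect("dbname=ooni") as conn, conn.cursor() as cur:
        cur.execute(QUERY)
        for probe_cc, timeouts in cur.fetchall():
            print(probe_cc, timeouts)
```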
Example YAML report organized into header/entry/footer sections
```yaml
---
backend_version: 1.1.2
input_hashes: [1d4a3801e94158c0cf0f25605fced50a2f2ab9d538fc9f72c06cfbfa976d1f67]
options: [-f, bridges.txt, -t, '400']
probe_asn: AS29182
probe_cc: BE
probe_city: null
probe_ip: 127.0.0.1
record_type: header
report_filename: 20150101T060023Z-AS29182-bridge_reachability-v1-probe.yaml
report_id: 2015-01-01kfmgdruqhdhmwjyvkgmpeukwyderxomywrinxydg
software_name: ooniprobe
software_version: 1.2.2
start_time: 1420092023.0
test_name: bridge_reachability
test_version: 0.1.2
...
---
backend_version: 1.1.2
bridge_address: null
bridge_hashed_fingerprint: dcdb7afb15187192f4800308db3208055221b86b
distributor: unallocated
error: timeout-reached
input: dcdb7afb15187192f4800308db3208055221b86b
input_hashes: [1d4a3801e94158c0cf0f25605fced50a2f2ab9d538fc9f72c06cfbfa976d1f67]
obfsproxy_log: "2014-12-31 20:43:34,730 [WARNING] Obfsproxy (version: 0.2.12) starting\
\ up.\n2014-12-31 20:43:34,730 [INFO] Entering client managed-mode.\n2014-12-31\
\ 20:43:34,731 [ERROR] \n\n################################################\nDo\
\ NOT rely on ScrambleSuit for strong security!\n################################################\n\
\n2014-12-31 20:43:34,731 [INFO] Creating directory path `/tmp/tortmpK8lBuE/pt_state/scramblesuit/'.\n\
2014-12-31 20:43:34,732 [INFO] OBFSSOCKSv5Factory starting on 42239\n2014-12-31\
\ 20:43:34,732 [INFO] Starting factory <obfsproxy.network.socks.OBFSSOCKSv5Factory\
\ instance at 0x7fed7e2607e8>\n2014-12-31 20:43:34,732 [INFO] Starting up the event\
\ loop.\n2014-12-31 20:50:12,674 [INFO] Received SIGTERM, shutting down.\n2014-12-31\
\ 20:50:12,675 [INFO] (TCP Port 42239 Closed)\n2014-12-31 20:50:12,675 [INFO] Stopping\
\ factory <obfsproxy.network.socks.OBFSSOCKSv5Factory instance at 0x7fed7e2607e8>\n\
2014-12-31 20:50:12,675 [INFO] Main loop terminated.\n"
obfsproxy_version: 0.2.12
options: [-f, bridges.txt, -t, '400']
probe_asn: AS29182
probe_cc: BE
probe_city: null
probe_ip: 127.0.0.1
record_type: entry
report_filename: 20150101T060023Z-AS29182-bridge_reachability-v1-probe.yaml
report_id: 2015-01-01kfmgdruqhdhmwjyvkgmpeukwyderxomywrinxydg
software_name: ooniprobe
software_version: 1.2.2
start_time: 1420092023.0
success: false
test_name: bridge_reachability
test_runtime: 305.4590311050415
test_start_time: 1420094612.0
test_version: 0.1.2
timeout: 400
tor_log: 'Dec 31 20:43:32.000 [notice] Tor 0.2.5.8-rc (git-eaa9ca1011e73a9d) opening
new log file.
Dec 31 20:43:32.000 [notice] Parsing GEOIP IPv4 file /usr/share/tor/geoip.
Dec 31 20:43:32.000 [notice] Parsing GEOIP IPv6 file /usr/share/tor/geoip6.
Dec 31 20:43:32.000 [warn] You are running Tor as root. You don''t need to, and
you probably shouldn''t.
Dec 31 20:43:33.000 [notice] Bootstrapped 0%: Starting
Dec 31 20:43:33.000 [notice] Delaying directory fetches: No running bridges
Dec 31 20:43:33.000 [notice] New control connection opened from 127.0.0.1.
Dec 31 20:43:33.000 [notice] Tor 0.2.5.8-rc (git-eaa9ca1011e73a9d) opening log file.
Dec 31 20:43:36.000 [notice] Bootstrapped 5%: Connecting to directory server
Dec 31 20:43:36.000 [notice] Bootstrapped 10%: Finishing handshake with directory
server
Dec 31 20:48:35.000 [warn] Problem bootstrapping. Stuck at 10%: Finishing handshake
with directory server. (DONE; DONE; count 1; recommendation warn)
Dec 31 20:48:35.000 [warn] 1 connections have failed:
Dec 31 20:48:35.000 [warn] 1 connections died in state handshaking (TLS) with SSL
state unknown state in HANDSHAKE
Dec 31 20:50:12.000 [notice] Catching signal TERM, exiting cleanly.
'
tor_progress: 10
tor_progress_summary: Finishing handshake with directory server
tor_progress_tag: handshake_dir
tor_version: 0.2.5.8-rc
transport: ss
transport_name: scramblesuit
...
---
backend_version: 1.1.2
input_hashes: [1d4a3801e94158c0cf0f25605fced50a2f2ab9d538fc9f72c06cfbfa976d1f67]
options: [-f, bridges.txt, -t, '400']
probe_asn: AS29182
probe_cc: BE
probe_city: null
probe_ip: 127.0.0.1
record_type: footer
report_filename: 20150101T060023Z-AS29182-bridge_reachability-v1-probe.yaml
report_id: 2015-01-01kfmgdruqhdhmwjyvkgmpeukwyderxomywrinxydg
software_name: ooniprobe
software_version: 1.2.2
stage_1_process_time: 1.8597300052642822
start_time: 1420092023.0
test_name: bridge_reachability
test_version: 0.1.2
...
```