Open fortuna opened 6 years ago
If you add order_by=test_start_time, the diff is gone:
for f in /tmp/result{1,2}; do
curl "https://api.ooni.io/api/v1/measurements?probe_cc=IR&test_name=web_connectivity&limit=100&offset=5000&order_by=test_start_time" | grep measurement_id | sort > $f
done
diff /tmp/result{1,2}
However, it takes over a minute to retrieve those few entries, which is unreasonable for my applications.
Yes, this is an issue with how we implement pagination.
For offset/limit-based pagination to work, some ordering has to be enforced on the results. However, ordering by default is currently too slow to be enabled, so it's disabled (hence the inconsistency between the API docs and the implementation).
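To illustrate why offset/limit needs an enforced ordering, here is a minimal sketch using an in-memory SQLite database. The `measurements` table, its columns, and the `fetch_page` helper are hypothetical and are not the actual OONI schema or API code. The point is only that the ORDER BY must end in a unique column (here `id`) so that rows sharing a timestamp always land on the same page.

```python
import sqlite3


def fetch_page(conn, limit, offset):
    """Return one page of ids under a total ordering.

    Ordering by test_start_time alone leaves ties unresolved, so the
    unique id column is appended as a tiebreaker.
    """
    rows = conn.execute(
        "SELECT id FROM measurements "
        "ORDER BY test_start_time, id "
        "LIMIT ? OFFSET ?",
        (limit, offset),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical data: nine rows spread over three timestamps, so ties exist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (id INTEGER PRIMARY KEY, test_start_time TEXT)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(i, f"2018-01-{1 + i % 3:02d}") for i in range(1, 10)],
)

# The same (limit, offset) pair now always yields the same rows.
first = fetch_page(conn, 4, 0)
again = fetch_page(conn, 4, 0)
print(first, again == first)
```

The cost noted above still applies: the database has to sort (or walk an index over) every row before the offset, which is why deep offsets get slow.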
We are thinking of perhaps using some alternative approach to pagination that doesn't depend on ordering of the results, but haven't yet come up with a good system.
This ticket has some related discussion: https://github.com/TheTorProject/ooni-pipeline/issues/48
FYI, I was able to get consistent results, without a significant increase in time, with &order_by=report_no:
$ time for f in /tmp/result{1,2}; do curl "https://api.ooni.io/api/v1/measurements?probe_cc=IR&test_name=web_connectivity&limit=10000&offset=50000&order_by=report_no" | grep measurement_id | sort > $f ; done && diff /tmp/result{1,2}
real 0m9.507s
user 0m0.292s
sys 0m0.132s
This could be useful for inspiration on how to do pagination: https://github.com/djrobstep/sqlakeyset
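For readers unfamiliar with the idea behind that link, keyset (or "seek") pagination replaces the offset with the sort key of the last row already seen. The sketch below shows the general technique with plain `sqlite3`; it does not use sqlakeyset's own API, and the `measurements` table, its columns, and the helper names are hypothetical.

```python
import sqlite3


def fetch_after(conn, limit, last_key=None):
    """Return up to `limit` (test_start_time, id) rows strictly after last_key."""
    if last_key is None:
        return conn.execute(
            "SELECT test_start_time, id FROM measurements "
            "ORDER BY test_start_time, id LIMIT ?",
            (limit,),
        ).fetchall()
    ts, row_id = last_key
    # Seek past the last seen key instead of counting skipped rows.
    return conn.execute(
        "SELECT test_start_time, id FROM measurements "
        "WHERE test_start_time > ? OR (test_start_time = ? AND id > ?) "
        "ORDER BY test_start_time, id LIMIT ?",
        (ts, ts, row_id, limit),
    ).fetchall()


def iter_pages(conn, limit):
    """Yield successive pages until the table is exhausted."""
    last_key = None
    while True:
        page = fetch_after(conn, limit, last_key)
        if not page:
            return
        yield page
        last_key = page[-1]  # key of the last row becomes the next cursor


# Hypothetical data: nine rows spread over three timestamps.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements (id INTEGER PRIMARY KEY, test_start_time TEXT)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(i, f"2018-01-{1 + i % 3:02d}") for i in range(1, 10)],
)

page_ids = [[row[1] for row in page] for page in iter_pages(conn, 4)]
print(page_ids)
```

With an index on the sort key, each page can be located by a seek rather than by scanning every row before the offset. The trade-off is that clients can only move to the next page from a cursor and cannot jump to an arbitrary offset.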
The https://api.ooni.io/files/by_country/US page may need pagination as well. At present it does not finish within the timeout and returns 504 Gateway Time-out.
The API does not return the same results for the same query:
The fact that the returned items are not deterministically determined by limit and offset means the pagination is broken.