ooni / backend

Everything related to OONI backend infrastructure: ooni/api, ooni/pipeline, ooni/sysadmin, collector, bouncers and test-helpers
BSD 3-Clause "New" or "Revised" License

Improve pagination #127

Open fortuna opened 6 years ago

fortuna commented 6 years ago

The API does not return the same results for the same query:

for f in /tmp/result{1,2}; do
    curl "https://api.ooni.io/api/v1/measurements?probe_cc=IR&test_name=web_connectivity&limit=100&offset=5000" | grep measurement_id | sort > $f
done
diff /tmp/result{1,2}

Because the returned items are not uniquely determined by limit and offset, pagination is broken: a client paging through the results can see some measurements twice and miss others.

fortuna commented 6 years ago

If you add order_by=test_start_time, the diff is gone:

for f in /tmp/result{1,2};  do
  curl "https://api.ooni.io/api/v1/measurements?probe_cc=IR&test_name=web_connectivity&limit=100&offset=5000&order_by=test_start_time" | grep measurement_id | sort > $f
done
diff /tmp/result{1,2}

However, it takes over a minute to retrieve those few entries, which is unreasonable for my applications.

hellais commented 6 years ago

Yes, this is an issue with how we implement pagination.

Offset/limit-based pagination requires some ordering to be enforced on the results. However, ordering by default is currently too slow, so it's disabled (hence the inconsistency between the API docs and the implementation).
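To illustrate why offset/limit needs an enforced ordering, here is a minimal Python sketch. It does not touch the real API or database: the `measurements` list, the `page` helper, and the two "scans" are hypothetical stand-ins for a table that the database may return in any physical order when no ordering is requested.

```python
def page(rows, offset, limit, order_key=None):
    """Return one offset/limit page; sort first only if order_key is given."""
    if order_key is not None:
        rows = sorted(rows, key=order_key)
    return rows[offset:offset + limit]

# Hypothetical measurements: (measurement_id, test_start_time).
# Several measurements share the same test_start_time on purpose.
measurements = [(f"m{i:03d}", 1000 + i // 3) for i in range(30)]

# Two scans of the same data in different physical order, which a
# database is free to produce when the query has no ORDER BY.
scan_a = list(measurements)
scan_b = list(reversed(measurements))

# Without ordering, the same offset/limit selects different rows.
unordered_same = page(scan_a, 10, 5) == page(scan_b, 10, 5)

# With ordering on a unique key (ties on test_start_time are broken
# by measurement_id), the page is the same regardless of scan order.
def unique_key(row):
    return (row[1], row[0])

ordered_same = (page(scan_a, 10, 5, unique_key)
                == page(scan_b, 10, 5, unique_key))

print(unordered_same, ordered_same)
```

Note that the sort key should be unique: ordering on a column with ties alone still leaves the order of tied rows unspecified.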

We are considering an alternative approach to pagination that doesn't depend on ordering the results, but haven't come up with a good system yet.

This ticket has some related discussion: https://github.com/TheTorProject/ooni-pipeline/issues/48

fortuna commented 6 years ago

FYI, I was able to get consistent results, without a significant increase in response time, by using &order_by=report_no:

$ time for f in /tmp/result{1,2}; do curl "https://api.ooni.io/api/v1/measurements?probe_cc=IR&test_name=web_connectivity&limit=10000&offset=50000&order_by=report_no" | grep measurement_id | sort > $f ; done && diff /tmp/result{1,2}

real    0m9.507s
user    0m0.292s
sys 0m0.132s
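
For a client that wants to walk all pages with this workaround, the request URLs could be generated along these lines. This is only a sketch: the `page_urls` helper and its `total` argument are hypothetical (the client would need some way to know when to stop), and only the query parameters already used in the commands above are assumed.

```python
from urllib.parse import urlencode

BASE = "https://api.ooni.io/api/v1/measurements"

def page_urls(total, limit, order_by="report_no", **filters):
    """Yield one URL per page, always sending the same order_by."""
    for offset in range(0, total, limit):
        # Filters first, then the paging parameters.
        params = dict(filters, limit=limit, offset=offset, order_by=order_by)
        yield f"{BASE}?{urlencode(params)}"

# `total` here is an assumed upper bound, not a value returned by the API.
urls = list(page_urls(25000, 10000,
                      probe_cc="IR", test_name="web_connectivity"))
for url in urls:
    print(url)
```

The important part is that every request carries the same order_by, so that consecutive offsets refer to the same underlying ordering.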

hellais commented 6 years ago

This could be useful for inspiration on how to do pagination: https://github.com/djrobstep/sqlakeyset
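The general technique behind that library is keyset (seek) pagination: instead of an offset, each request carries the sort key of the last row already seen. Below is a minimal sketch using only Python's built-in sqlite3 module, not the sqlakeyset API. The `measurement` table, its columns, the index name, and the `fetch_page` helper are hypothetical and do not reflect the actual OONI schema.

```python
import sqlite3

def fetch_page(conn, limit, after=None):
    """Return (rows, next_cursor), ordered by (test_start_time, id).

    `after` is the (test_start_time, id) pair of the last row seen,
    or None for the first page.
    """
    if after is None:
        rows = conn.execute(
            "SELECT test_start_time, id FROM measurement "
            "ORDER BY test_start_time, id LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        ts, last_id = after
        rows = conn.execute(
            "SELECT test_start_time, id FROM measurement "
            "WHERE test_start_time > ? "
            "   OR (test_start_time = ? AND id > ?) "
            "ORDER BY test_start_time, id LIMIT ?",
            (ts, ts, last_id, limit),
        ).fetchall()
    # A short page means there is nothing left to fetch.
    next_cursor = rows[-1] if len(rows) == limit else None
    return rows, next_cursor

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurement (id INTEGER PRIMARY KEY, test_start_time INTEGER)"
)
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?)",
    [(i, 100 + i // 4) for i in range(1, 11)],
)
# An index on the sort key lets each page start with an index seek.
conn.execute(
    "CREATE INDEX measurement_keyset ON measurement (test_start_time, id)"
)

seen, cursor = [], None
while True:
    rows, cursor = fetch_page(conn, 3, cursor)
    seen.extend(row[1] for row in rows)
    if cursor is None:
        break
print(seen)
```

With an index on the sort key, the cost of fetching a page does not grow with how deep into the result set the client is, which is the main drawback of large offsets. The trade-off is that clients can only move to the next page; they cannot jump to an arbitrary one.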

darkk commented 5 years ago

The https://api.ooni.io/files/by_country/US page may need pagination as well. At the moment it does not complete within the timeout and returns 504 Gateway Time-out.