scripts.update_feeds uses the mediacloud API client to access the mcweb "sources" API. I believe pagination for mcweb access is implemented by re-issuing a query and selecting a range of the resulting rows.
The query is filtered with: modified_at__gte=modified_since and modified_at__lt=modified_before
I had modified get_queryset so that when both modified_since and modified_before are specified, the results are ordered by id thinking that would eliminate a different race that could occur if the results were ordered by modified_at.
If feed N had been updated in the past five minutes
Then Feed N is modified again (not TERRIBLY unlikely, if someone has edited the feed, and then goes back to fix their fix) WHILE scripts.
update_feeds is running (an admittedly SMALL time window)
If another page is fetched, the list of updated feeds modified in the time window given in the query will have changed: Feed N will no longer appear, making the result set one entry smaller, and an n entry for feed Q that WOULD have appeared first in the NEXT page to be fetched (before Feed N changed again), will no longer appear on that page (it will have slipped onto the previous page, which won't be fetched again), so Feed Q's change will have been lost. BUT, Feed N's updates won't have been lost!
My take is that for a web UI with a human consumer this is acceptable.
My thought at the time was to add ANOTHER parameter to the query, id_after (with filter id__gt) that is passed the feed id of the last feed on the page that was just fetched.
BACKGROUND:
scripts.update_feeds uses the mediacloud API client to access the mcweb "sources" API. I believe pagination for mcweb access is implemented by re-issuing a query and selecting a range of the resulting rows.
scripts.update_feeds uses the feed_list method of DirectoryApi in https://github.com/mediacloud/api-client/blob/main/mediacloud/api.py passing:
Which issues a request to mcweb:/api/sources/feeds endpoint implemented by the
FeedsViewSet
object in https://github.com/mediacloud/web-search/blob/main/mcweb/backend/sources/api.pyThe query is filtered with:
modified_at__gte=modified_since
andmodified_at__lt=modified_before
I had modified
get_queryset
so that when both modified_since and modified_before are specified, the results are ordered byid
thinking that would eliminate a different race that could occur if the results were ordered bymodified_at
.Pagination for the query is implemented by django-rest-framework LimitOffsetPagination, which (if I recall correctly) issues a SQL SELECT statement with LIMIT and OFFSET clauses which select rows based on the result order, see: https://github.com/encode/django-rest-framework/blob/master/rest_framework/pagination.py
SCENARIO:
My take is that for a web UI with a human consumer this is acceptable.
My thought at the time was to add ANOTHER parameter to the query, id_after (with filter id__gt) that is passed the feed id of the last feed on the page that was just fetched.
BUT Django REST framework offers multiple flavors of pagination: https://www.django-rest-framework.org/api-guide/pagination/ and it's possible one solves the problem (but would effect the API for other uses).