mediacloud / rss-fetcher

Intelligently fetch lists of URLs from a large collection of RSS Feeds as part of the Media Cloud Directory.
https://search.mediacloud.org/directory
Apache License 2.0
5 stars 5 forks source link

possible race in scripts.update_feeds process #27

Open philbudne opened 8 months ago

philbudne commented 8 months ago

BACKGROUND:

scripts.update_feeds uses the mediacloud API client to access the mcweb "sources" API. I believe pagination for mcweb access is implemented by re-issuing a query and selecting a range of the resulting rows.

scripts.update_feeds uses the feed_list method of DirectoryApi in https://github.com/mediacloud/api-client/blob/main/mediacloud/api.py passing:

  1. modified_since=mcweb_server_time.time_of_previous_update
  2. modified_before=mcweb_server_time.time_at_start_of_current update
  3. limit=page_size
  4. offset=page_number*page_size

Which issues a request to mcweb:/api/sources/feeds endpoint implemented by the FeedsViewSet object in https://github.com/mediacloud/web-search/blob/main/mcweb/backend/sources/api.py

The query is filtered with: modified_at__gte=modified_since and modified_at__lt=modified_before

I had modified get_queryset so that when both modified_since and modified_before are specified, the results are ordered by id thinking that would eliminate a different race that could occur if the results were ordered by modified_at.

Pagination for the query is implemented by django-rest-framework LimitOffsetPagination, which (if I recall correctly) issues a SQL SELECT statement with LIMIT and OFFSET clauses which select rows based on the result order, see: https://github.com/encode/django-rest-framework/blob/master/rest_framework/pagination.py

SCENARIO:

  1. If feed N had been updated in the past five minutes
  2. Then Feed N is modified again (not TERRIBLY unlikely, if someone has edited the feed, and then goes back to fix their fix) WHILE scripts.
  3. update_feeds is running (an admittedly SMALL time window)
  4. If another page is fetched, the list of updated feeds modified in the time window given in the query will have changed: Feed N will no longer appear, making the result set one entry smaller, and an n entry for feed Q that WOULD have appeared first in the NEXT page to be fetched (before Feed N changed again), will no longer appear on that page (it will have slipped onto the previous page, which won't be fetched again), so Feed Q's change will have been lost. BUT, Feed N's updates won't have been lost!

My take is that for a web UI with a human consumer this is acceptable.

My thought at the time was to add ANOTHER parameter to the query, id_after (with filter id__gt) that is passed the feed id of the last feed on the page that was just fetched.

BUT Django REST framework offers multiple flavors of pagination: https://www.django-rest-framework.org/api-guide/pagination/ and it's possible one solves the problem (but would effect the API for other uses).