radiolarian / AO3Scraper

A Python scraper for getting fan fiction content and metadata from Archive of Our Own.

ao3_work_ids improvements -- load existing seen_ids, retry on 429 #37

Closed katfang closed 2 years ago

katfang commented 2 years ago

Problem: the script quit on me after ~100-150 pages of pulling IDs.

Fixes:

  1. Load existing seen_ids from the provided CSV output file, so new IDs are added to the existing file and duplicate IDs are avoided.
  2. Make seen_ids a set. A minor improvement, but seen_ids is used mostly for lookups, so a set is a better fit.
  3. Retry on 429. A 429 "Retry Later" response (HTTP 429 Too Many Requests) makes the script think there are no more works, so it quits. The script now retries with a 5s delay until it gets a non-429 response.
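
For readers skimming the description, here is a minimal sketch of the three fixes above. It is not the code in this PR: the helper names `load_seen_ids` and `get_with_retry`, the assumption that the work ID sits in the first CSV column, and the injectable `fetch` and `sleep` parameters are all hypothetical and only for illustration.

```python
import csv
import os
import time


def load_seen_ids(csv_path):
    """Return the set of IDs already written to csv_path.

    Assumption: the work ID is the first column of each row.
    A missing file simply yields an empty set.
    """
    seen_ids = set()  # a set gives fast membership checks
    if not os.path.exists(csv_path):
        return seen_ids
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            if row:  # skip blank lines
                seen_ids.add(row[0])
    return seen_ids


def get_with_retry(fetch, url, delay=5, sleep=time.sleep):
    """Call fetch(url) until the response status is not 429.

    Assumption: fetch returns an object with a status_code attribute.
    There is no retry cap, mirroring "retry until a non-429 response".
    """
    while True:
        response = fetch(url)
        if response.status_code != 429:
            return response
        sleep(delay)  # wait before asking again
```

Assuming the `requests` library is the HTTP client, this could be called as `get_with_retry(requests.get, url)`. Passing `fetch` and `sleep` in as parameters is just a choice made here so the retry loop can be exercised without network access or real delays.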