radiolarian / AO3Scraper

A Python scraper for getting fan fiction content and metadata from Archive of Our Own.
172 stars 55 forks

"Access Denied" while scraping #17

Closed theoevans1 closed 3 years ago

theoevans1 commented 3 years ago

Hi! I've tried a few times to run python ao3_get_fanfics.py, and it successfully scrapes around half of the stories, but the rest come back "Access Denied." I tried adding this HTTP header flag, but it didn’t seem to help: --header 'Chrome/88.0.4324.146 (Macintosh; Intel Mac OS X 10.15.7); Theo Evans/University of Chicago/theoevans@uchicago.edu'

Any ideas of what might be going wrong?

Thank you!

jaevibing commented 3 years ago

Access being denied is common. Usually the servers suspect you of being a bot and temporarily block you from scraping. You can't really fix this on your end, though you could try a different header.

ssterman commented 3 years ago

"Access Denied" is the scraper's error message for private fics. The errors file will contain the work IDs for the Access Denied fics; if you navigate directly to that work (e.g. https://archiveofourown.org/works/ID), you can check if that is the issue (or a related access restriction). We cannot (and should not) scrape private fics.
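To make that manual check quicker, here is a minimal sketch that turns a list of work IDs into the URLs described above. The helper name `work_urls` and the sample IDs are hypothetical placeholders; substitute the IDs from your own errors file.

```python
def work_urls(work_ids):
    """Build one archiveofourown.org work URL per non-empty ID."""
    base = "https://archiveofourown.org/works/"
    urls = []
    for work_id in work_ids:
        cleaned = str(work_id).strip()
        if cleaned:  # skip blank entries
            urls.append(base + cleaned)
    return urls


# Placeholder IDs for illustration only
urls = work_urls(["12345", " 67890 ", ""])
for url in urls:
    print(url)
```

Opening a few of these in a browser while logged out should show whether the works are restricted or publicly visible.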

theoevans1 commented 3 years ago

@ssterman Hm, they don't seem to be private or otherwise restricted. It's generally been alternating between several in a row scraped successfully and several in a row Access Denied.

ssterman commented 3 years ago

In that case @jack-debug may be correct; the scraper outputs "access denied" if there was an error or if it can't find the body text, which might happen if you're being blocked. Try increasing the delay between page accesses. You can also extract the failed IDs from the error file and retry only those in a separate batch.
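A rough sketch of that retry workflow follows. It assumes the errors file is a CSV with the work ID in the first column and the error message in the second; check your own file, since the real layout may differ. The names `failed_ids` and `retry_with_delay` are hypothetical, and `fetch` is a stand-in for whatever actually downloads a work.

```python
import csv
import io
import time


def failed_ids(rows, message="Access Denied"):
    """Return unique work IDs whose error text contains `message`, in first-seen order."""
    seen = []
    for row in rows:
        if len(row) < 2:  # skip blank or malformed rows
            continue
        work_id, error = row[0].strip(), row[1]
        if message.lower() in error.lower() and work_id not in seen:
            seen.append(work_id)
    return seen


def retry_with_delay(ids, fetch, delay_seconds=10.0, sleep=time.sleep):
    """Call fetch(work_id) for each ID, pausing between calls.

    `fetch` is assumed to return True on success. Returns the IDs that still fail.
    """
    still_failing = []
    for index, work_id in enumerate(ids):
        if index > 0:
            sleep(delay_seconds)  # wait between page accesses
        if not fetch(work_id):
            still_failing.append(work_id)
    return still_failing


# In-memory data standing in for the errors file (assumed layout: id,message)
sample = io.StringIO(
    "123,Access Denied\n456,Some other error\n123,Access Denied\n789,access denied\n"
)
ids = failed_ids(csv.reader(sample))

# Stand-in fetch that pretends work 789 is still blocked; delay set to 0 for the demo
remaining = retry_with_delay(ids, fetch=lambda work_id: work_id != "789", delay_seconds=0)
print(ids, remaining)
```

For a real run you would open the errors file with `open(path, newline="")` instead of `io.StringIO`, and use a delay of several seconds or more.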

theoevans1 commented 3 years ago

Makes sense, I'll try running the failed IDs again. Thank you for your help!