konklone closed 10 years ago
LGTM :+1: Regarding the broken link, it looks like the single quote marks in the URL are causing problems. If I take them out, I get a regular 404. Google cache has a copy, so this is a recent problem. My guess is they just added a web application firewall. :smirk:
The `sba` scraper wasn't obeying the year range if the `published_on` timestamp wasn't found early on. There's a way in the scraper to hardcode publication dates or find them through other means, but by the time the scraper had gotten there it no longer bothered respecting the `year_range`. I've fixed that, which will make the scraper more efficient for regular running.
I'm also getting an error when fetching a particular landing page -
That's from fetching this page, which gets linked at the date-less entry in this screenshot. There are no permalinks -- I found this by searching for the keyword "originating" and looking at the bottom of page 4 of results.
The new behavior throws a proper exception, but I'm not sure how to handle this. The SBA site is throwing a 500, so I'll report it to the IG. But I don't want the scraper to just skip it. I'll punt on that for a bit, after notifying SBA.
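For anyone following along, here is a minimal sketch of the two behaviors discussed above: checking `year_range` after the publication date has been resolved by whichever path, and raising instead of skipping when a landing page fails. This is not the actual scraper code. The names `resolve_published_on`, `reports_in_range`, `LandingPageError`, `hardcoded_dates`, and the `fetch_landing_page` callable are all hypothetical, and the report dictionaries are an assumed shape.

```python
import datetime


class LandingPageError(Exception):
    """Raised when a report's landing page cannot be fetched (hypothetical)."""


def resolve_published_on(report, hardcoded_dates, fetch_landing_page):
    """Find a publication date: listing first, then hardcoded, then landing page."""
    if report.get("published_on"):
        return report["published_on"]
    if report["report_id"] in hardcoded_dates:
        return hardcoded_dates[report["report_id"]]
    # fetch_landing_page is assumed to return (http_status, date_or_None).
    status, page_date = fetch_landing_page(report["landing_url"])
    if status != 200 or page_date is None:
        # Fail loudly rather than silently dropping the report.
        raise LandingPageError(
            f"{report['landing_url']} returned HTTP {status}"
        )
    return page_date


def reports_in_range(reports, year_range, hardcoded_dates, fetch_landing_page):
    """Keep only reports whose resolved publication year is in year_range."""
    kept = []
    for report in reports:
        published_on = resolve_published_on(
            report, hardcoded_dates, fetch_landing_page
        )
        # Apply the year filter after the date is known, no matter
        # which of the three paths produced it.
        if published_on.year not in year_range:
            continue
        kept.append({**report, "published_on": published_on})
    return kept
```

With this shape, a report whose date only turns up in the hardcoded table is still filtered by year, and a landing page that returns a 500 stops the run with an exception instead of being dropped quietly.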