reinventalbany / esd-crawl

Web crawler to find data on Empire State Development site
MIT License

find all broken links #53

Closed by afeld 1 year ago

afeld commented 1 year ago

Closes https://github.com/reinventalbany/esd-crawl/issues/54.

afeld commented 1 year ago

$ scrapy runspider esd_crawl/spiders/broken.py -L INFO -O results/broken.csv
2022-11-27 13:02:23 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: esd_crawl)
2022-11-27 13:02:23 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.13, cssselect 1.2.0, parsel 1.7.0, w3lib 2.0.1, Twisted 22.10.0, Python 3.10.8 (main, Oct 13 2022, 09:48:40) [Clang 14.0.0 (clang-1400.0.29.102)], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.3, Platform macOS-13.0.1-arm64-arm-64bit
2022-11-27 13:02:23 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'esd_crawl',
 'EDITOR': 'vim',
 'HTTPCACHE_ENABLED': True,
 'HTTPCACHE_EXPIRATION_SECS': 604800,
 'HTTPCACHE_POLICY': 'scrapy.extensions.httpcache.RFC2616Policy',
 'LOG_LEVEL': 'INFO',
 'NEWSPIDER_MODULE': 'esd_crawl.spiders',
 'REFERRER_POLICY': 'unsafe-url',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'SPIDER_LOADER_WARN_ONLY': True,
 'SPIDER_MODULES': ['esd_crawl.spiders'],
 'USER_AGENT': 'esd_crawl (+https://github.com/reinventalbany/esd-crawl)'}
2022-11-27 13:02:23 [scrapy.extensions.telnet] INFO: Telnet Password: 8440d09556db1be1
2022-11-27 13:02:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2022-11-27 13:02:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats',
 'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware']
2022-11-27 13:02:23 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-11-27 13:02:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-11-27 13:02:23 [scrapy.core.engine] INFO: Spider opened
2022-11-27 13:02:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-11-27 13:02:23 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-11-27 13:03:11 [scrapy.core.scraper] ERROR: Error downloading <GET http://Laura.Magee@esd.ny.gov>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
ValueError: invalid hostname: Laura.Magee@esd.ny.gov
2022-11-27 13:03:11 [scrapy.core.scraper] ERROR: Error downloading <GET http://pressoffice@esd.ny.gov>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
ValueError: invalid hostname: pressoffice@esd.ny.gov
2022-11-27 13:03:23 [scrapy.extensions.logstats] INFO: Crawled 1629 pages (at 1629 pages/min), scraped 5 items (at 5 items/min)
2022-11-27 13:04:23 [scrapy.extensions.logstats] INFO: Crawled 4111 pages (at 2482 pages/min), scraped 10 items (at 5 items/min)
2022-11-27 13:05:23 [scrapy.extensions.logstats] INFO: Crawled 6589 pages (at 2478 pages/min), scraped 39 items (at 29 items/min)
2022-11-27 13:05:46 [scrapy.core.scraper] ERROR: Error downloading <GET http://mailto:%25E2%2580%258BMo.Sumdundu@esd.ny.gov>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
ValueError: invalid hostname: mailto:%25E2%2580%258BMo.Sumdundu@esd.ny.gov
2022-11-27 13:06:23 [scrapy.extensions.logstats] INFO: Crawled 8779 pages (at 2190 pages/min), scraped 50 items (at 11 items/min)
2022-11-27 13:07:23 [scrapy.extensions.logstats] INFO: Crawled 10402 pages (at 1623 pages/min), scraped 69 items (at 19 items/min)
2022-11-27 13:08:23 [scrapy.extensions.logstats] INFO: Crawled 13168 pages (at 2766 pages/min), scraped 86 items (at 17 items/min)
2022-11-27 13:09:23 [scrapy.extensions.logstats] INFO: Crawled 15693 pages (at 2525 pages/min), scraped 89 items (at 3 items/min)
2022-11-27 13:10:23 [scrapy.extensions.logstats] INFO: Crawled 18408 pages (at 2715 pages/min), scraped 94 items (at 5 items/min)
2022-11-27 13:11:23 [scrapy.extensions.logstats] INFO: Crawled 21749 pages (at 3341 pages/min), scraped 105 items (at 11 items/min)
2022-11-27 13:12:23 [scrapy.extensions.logstats] INFO: Crawled 24271 pages (at 2522 pages/min), scraped 123 items (at 18 items/min)
2022-11-27 13:13:14 [scrapy.spidermiddlewares.urllength] INFO: Ignoring link (url length > 2083): http://empire state development (esd) today announced that new york state has designated four undeveloped parcels at sterling business park in orchard park as “shovel ready,” ensuring the site’s readiness for development as a multi-tenant medical, business and technology park. the four parcels, comprising the 23.9 available acres, have satisfied the requirements for new york state shovel ready certification and have been pre-qualified for development, meaning that businesses will be able to locate there without delay. companies will still need to obtain permits to operate that will be based on their individual needs and circumstances, but they can be assured there are no wetlands, endangered species, historic artifacts or other concerns that might delay a project. utilities, including electricity, gas, water, sewer and fiber, are also available to the site.   “new york state’s shovel ready program provides valuable enticements to businesses, as well as job opportunities for local residents,” said empire state development president, ceo & commissioner howard zemsky. “by gaining this certification for the sterling business park, the potential to attract developers and business interest to orchard park is increased substantially, while the time to complete any project is significantly reduced.”  “we expect that this shovel ready designation will be the catalyst to spur more development within the park as buyers will have the confidence that their projects can be approved in a timely manner,” said orchard park commerce center president jeff steinwachs.    the 23.9 acres of undeveloped property is part of a larger, 100-acre business park that is already home to several commercial businesses.  the available parcels range from 1.6 to 11.5 acres.  the properties are located adjacent to u.s. route 219 and are minutes from the new york state thruway, buffalo niagara international airport and city of buffalo.  the site may be eligible for new york power authority (nypa) low-cost hydroelectric power allocation and western new york power proceeds allocations, which businesses will have to apply for separately.  current occupants of the sterling business park include medical, office and light industrial uses.    sales for the shovel ready parcels are being handled by richard j. schechter, associate real estate broker at pyramid brokerage company.  for more information on these parcels at sterling business park, contact richard schechter at 716-852-7500, ext. 102 or 716-316-4040 (cell), or jeff steinwachs at 941-383-0148.  shovel ready certification is administered by empire state development, the state’s lead economic development organization.  the program is an ongoing component of the build now-ny program.  the state does not provide funding for shovel ready applications, but does coordinate the review of each application with all relevant state and federal agencies.  in this case, as part of the process, the site was evaluated by representatives of esd, the new york state department of environmental conservation, new york state department of transportation, new york state office of historic preservation, new york state department of agriculture & markets and the u.s. army corp. of engineers.  each of these agencies has looked at the potential for development at this location and resolved any concerns in advance of a project.   for more information on new york state’s shovel ready program, visit https//esd.ny.gov/businessprograms/data/buildnow/
2022-11-27 13:13:23 [scrapy.extensions.logstats] INFO: Crawled 26416 pages (at 2145 pages/min), scraped 153 items (at 30 items/min)
2022-11-27 13:13:40 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.canalsidebuffalo.com"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'www.canalsidebuffalo.com'))])
2022-11-27 13:13:40 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.canalsidebuffalo.com"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'www.canalsidebuffalo.com'))])
2022-11-27 13:13:40 [scrapy.core.downloader.tls] WARNING: Remote certificate is not valid for hostname "www.canalsidebuffalo.com"; VerificationError(errors=[DNSMismatch(mismatched_id=DNS_ID(hostname=b'www.canalsidebuffalo.com'))])
2022-11-27 13:13:48 [scrapy.spidermiddlewares.urllength] INFO: Ignoring link (url length > 2083): http://nysfilm.com/Film_Making_Schools.cfm?fp=VBQKkpG%2BfGiGvWraXKsuL31rxhqnbknWCd8Hjb%2Fp4SwZTho6kiSlEAmL6tfpg3lmtflBxxF7%2BHZZBokMNJCOyYhbzTj2OIgo7oINmspKWiOr%2BK%2F%2BgNym7lbxSp3sEgOKnF46Gq7CWyeE3Ykm7kdtIWg%2Fv70uN196WanrzG85H%2FS3NvEERcrbVdx9ZroqHRsnEdzxfErXAzbkG4crjYo1kgmNhNFPaTQHk0eOA%2F7N3%2Bjr1ewYkGRxKjwW75%2BHsub1PhP2GZFiawCHjT%2Faa4ynLUye9PFXXseO4J9rBu1PRHYOvJ446rnbuBvomZhjaS2A&yep=KqDMXoygkAGhDKndKjQRQHVL64hoCApvRfAJGNeXARSJPJLJGfiLbPNxKbOqjShz3ZAgaiGMLebyM9tWpnsQXDarStXs1LK6Mj%2F52gzbT5vR7NBGcZdm0TnKW0RwQGJfTwaifOiD71wo43NyuBc08yFm9RWKqQvhDTfr54CuSqEbYN857cyGAi9zWiBNgM35ZfDZuwwqFhfq71Cw7KEnxI6l4%2B%2BdjT6DvUCagXjpAh%2Fuer%2FiwcgVcWHPloB4nrqksp3ggfECd%2B4OjxvKslAwfXZVuopUl3f8bYECOggodXQLO%2BQRqDYBMsE1HG6x1wZeW3in4nKxO5byhlR1RwjBQzgnMIFYRgoFEO9HQODPcgyWyDNFtsWLb2qb93BAQKhIuSLCenmMRzjpIJXQdtagiU4abzvPa1ikO49lEZQQl%2BF%2BR%2BtZIo5FxL%2FCHdsR4cssjJB7O1r5EaTGRjDOUAf0nAbXDcAooWQBXkBL7QUc0brCXa33au65SMve75ZLvMcsXwrVs3vZ45qEwEt8U8Zb4UexVDcJ8dotORp3mmRjqQ4CoP62RQEdClru9yqnAumUAGsZ%2B6ktZzhVt5yW9wbROyrougIE7Ru6yKN%2BuRzyn0aXt9vP42%2F0IslS12Md2laxIru03%2FfVh68mNB3lzVcT%2FJVET0FVckYKCZyEl4Su2fbknbLdkX%2BQbkAHf%2Fjdk0sALmrcoB8FC8VavB0i6WGKYJgsOp7Q8qUQ1RtBd0oyCbe%2FXS30ZrzHullL8V%2Bg2TPMfbmDAe%2FOYilQZiSKwNttPdueK56DZV88ztC8gKgLu%2FX3nKIi8XEFDEY9ssMW8rBKSCrJ6x%2F8cmiQkDUJWKebULg3GpgeEBmjGAs7e4Af%2BRNIr4bAaQg6IcD1Ogukb%2BZTDyuFXlM%2BbY09bD%2BfV8dkQtMld4brPhOvsnQSl1cQ%2BjRGYgCggfF9TsJ8rUWQaWhi2QlXUTd1S35uxJIol61%2BJzeYkBBRk0Z0UgUor8eQUkNrOSROU5ZHbDvUUDXV3aH6MnidiQC7HEJkQJrehSTWhnbIhMXVm4mrr3SLSpZ%2FH49isNPQ24uIs9Z%2Fpgm0hi9O%2FfX9nr2QXSBIyx9q0K%2FrVgrULEvCpu3%2BaE3IQhaWO14hcLsBagizg8x2ouqap4mIg1s6%2BRZnunHKV4ukJRxGDK%2Fz0UBcF2jYopWWZtPNupV%2BoggtA5TuLueet1nGuXlfL%2Bm8EoUwA%2BqSyafXzMQYwFrPRS6mbvJYgJdrULVMxX89sLk5f6B%2BA9Bx2bmjJr9UQ9%2Fm1UHa%2BIEL5NPHnyxtpngksCiK6ewHYRPf6bPPDLWIdU%2Bv7u7OLt1VcmV4s9trXQ11w02fRiwty7609XkxgOIpcTs6V7nCyjWYu7f6aa7dsw2j5PQ7SUh%2B8gOQRPZAORE5zY00Eap
SIQTIICqLszfk7pkGEPSojpBNqe5ahTKFsWmHGjbJrxLkwyNJv%2F9cmotnuJ%2BorFVJN0%2FYgnmlNk1bpyoSLLSm0AylbsPAJyAOfxkJKilQ2ieNkDzW0ZEejKEe5Zieq9rx%2Fsaa169q1Sa1UQ0hSOQxSzo36iLaZJTUlNzkQ8mN%2BmYJ%2FwgvxFwEktGqdI5AWnyx985gyxkYY1vSBy3NkCXgJ22%2BtL3IaHa872htB19wk7%2Bz3O4Wiy2nr0iiolJEiPqrBLVwlHr1O74G9audM4rA9yX60bwEc%2Fa0ovDtac9KZrag8efLlPUzC0qhmxm8mqfXQJXlvL8x6VR9zlOFszFgAmCGwcmXhK4%3D&gtnp=0&gtpp=0&kbetu=1&maxads=0&kld=1042&yprpnd=UHM6ofc%2BmzTMdphcWy%2Bzzw%3D%3D&_opnslfp=1&prvtof=YxnOaFd0fqy%2FqegeV7DTyI03rUjdtEBPkt771s7v%2FoOGMM%2BJECey3BbwrKohaG5zXbGblipJzY0UOfjuaVUyP90ZeSNUDjLQjmu2QOE4eKXCHg1U1zwZs%2FliGHSy1Chd4iimRQUeK9BdEm%2B9ovcksg%3D%3D&bkt=14907&d_bkt=14907&skpc=1&&gtnp=0&gtpp=0&kt=362&&kbc=nys+film&ki=10795574&ktd=0&kld=1042&kp=1
2022-11-27 13:13:48 [scrapy.spidermiddlewares.urllength] INFO: Ignoring link (url length > 2083): http://nysfilm.com/Film_School_College.cfm?fp=VBQKkpG%2BfGiGvWraXKsuL31rxhqnbknWCd8Hjb%2Fp4SwZTho6kiSlEAmL6tfpg3lmtflBxxF7%2BHZZBokMNJCOyYhbzTj2OIgo7oINmspKWiOr%2BK%2F%2BgNym7lbxSp3sEgOKnF46Gq7CWyeE3Ykm7kdtIWg%2Fv70uN196WanrzG85H%2FS3NvEERcrbVdx9ZroqHRsnEdzxfErXAzbkG4crjYo1kgmNhNFPaTQHk0eOA%2F7N3%2Bjr1ewYkGRxKjwW75%2BHsub1PhP2GZFiawCHjT%2Faa4ynLUye9PFXXseO4J9rBu1PRHYOvJ446rnbuBvomZhjaS2A&yep=KqDMXoygkAGhDKndKjQRQHVL64hoCApvRfAJGNeXARSJPJLJGfiLbPNxKbOqjShz3ZAgaiGMLebyM9tWpnsQXDarStXs1LK6Mj%2F52gzbT5vR7NBGcZdm0TnKW0RwQGJfTwaifOiD71wo43NyuBc08yFm9RWKqQvhDTfr54CuSqEbYN857cyGAi9zWiBNgM35ZfDZuwwqFhfq71Cw7KEnxI6l4%2B%2BdjT6DvUCagXjpAh%2Fuer%2FiwcgVcWHPloB4nrqksp3ggfECd%2B4OjxvKslAwfXZVuopUl3f8bYECOggodXQLO%2BQRqDYBMsE1HG6x1wZeW3in4nKxO5byhlR1RwjBQzgnMIFYRgoFEO9HQODPcgyWyDNFtsWLb2qb93BAQKhIuSLCenmMRzjpIJXQdtagiU4abzvPa1ikO49lEZQQl%2BF%2BR%2BtZIo5FxL%2FCHdsR4cssjJB7O1r5EaTGRjDOUAf0nAbXDcAooWQBXkBL7QUc0brCXa33au65SMve75ZLvMcsXwrVs3vZ45qEwEt8U8Zb4UexVDcJ8dotORp3mmRjqQ4CoP62RQEdClru9yqnAumUAGsZ%2B6ktZzhVt5yW9wbROyrougIE7Ru6yKN%2BuRzyn0aXt9vP42%2F0IslS12Md2laxIru03%2FfVh68mNB3lzVcT%2FJVET0FVckYKCZyEl4Su2fbknbLdkX%2BQbkAHf%2Fjdk0sALmrcoB8FC8VavB0i6WGKYJgsOp7Q8qUQ1RtBd0oyCbe%2FXS30ZrzHullL8V%2Bg2TPMfbmDAe%2FOYilQZiSKwNttPdueK56DZV88ztC8gKgLu%2FX3nKIi8XEFDEY9ssMW8rBKSCrJ6x%2F8cmiQkDUJWKebULg3GpgeEBmjGAs7e4Af%2BRNIr4bAaQg6IcD1Ogukb%2BZTDyuFXlM%2BbY09bD%2BfV8dkQtMld4brPhOvsnQSl1cQ%2BjRGYgCggfF9TsJ8rUWQaWhi2QlXUTd1S35uxJIol61%2BJzeYkBBRk0Z0UgUor8eQUkNrOSROU5ZHbDvUUDXV3aH6MnidiQC7HEJkQJrehSTWhnbIhMXVm4mrr3SLSpZ%2FH49isNPQ24uIs9Z%2Fpgm0hi9O%2FfX9nr2QXSBIyx9q0K%2FrVgrULEvCpu3%2BaE3IQhaWO14hcLsBagizg8x2ouqap4mIg1s6%2BRZnunHKV4ukJRxGDK%2Fz0UBcF2jYopWWZtPNupV%2BoggtA5TuLueet1nGuXlfL%2Bm8EoUwA%2BqSyafXzMQYwFrPRS6mbvJYgJdrULVMxX89sLk5f6B%2BA9Bx2bmjJr9UQ9%2Fm1UHa%2BIEL5NPHnyxtpngksCiK6ewHYRPf6bPPDLWIdU%2Bv7u7OLt1VcmV4s9trXQ11w02fRiwty7609XkxgOIpcTs6V7nCyjWYu7f6aa7dsw2j5PQ7SUh%2B8gOQRPZAORE5zY00Eap
SIQTIICqLszfk7pkGEPSojpBNqe5ahTKFsWmHGjbJrxLkwyNJv%2F9cmotnuJ%2BorFVJN0%2FYgnmlNk1bpyoSLLSm0AylbsPAJyAOfxkJKilQ2ieNkDzW0ZEejKEe5Zieq9rx%2Fsaa169q1Sa1UQ0hSOQxSzo36iLaZJTUlNzkQ8mN%2BmYJ%2FwgvxFwEktGqdI5AWnyx985gyxkYY1vSBy3NkCXgJ22%2BtL3IaHa872htB19wk7%2Bz3O4Wiy2nr0iiolJEiPqrBLVwlHr1O74G9audM4rA9yX60bwEc%2Fa0ovDtac9KZrag8efLlPUzC0qhmxm8mqfXQJXlvL8x6VR9zlOFszFgAmCGwcmXhK4%3D&gtnp=0&gtpp=0&kbetu=1&maxads=0&kld=1042&yprpnd=UHM6ofc%2BmzTMdphcWy%2Bzzw%3D%3D&_opnslfp=1&prvtof=YxnOaFd0fqy%2FqegeV7DTyI03rUjdtEBPkt771s7v%2FoOGMM%2BJECey3BbwrKohaG5zXbGblipJzY0UOfjuaVUyP90ZeSNUDjLQjmu2QOE4eKXCHg1U1zwZs%2FliGHSy1Chd4iimRQUeK9BdEm%2B9ovcksg%3D%3D&bkt=14907&d_bkt=14907&skpc=1&&gtnp=0&gtpp=0&kt=362&&kbc=nys+film&ki=10797651&ktd=0&kld=1042&kp=2
2022-11-27 13:13:48 [scrapy.spidermiddlewares.urllength] INFO: Ignoring link (url length > 2083): http://nysfilm.com/NY_Film_Schools.cfm?fp=VBQKkpG%2BfGiGvWraXKsuL31rxhqnbknWCd8Hjb%2Fp4SwZTho6kiSlEAmL6tfpg3lmtflBxxF7%2BHZZBokMNJCOyYhbzTj2OIgo7oINmspKWiOr%2BK%2F%2BgNym7lbxSp3sEgOKnF46Gq7CWyeE3Ykm7kdtIWg%2Fv70uN196WanrzG85H%2FS3NvEERcrbVdx9ZroqHRsnEdzxfErXAzbkG4crjYo1kgmNhNFPaTQHk0eOA%2F7N3%2Bjr1ewYkGRxKjwW75%2BHsub1PhP2GZFiawCHjT%2Faa4ynLUye9PFXXseO4J9rBu1PRHYOvJ446rnbuBvomZhjaS2A&yep=KqDMXoygkAGhDKndKjQRQHVL64hoCApvRfAJGNeXARSJPJLJGfiLbPNxKbOqjShz3ZAgaiGMLebyM9tWpnsQXDarStXs1LK6Mj%2F52gzbT5vR7NBGcZdm0TnKW0RwQGJfTwaifOiD71wo43NyuBc08yFm9RWKqQvhDTfr54CuSqEbYN857cyGAi9zWiBNgM35ZfDZuwwqFhfq71Cw7KEnxI6l4%2B%2BdjT6DvUCagXjpAh%2Fuer%2FiwcgVcWHPloB4nrqksp3ggfECd%2B4OjxvKslAwfXZVuopUl3f8bYECOggodXQLO%2BQRqDYBMsE1HG6x1wZeW3in4nKxO5byhlR1RwjBQzgnMIFYRgoFEO9HQODPcgyWyDNFtsWLb2qb93BAQKhIuSLCenmMRzjpIJXQdtagiU4abzvPa1ikO49lEZQQl%2BF%2BR%2BtZIo5FxL%2FCHdsR4cssjJB7O1r5EaTGRjDOUAf0nAbXDcAooWQBXkBL7QUc0brCXa33au65SMve75ZLvMcsXwrVs3vZ45qEwEt8U8Zb4UexVDcJ8dotORp3mmRjqQ4CoP62RQEdClru9yqnAumUAGsZ%2B6ktZzhVt5yW9wbROyrougIE7Ru6yKN%2BuRzyn0aXt9vP42%2F0IslS12Md2laxIru03%2FfVh68mNB3lzVcT%2FJVET0FVckYKCZyEl4Su2fbknbLdkX%2BQbkAHf%2Fjdk0sALmrcoB8FC8VavB0i6WGKYJgsOp7Q8qUQ1RtBd0oyCbe%2FXS30ZrzHullL8V%2Bg2TPMfbmDAe%2FOYilQZiSKwNttPdueK56DZV88ztC8gKgLu%2FX3nKIi8XEFDEY9ssMW8rBKSCrJ6x%2F8cmiQkDUJWKebULg3GpgeEBmjGAs7e4Af%2BRNIr4bAaQg6IcD1Ogukb%2BZTDyuFXlM%2BbY09bD%2BfV8dkQtMld4brPhOvsnQSl1cQ%2BjRGYgCggfF9TsJ8rUWQaWhi2QlXUTd1S35uxJIol61%2BJzeYkBBRk0Z0UgUor8eQUkNrOSROU5ZHbDvUUDXV3aH6MnidiQC7HEJkQJrehSTWhnbIhMXVm4mrr3SLSpZ%2FH49isNPQ24uIs9Z%2Fpgm0hi9O%2FfX9nr2QXSBIyx9q0K%2FrVgrULEvCpu3%2BaE3IQhaWO14hcLsBagizg8x2ouqap4mIg1s6%2BRZnunHKV4ukJRxGDK%2Fz0UBcF2jYopWWZtPNupV%2BoggtA5TuLueet1nGuXlfL%2Bm8EoUwA%2BqSyafXzMQYwFrPRS6mbvJYgJdrULVMxX89sLk5f6B%2BA9Bx2bmjJr9UQ9%2Fm1UHa%2BIEL5NPHnyxtpngksCiK6ewHYRPf6bPPDLWIdU%2Bv7u7OLt1VcmV4s9trXQ11w02fRiwty7609XkxgOIpcTs6V7nCyjWYu7f6aa7dsw2j5PQ7SUh%2B8gOQRPZAORE5zY00EapSIQT
IICqLszfk7pkGEPSojpBNqe5ahTKFsWmHGjbJrxLkwyNJv%2F9cmotnuJ%2BorFVJN0%2FYgnmlNk1bpyoSLLSm0AylbsPAJyAOfxkJKilQ2ieNkDzW0ZEejKEe5Zieq9rx%2Fsaa169q1Sa1UQ0hSOQxSzo36iLaZJTUlNzkQ8mN%2BmYJ%2FwgvxFwEktGqdI5AWnyx985gyxkYY1vSBy3NkCXgJ22%2BtL3IaHa872htB19wk7%2Bz3O4Wiy2nr0iiolJEiPqrBLVwlHr1O74G9audM4rA9yX60bwEc%2Fa0ovDtac9KZrag8efLlPUzC0qhmxm8mqfXQJXlvL8x6VR9zlOFszFgAmCGwcmXhK4%3D&gtnp=0&gtpp=0&kbetu=1&maxads=0&kld=1042&yprpnd=UHM6ofc%2BmzTMdphcWy%2Bzzw%3D%3D&_opnslfp=1&prvtof=YxnOaFd0fqy%2FqegeV7DTyI03rUjdtEBPkt771s7v%2FoOGMM%2BJECey3BbwrKohaG5zXbGblipJzY0UOfjuaVUyP90ZeSNUDjLQjmu2QOE4eKXCHg1U1zwZs%2FliGHSy1Chd4iimRQUeK9BdEm%2B9ovcksg%3D%3D&bkt=14907&d_bkt=14907&skpc=1&&gtnp=0&gtpp=0&kt=362&&kbc=nys+film&ki=20886832&ktd=0&kld=1042&kp=3
2022-11-27 13:13:48 [scrapy.spidermiddlewares.urllength] INFO: Ignoring link (url length > 2083): http://nysfilm.com/Film_School_Blog.cfm?fp=VBQKkpG%2BfGiGvWraXKsuL31rxhqnbknWCd8Hjb%2Fp4SwZTho6kiSlEAmL6tfpg3lmtflBxxF7%2BHZZBokMNJCOyYhbzTj2OIgo7oINmspKWiOr%2BK%2F%2BgNym7lbxSp3sEgOKnF46Gq7CWyeE3Ykm7kdtIWg%2Fv70uN196WanrzG85H%2FS3NvEERcrbVdx9ZroqHRsnEdzxfErXAzbkG4crjYo1kgmNhNFPaTQHk0eOA%2F7N3%2Bjr1ewYkGRxKjwW75%2BHsub1PhP2GZFiawCHjT%2Faa4ynLUye9PFXXseO4J9rBu1PRHYOvJ446rnbuBvomZhjaS2A&yep=KqDMXoygkAGhDKndKjQRQHVL64hoCApvRfAJGNeXARSJPJLJGfiLbPNxKbOqjShz3ZAgaiGMLebyM9tWpnsQXDarStXs1LK6Mj%2F52gzbT5vR7NBGcZdm0TnKW0RwQGJfTwaifOiD71wo43NyuBc08yFm9RWKqQvhDTfr54CuSqEbYN857cyGAi9zWiBNgM35ZfDZuwwqFhfq71Cw7KEnxI6l4%2B%2BdjT6DvUCagXjpAh%2Fuer%2FiwcgVcWHPloB4nrqksp3ggfECd%2B4OjxvKslAwfXZVuopUl3f8bYECOggodXQLO%2BQRqDYBMsE1HG6x1wZeW3in4nKxO5byhlR1RwjBQzgnMIFYRgoFEO9HQODPcgyWyDNFtsWLb2qb93BAQKhIuSLCenmMRzjpIJXQdtagiU4abzvPa1ikO49lEZQQl%2BF%2BR%2BtZIo5FxL%2FCHdsR4cssjJB7O1r5EaTGRjDOUAf0nAbXDcAooWQBXkBL7QUc0brCXa33au65SMve75ZLvMcsXwrVs3vZ45qEwEt8U8Zb4UexVDcJ8dotORp3mmRjqQ4CoP62RQEdClru9yqnAumUAGsZ%2B6ktZzhVt5yW9wbROyrougIE7Ru6yKN%2BuRzyn0aXt9vP42%2F0IslS12Md2laxIru03%2FfVh68mNB3lzVcT%2FJVET0FVckYKCZyEl4Su2fbknbLdkX%2BQbkAHf%2Fjdk0sALmrcoB8FC8VavB0i6WGKYJgsOp7Q8qUQ1RtBd0oyCbe%2FXS30ZrzHullL8V%2Bg2TPMfbmDAe%2FOYilQZiSKwNttPdueK56DZV88ztC8gKgLu%2FX3nKIi8XEFDEY9ssMW8rBKSCrJ6x%2F8cmiQkDUJWKebULg3GpgeEBmjGAs7e4Af%2BRNIr4bAaQg6IcD1Ogukb%2BZTDyuFXlM%2BbY09bD%2BfV8dkQtMld4brPhOvsnQSl1cQ%2BjRGYgCggfF9TsJ8rUWQaWhi2QlXUTd1S35uxJIol61%2BJzeYkBBRk0Z0UgUor8eQUkNrOSROU5ZHbDvUUDXV3aH6MnidiQC7HEJkQJrehSTWhnbIhMXVm4mrr3SLSpZ%2FH49isNPQ24uIs9Z%2Fpgm0hi9O%2FfX9nr2QXSBIyx9q0K%2FrVgrULEvCpu3%2BaE3IQhaWO14hcLsBagizg8x2ouqap4mIg1s6%2BRZnunHKV4ukJRxGDK%2Fz0UBcF2jYopWWZtPNupV%2BoggtA5TuLueet1nGuXlfL%2Bm8EoUwA%2BqSyafXzMQYwFrPRS6mbvJYgJdrULVMxX89sLk5f6B%2BA9Bx2bmjJr9UQ9%2Fm1UHa%2BIEL5NPHnyxtpngksCiK6ewHYRPf6bPPDLWIdU%2Bv7u7OLt1VcmV4s9trXQ11w02fRiwty7609XkxgOIpcTs6V7nCyjWYu7f6aa7dsw2j5PQ7SUh%2B8gOQRPZAORE5zY00EapSIQ
TIICqLszfk7pkGEPSojpBNqe5ahTKFsWmHGjbJrxLkwyNJv%2F9cmotnuJ%2BorFVJN0%2FYgnmlNk1bpyoSLLSm0AylbsPAJyAOfxkJKilQ2ieNkDzW0ZEejKEe5Zieq9rx%2Fsaa169q1Sa1UQ0hSOQxSzo36iLaZJTUlNzkQ8mN%2BmYJ%2FwgvxFwEktGqdI5AWnyx985gyxkYY1vSBy3NkCXgJ22%2BtL3IaHa872htB19wk7%2Bz3O4Wiy2nr0iiolJEiPqrBLVwlHr1O74G9audM4rA9yX60bwEc%2Fa0ovDtac9KZrag8efLlPUzC0qhmxm8mqfXQJXlvL8x6VR9zlOFszFgAmCGwcmXhK4%3D&gtnp=0&gtpp=0&kbetu=1&maxads=0&kld=1042&yprpnd=UHM6ofc%2BmzTMdphcWy%2Bzzw%3D%3D&_opnslfp=1&prvtof=YxnOaFd0fqy%2FqegeV7DTyI03rUjdtEBPkt771s7v%2FoOGMM%2BJECey3BbwrKohaG5zXbGblipJzY0UOfjuaVUyP90ZeSNUDjLQjmu2QOE4eKXCHg1U1zwZs%2FliGHSy1Chd4iimRQUeK9BdEm%2B9ovcksg%3D%3D&bkt=14907&d_bkt=14907&skpc=1&&gtnp=0&gtpp=0&kt=362&&kbc=nys+film&ki=83400822&ktd=0&kld=1042&kp=4
2022-11-27 13:13:48 [scrapy.spidermiddlewares.urllength] INFO: Ignoring link (url length > 2083): http://nysfilm.com/2019_Action_Movies.cfm?fp=VBQKkpG%2BfGiGvWraXKsuL31rxhqnbknWCd8Hjb%2Fp4SwZTho6kiSlEAmL6tfpg3lmtflBxxF7%2BHZZBokMNJCOyYhbzTj2OIgo7oINmspKWiOr%2BK%2F%2BgNym7lbxSp3sEgOKnF46Gq7CWyeE3Ykm7kdtIWg%2Fv70uN196WanrzG85H%2FS3NvEERcrbVdx9ZroqHRsnEdzxfErXAzbkG4crjYo1kgmNhNFPaTQHk0eOA%2F7N3%2Bjr1ewYkGRxKjwW75%2BHsub1PhP2GZFiawCHjT%2Faa4ynLUye9PFXXseO4J9rBu1PRHYOvJ446rnbuBvomZhjaS2A&yep=KqDMXoygkAGhDKndKjQRQHVL64hoCApvRfAJGNeXARSJPJLJGfiLbPNxKbOqjShz3ZAgaiGMLebyM9tWpnsQXDarStXs1LK6Mj%2F52gzbT5vR7NBGcZdm0TnKW0RwQGJfTwaifOiD71wo43NyuBc08yFm9RWKqQvhDTfr54CuSqEbYN857cyGAi9zWiBNgM35ZfDZuwwqFhfq71Cw7KEnxI6l4%2B%2BdjT6DvUCagXjpAh%2Fuer%2FiwcgVcWHPloB4nrqksp3ggfECd%2B4OjxvKslAwfXZVuopUl3f8bYECOggodXQLO%2BQRqDYBMsE1HG6x1wZeW3in4nKxO5byhlR1RwjBQzgnMIFYRgoFEO9HQODPcgyWyDNFtsWLb2qb93BAQKhIuSLCenmMRzjpIJXQdtagiU4abzvPa1ikO49lEZQQl%2BF%2BR%2BtZIo5FxL%2FCHdsR4cssjJB7O1r5EaTGRjDOUAf0nAbXDcAooWQBXkBL7QUc0brCXa33au65SMve75ZLvMcsXwrVs3vZ45qEwEt8U8Zb4UexVDcJ8dotORp3mmRjqQ4CoP62RQEdClru9yqnAumUAGsZ%2B6ktZzhVt5yW9wbROyrougIE7Ru6yKN%2BuRzyn0aXt9vP42%2F0IslS12Md2laxIru03%2FfVh68mNB3lzVcT%2FJVET0FVckYKCZyEl4Su2fbknbLdkX%2BQbkAHf%2Fjdk0sALmrcoB8FC8VavB0i6WGKYJgsOp7Q8qUQ1RtBd0oyCbe%2FXS30ZrzHullL8V%2Bg2TPMfbmDAe%2FOYilQZiSKwNttPdueK56DZV88ztC8gKgLu%2FX3nKIi8XEFDEY9ssMW8rBKSCrJ6x%2F8cmiQkDUJWKebULg3GpgeEBmjGAs7e4Af%2BRNIr4bAaQg6IcD1Ogukb%2BZTDyuFXlM%2BbY09bD%2BfV8dkQtMld4brPhOvsnQSl1cQ%2BjRGYgCggfF9TsJ8rUWQaWhi2QlXUTd1S35uxJIol61%2BJzeYkBBRk0Z0UgUor8eQUkNrOSROU5ZHbDvUUDXV3aH6MnidiQC7HEJkQJrehSTWhnbIhMXVm4mrr3SLSpZ%2FH49isNPQ24uIs9Z%2Fpgm0hi9O%2FfX9nr2QXSBIyx9q0K%2FrVgrULEvCpu3%2BaE3IQhaWO14hcLsBagizg8x2ouqap4mIg1s6%2BRZnunHKV4ukJRxGDK%2Fz0UBcF2jYopWWZtPNupV%2BoggtA5TuLueet1nGuXlfL%2Bm8EoUwA%2BqSyafXzMQYwFrPRS6mbvJYgJdrULVMxX89sLk5f6B%2BA9Bx2bmjJr9UQ9%2Fm1UHa%2BIEL5NPHnyxtpngksCiK6ewHYRPf6bPPDLWIdU%2Bv7u7OLt1VcmV4s9trXQ11w02fRiwty7609XkxgOIpcTs6V7nCyjWYu7f6aa7dsw2j5PQ7SUh%2B8gOQRPZAORE5zY00EapS
IQTIICqLszfk7pkGEPSojpBNqe5ahTKFsWmHGjbJrxLkwyNJv%2F9cmotnuJ%2BorFVJN0%2FYgnmlNk1bpyoSLLSm0AylbsPAJyAOfxkJKilQ2ieNkDzW0ZEejKEe5Zieq9rx%2Fsaa169q1Sa1UQ0hSOQxSzo36iLaZJTUlNzkQ8mN%2BmYJ%2FwgvxFwEktGqdI5AWnyx985gyxkYY1vSBy3NkCXgJ22%2BtL3IaHa872htB19wk7%2Bz3O4Wiy2nr0iiolJEiPqrBLVwlHr1O74G9audM4rA9yX60bwEc%2Fa0ovDtac9KZrag8efLlPUzC0qhmxm8mqfXQJXlvL8x6VR9zlOFszFgAmCGwcmXhK4%3D&gtnp=0&gtpp=0&kbetu=1&maxads=0&kld=1042&yprpnd=UHM6ofc%2BmzTMdphcWy%2Bzzw%3D%3D&_opnslfp=1&prvtof=YxnOaFd0fqy%2FqegeV7DTyI03rUjdtEBPkt771s7v%2FoOGMM%2BJECey3BbwrKohaG5zXbGblipJzY0UOfjuaVUyP90ZeSNUDjLQjmu2QOE4eKXCHg1U1zwZs%2FliGHSy1Chd4iimRQUeK9BdEm%2B9ovcksg%3D%3D&bkt=14907&d_bkt=14907&skpc=1&&gtnp=0&gtpp=0&kt=362&&kbc=nys+film&ki=886661&ktd=0&kld=1042&kp=5
2022-11-27 13:14:23 [scrapy.extensions.logstats] INFO: Crawled 28542 pages (at 2126 pages/min), scraped 273 items (at 120 items/min)
2022-11-27 13:15:23 [scrapy.extensions.logstats] INFO: Crawled 28544 pages (at 2 pages/min), scraped 273 items (at 0 items/min)
2022-11-27 13:16:23 [scrapy.extensions.logstats] INFO: Crawled 28544 pages (at 0 pages/min), scraped 273 items (at 0 items/min)
2022-11-27 13:16:33 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2010_SmBizBudgetaryUpdate.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:33 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2010_SmBizBudgetaryUpdate.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/111611_RegulatoryChangesSmall%20BusinessesJuly2010June2011.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/111611_RegulatoryChangesSmall%20BusinessesJuly2010June2011.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2016RegulatoryChangesAffectingSmallBusinesses.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2016RegulatoryChangesAffectingSmallBusinesses.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2010RegulatoryUpdate.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2009RegulatoryUpdate.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2015_LegislativeChangesAffectingSmallBusinesses.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/111611_LegislativeChangesSmall%20Businesses2011LegislativeSessionv3.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2014LegislativeChangesAffectingSmallBusinesses.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2010RegulatoryUpdate.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2009RegulatoryUpdate.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2015_LegislativeChangesAffectingSmallBusinesses.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/111611_LegislativeChangesSmall%20Businesses2011LegislativeSessionv3.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:34 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2014LegislativeChangesAffectingSmallBusinesses.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:41 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2013_SmallBizLegislativeChanges.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:41 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2013_SmallBizLegislativeChanges.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2010LegislativeUpdate.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2010LegislativeUpdate.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2009LegislativeUpdate.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2009LegislativeUpdate.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2013_SmallBizRegulatoryChanges.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2015_RegulatoryChangesAffectingSmallBusinesses.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2014_RegulatoryChangesAffectingSmallBusinesses.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2014BudgetaryChangesAffectingSmallBusinesses.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/111611_BudgetaryChangesSmallBusinesses11_12.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2013_SmallBizRegulatoryChanges.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2015_RegulatoryChangesAffectingSmallBusinesses.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2014_RegulatoryChangesAffectingSmallBusinesses.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/2014BudgetaryChangesAffectingSmallBusinesses.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:42 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://cdn.esd.ny.gov/smallbusiness/Data/111611_BudgetaryChangesSmallBusinesses11_12.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:51 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://10.74.80.40/CorporateInformation/Data/ILNY_HospitalitySuiteReport2015.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:52 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://10.74.80.40/CorporateInformation/Data/ESD_ilovenyhospitalitysuiteUpdateFinalReport.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:52 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://10.74.80.40/CorporateInformation/Data/ILNY_HospitalitySuiteReport2015.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:16:52 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://10.74.80.40/CorporateInformation/Data/ESD_ilovenyhospitalitysuiteUpdateFinalReport.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:17:13 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://10.74.80.40/NewsRoom/Data/2012/SEEDprogressrelease_final.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:17:13 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://10.74.80.40/NewsRoom/Data/2012/SEEDprogressrelease_final.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:17:23 [scrapy.extensions.logstats] INFO: Crawled 28544 pages (at 0 pages/min), scraped 273 items (at 0 items/min)
2022-11-27 13:17:40 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD http://10.74.80.40/CorporateInformation/Data/ILNY_HospitalitySuiteReport2014.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:17:40 [scrapy.core.scraper] ERROR: Error downloading <HEAD http://10.74.80.40/CorporateInformation/Data/ILNY_HospitalitySuiteReport2014.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:17:41 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <HEAD https://www.usaniagara.com/pdfs/pressreleases/USANpressRel-03022017.pdf> (failed 3 times): TCP connection timed out: 60: Operation timed out.
2022-11-27 13:17:41 [scrapy.core.scraper] ERROR: Error downloading <HEAD https://www.usaniagara.com/pdfs/pressreleases/USANpressRel-03022017.pdf>
Traceback (most recent call last):
  File "/Users/afeld/Library/Caches/pypoetry/virtualenvs/esd-crawl-fJL88Oq4-py3.10/lib/python3.10/site-packages/scrapy/core/downloader/middleware.py", line 49, in process_request
    return (yield download_func(request=request, spider=spider))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
2022-11-27 13:17:41 [scrapy.core.engine] INFO: Closing spider (finished)
2022-11-27 13:17:41 [scrapy.extensions.feedexport] INFO: Stored csv feed (273 items) in: results/broken.csv
2022-11-27 13:17:41 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 66,
 'downloader/exception_type_count/builtins.ValueError': 3,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 63,
 'downloader/request_bytes': 11860613,
 'downloader/request_count': 32429,
 'downloader/request_method_count/GET': 6593,
 'downloader/request_method_count/HEAD': 25836,
 'downloader/response_bytes': 91093464,
 'downloader/response_count': 32363,
 'downloader/response_status_count/200': 28457,
 'downloader/response_status_count/301': 3773,
 'downloader/response_status_count/302': 46,
 'downloader/response_status_count/400': 1,
 'downloader/response_status_count/403': 42,
 'downloader/response_status_count/404': 44,
 'dupefilter/filtered': 339546,
 'elapsed_time_seconds': 917.719655,
 'feedexport/success_count/FileFeedStorage': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 11, 27, 19, 17, 41, 664509),
 'httpcache/firsthand': 15463,
 'httpcache/hit': 16886,
 'httpcache/invalidate': 13,
 'httpcache/miss': 15529,
 'httpcache/revalidate': 1,
 'httpcache/store': 15368,
 'httpcache/uncacheable': 108,
 'httpcompression/response_bytes': 308012346,
 'httpcompression/response_count': 6330,
 'item_scraped_count': 273,
 'log_count/ERROR': 45,
 'log_count/INFO': 32,
 'log_count/WARNING': 3,
 'memusage/max': 134021120,
 'memusage/startup': 65880064,
 'offsite/domains': 1204,
 'offsite/filtered': 36966,
 'request_depth_max': 57,
 'response_received_count': 28544,
 'retry/count': 42,
 'retry/max_reached': 21,
 'retry/reason_count/twisted.internet.error.TCPTimedOutError': 42,
 'scheduler/dequeued': 32429,
 'scheduler/dequeued/memory': 32429,
 'scheduler/enqueued': 32429,
 'scheduler/enqueued/memory': 32429,
 'start_time': datetime.datetime(2022, 11, 27, 19, 2, 23, 944854),
 'urllength/request_ignored_count': 6}
2022-11-27 13:17:41 [scrapy.core.engine] INFO: Spider closed (finished)
scrapy runspider esd_crawl/spiders/broken.py -L INFO -O results/broken.csv  190.29s user 16.74s system 22% cpu 15:18.24 total
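
The timeouts in the log above cluster on a small number of hosts. A minimal sketch, assuming the log has been saved to a text file, of tallying the "Gave up retrying" lines per host. The function name `count_failed_hosts` and the regular expression are illustrative and not part of the crawler.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches retry-middleware lines such as:
#   ... ERROR: Gave up retrying <HEAD http://host/path.pdf> (failed 3 times): ...
GAVE_UP = re.compile(r"Gave up retrying <(?:GET|HEAD) (\S+)>")


def count_failed_hosts(log_lines):
    """Count the requests Scrapy gave up on, grouped by host name."""
    hosts = Counter()
    for line in log_lines:
        match = GAVE_UP.search(line)
        if match:
            hosts[urlsplit(match.group(1)).hostname] += 1
    return hosts
```

In the excerpt above, the hosts are `cdn.esd.ny.gov`, `10.74.80.40` (an address in the private 10.0.0.0/8 range, so it is not reachable from outside that network), and `www.usaniagara.com`.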
afeld commented 1 year ago

Script to check that the new results contain all links from the previous results:

from csv import DictReader

# URLs reported as broken in the previous results.
urls = set()

with open(
    "/Users/afeld/Downloads/broken links - crawler findings - PDFs - 2022-11-17.csv"
) as f:
    reader = DictReader(f)
    for row in reader:
        urls.add(row["url"])

# Drop every URL that also appears in the new crawl's output.
with open("results/broken.csv") as f:
    reader = DictReader(f)
    for row in reader:
        urls.discard(row["url"])

# Whatever is left was in the previous results but is missing from the new ones.
print(urls)
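
The same check can also be written as a set difference, which makes it easier to reuse for other result files. A minimal sketch, assuming both CSVs have a `url` column as in the script above. The helper names `read_urls` and `missing_urls` are illustrative.

```python
from csv import DictReader


def read_urls(path):
    """Return the set of values in the "url" column of a CSV file."""
    with open(path, newline="") as f:
        return {row["url"] for row in DictReader(f)}


def missing_urls(previous_path, current_path):
    """Return URLs present in the previous results but absent from the current ones."""
    return read_urls(previous_path) - read_urls(current_path)
```

An empty set from `missing_urls` means every previously reported link also appears in the new results.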