reinventalbany / esd-crawl

Web crawler to find data on Empire State Development site
MIT License

exclude application forms #27

Closed: afeld closed 2 years ago

afeld commented 2 years ago

The site has PDFs that are application forms, e.g.

These should be skipped, or perhaps marked with Keep=No.

afeld commented 2 years ago

Created a View in Airtable that filters to PDF URLs containing "application". It's quick to scan through those and mark them as Keep=No.

cc https://github.com/reinventalbany/esd-crawl/issues/24
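
For anyone who would rather pre-flag these outside Airtable, the filter described above could be approximated in code. This is only a rough sketch, not what the crawler does: the function names, the `url` and `Keep` dictionary keys, and the example.com URLs are all hypothetical. It only marks candidates, so the manual scan is still needed.

```python
from urllib.parse import urlparse


def is_probable_application_form(url: str, keyword: str = "application") -> bool:
    """Return True for PDF URLs whose text contains the keyword (case-insensitive).

    Mirrors the Airtable view's rule; it is a heuristic, not a classifier.
    """
    path = urlparse(url).path.lower()
    return path.endswith(".pdf") and keyword.lower() in url.lower()


def flag_candidates(urls: list[str]) -> list[dict[str, str]]:
    """Build review rows for matching URLs, pre-filled with Keep=No.

    The "url" and "Keep" keys are assumed names for illustration.
    """
    return [
        {"url": url, "Keep": "No"}
        for url in urls
        if is_probable_application_form(url)
    ]


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    sample = [
        "https://example.com/files/Grant-Application-Form.pdf",
        "https://example.com/reports/annual-report.pdf",
        "https://example.com/application/index.html",
    ]
    for row in flag_candidates(sample):
        print(row)
```

A keyword match will also catch PDFs that merely mention "application" in the filename, so the result is best treated as a shortlist for review rather than an automatic exclusion.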